Google Summer of Code/2022/Ideas/Java Big Data Infrastructure Improvements and Maintenance
The Spark overlay is an ebuild repository for JVM-based big data infrastructure systems. Currently, it lets users easily install Apache Spark and the H2O machine learning platform on a Gentoo system via Portage. It is also the home of the first set of Kotlin library ebuilds that are built from source, and of the Kotlin eclasses that allow more ebuilds for third-party Kotlin packages (e.g. okio, clikt) to be created.
The Spark overlay has featured in two previous GSoCs (2020, 2021) and is still actively maintained. It has gone through a massive update of packages for Java 11 after Java 11 was made available to users under a stable keyword (bug #810613), a repository-wide migration to Log4j >=2.17.1 after that version was added to the official Gentoo ebuild repository (bug #830910), and several additional security updates to packages, including Jetty 9.4.44, Jersey 2.35, and Jackson 2.13.0. These maintenance efforts aim to bring the quality of packages in the Spark overlay as close as possible to that of Java packages in the Gentoo repository.
Despite continuous maintenance, the Spark overlay could still use some improvements that the current maintainer has not made due to his limited availability. The list below might look overwhelming, but you are more than welcome to plan only a subset of the tasks in your project proposal, as long as the amount of work involved reasonably matches the length of the GSoC program.
- The Apache Spark version shipped in the Spark overlay should be updated. Upstream released version 3.2.1 in January 2022, whereas the Spark overlay currently provides 3.0.0-preview2, which is a pre-release version.
- Some packages in the Spark overlay are still at vulnerable versions and should be updated to patched versions. Affected packages include Hadoop 2.7.4, Netty 4.1.42, and possibly more.
- More H2O extensions should be added to the Spark overlay. Currently, packages for Algos and TargetEncoder extensions are offered. Some other key extensions that are not shipped in the Spark overlay yet include XGBoost and AutoML.
- The aforementioned Kotlin ecosystem for Gentoo has some potential areas of improvement, which have been documented in Kotlin/Open Challenges and Room for Improvement.
- Resolve some other issues in the Spark overlay's issue tracker.
- The Spark overlay currently does not have a reliable mechanism to report security issues in its packages. The infamous Log4j 2 vulnerability disclosed in December 2021 has drawn attention from both software developers and non-professional users to the security of Java packages. While critical vulnerabilities in vital JVM libraries like Log4j are usually noticed quickly by Spark overlay maintainers thanks to wide news coverage, critical vulnerabilities in less commonly used packages might not receive the maintainers' attention in time. This has caused unacceptable delays in delivering the Jetty 9.4.44, Jersey 2.35, and Jackson 2.13.0 security updates.
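As a rough illustration of what such a mechanism could check, the sketch below compares a snapshot of package versions against the minimum patched versions mentioned in this idea. It is only a sketch under stated assumptions: the package labels are plain names rather than real Portage atoms, the `snapshot` data is hypothetical, and the comparison handles purely numeric dotted versions, not full Portage version syntax (suffixes, revisions). A real solution would also need an actual source of vulnerability data, which is not shown here.

```python
from typing import Dict, List, Tuple

# Minimum patched versions taken from the updates named in this idea.
# The keys are illustrative labels, not real Portage package atoms.
MIN_PATCHED: Dict[str, str] = {
    "log4j": "2.17.1",
    "jetty": "9.4.44",
    "jersey": "2.35",
    "jackson": "2.13.0",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '9.4.44' into (9, 4, 44). Only numeric dotted versions are handled."""
    return tuple(int(part) for part in version.split("."))


def is_older(version: str, required: str) -> bool:
    """Return True if `version` is strictly older than `required`."""
    a, b = parse_version(version), parse_version(required)
    # Pad with zeros so that '2.35' and '2.35.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def find_outdated(installed: Dict[str, str],
                  min_patched: Dict[str, str] = MIN_PATCHED) -> List[str]:
    """List packages whose version is below the known minimum patched version."""
    outdated = []
    for name, version in sorted(installed.items()):
        required = min_patched.get(name)
        if required is not None and is_older(version, required):
            outdated.append(f"{name}-{version} (needs >= {required})")
    return outdated


if __name__ == "__main__":
    # Hypothetical snapshot of versions; not the overlay's real state.
    snapshot = {"log4j": "2.14.1", "jetty": "9.4.44", "jersey": "2.30"}
    for line in find_outdated(snapshot):
        print(line)
```

Packages that do not appear in the minimum-version table are silently skipped, which is exactly the gap described above: a check like this is only as good as the vulnerability data fed into it.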
|Expected Project Size|175 hours or 350 hours, depending on what tasks are planned|
|Expected Outcomes|Any subset of the task items listed above, as long as it matches the project size|
|Difficulty|Medium to hard, depending on what tasks are planned|