Google Summer of Code/2022/Ideas/Java Big Data Infrastructure Improvements and Maintenance

From Gentoo Wiki

Java Big Data Infrastructure Improvements and Maintenance

The Spark overlay is an ebuild repository for JVM-based big data infrastructure systems. Currently, it lets users easily install Apache Spark and the H2O machine learning platform on a Gentoo system via Portage. It is also the home of the first set of Kotlin library ebuilds that are built from source, as well as the Kotlin eclasses that allow more ebuilds for third-party Kotlin packages (e.g. okio, clikt) to be created.

The Spark overlay has featured in two previous GSoCs (2020, 2021) and is still actively maintained. Since then, it has gone through a massive update of packages for Java 11 after Java 11 had been enabled for users on stable keywords (bug #810613), a repository-wide migration to Log4j >=2.17.1 after that version had been added to the official Gentoo ebuild repository (bug #830910), and several additional security updates to packages, including Jetty 9.4.44, Jersey 2.35, and Jackson 2.13.0. These maintenance efforts aim to bring the quality of packages in the Spark overlay as close as possible to that of the Java packages in the Gentoo repository.

Despite continuous maintenance, the Spark overlay could still use some improvements that the current maintainer has not made due to his limited availability. The list below might look overwhelming, but you are more than welcome to plan only a subset of the tasks in your project proposal, as long as the amount of work involved reasonably matches the length of the GSoC program.

  • The Apache Spark version shipped in the Spark overlay should be updated. Upstream released version 3.2.1 in January 2022, whereas the Spark overlay currently provides 3.0.0-preview2, which is a pre-release version.
  • Some packages in the Spark overlay are still at vulnerable versions and should be updated to patched versions. Affected packages include Hadoop 2.7.4, Netty 4.1.42, and possibly more.
  • More H2O extensions should be added to the Spark overlay. Currently, packages for Algos and TargetEncoder extensions are offered. Some other key extensions that are not shipped in the Spark overlay yet include XGBoost and AutoML.
  • The aforementioned Kotlin ecosystem for Gentoo has some potential areas of improvement, which have been documented in Kotlin/Open Challenges and Room for Improvement.
  • Some other issues in the Spark overlay's issue tracker should be resolved.
  • The Spark overlay currently does not have a reliable mechanism for reporting security issues in its packages. The infamous Log4j 2 vulnerability disclosed in December 2021 has drawn the attention of both software developers and non-professional users to the security of Java packages. Critical vulnerabilities in vital JVM libraries like Log4j are usually easy for Spark overlay maintainers to notice thanks to wide news coverage, but critical vulnerabilities in less commonly used packages might not come to the maintainers' attention in time. This caused an unacceptable delay in delivering the Jetty 9.4.44, Jersey 2.35, and Jackson 2.13.0 security updates.
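
As a rough illustration of the last item, the core of such a reporting mechanism could be a check that compares the versions shipped in the overlay against the minimum versions known to contain a fix. The sketch below is only one possible starting point, not a proposed design. The function names, the package keys, and the "shipped" versions are hypothetical; only the minimum versions are taken from the updates mentioned on this page. It also assumes plain dotted numeric versions, so it does not handle Gentoo suffixes such as _pre or -r1, or upstream qualifiers such as -preview2.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '9.4.44' into (9, 4, 44)."""
    return tuple(int(part) for part in version.split("."))


def is_older(shipped: str, minimum: str) -> bool:
    """Return True if `shipped` is strictly older than `minimum`."""
    a, b = parse_version(shipped), parse_version(minimum)
    # Pad the shorter tuple with zeros so that 2.35 and 2.35.0 compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def find_outdated(shipped: Dict[str, str], minimum_safe: Dict[str, str]) -> List[str]:
    """List packages whose shipped version is below the known-safe minimum."""
    return sorted(
        name
        for name, version in shipped.items()
        if name in minimum_safe and is_older(version, minimum_safe[name])
    )


# Minimum versions taken from the security updates mentioned on this page.
MINIMUM_SAFE = {
    "log4j": "2.17.1",
    "jetty": "9.4.44",
    "jersey": "2.35",
    "jackson": "2.13.0",
}

# Hypothetical overlay state, made up purely for illustration.
SHIPPED = {
    "log4j": "2.17.1",
    "jetty": "9.4.43",
    "jersey": "2.35",
    "jackson": "2.12.3",
}

if __name__ == "__main__":
    for name in find_outdated(SHIPPED, MINIMUM_SAFE):
        print(f"{name}: {SHIPPED[name]} is older than {MINIMUM_SAFE[name]}")
```

A real system would additionally need a trustworthy source for the minimum safe versions and a way to notify maintainers; both are left open here on purpose.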

Contacts

Required Skills
  • Experience with at least one popular Java build automation tool, such as Maven or Gradle
  • Non-trivial knowledge of the Java compilation, linking, and loading process, including how to invoke javac manually to compile Java source files without a build automation tool, and how tools like Maven and Gradle may invoke javac to build a project
  • Non-trivial experience in using Gentoo as a daily driver or in a mission-critical workflow
  • Bash (for working with ebuilds, eclasses, and some scripts used in automated processes)
  • Experience in ebuild writing
  • Non-trivial Git skills, which at least include proficiently using git rebase, and knowing how to keep the commit history linear (i.e. without any merge commits)
  • A commitment to eliminating and avoiding technical debt to the maximum possible extent
Expected Project Size

175 hours or 350 hours, depending on what tasks are planned

Expected Outcomes

Any subset of the following items, as long as it matches the project size:

  • Updated Gentoo packages for the latest release of Apache Spark
  • Updated Gentoo packages for the latest releases of Apache Hadoop and Netty
  • Gentoo packages for H2O XGBoost and AutoML extensions
  • Resolutions to issues in Kotlin/Open Challenges and Room for Improvement or Spark overlay issue tracker
  • A system that reports security vulnerabilities of Gentoo packages in the Spark overlay to its maintainers in time
Project Difficulty

Medium to hard, depending on what tasks are planned