Google Summer of Code/2020/Ideas/Big Data Infrastructure by Gentoo

Big Data Infrastructure by Gentoo

Big data infrastructure is mostly built on the Java virtual machine (JVM) ecosystem, most notably in Java and Scala.

Nevertheless, Java has not been adopted smoothly into GNU/Linux distributions. Packaging Java software is considered difficult by GNU/Linux distribution communities (e.g. Debian, Arch Linux, Fedora). At the same time, the Java community has its own repositories, such as Maven, that are functionally similar to the package repositories of GNU/Linux distributions.

The Gentoo Java Project has done a good job laying out the framework of the Java ecosystem in Gentoo. Even so, thousands of useful Java packages remain to be packaged and maintained. This project will parse the metadata of Maven packages and automatically write ebuilds compatible with the Java build system used in Gentoo. We will set up and maintain an automatically updated Maven overlay that every Gentoo user can use. The overlay will contain at least Spark and Hadoop. We aim to make Gentoo an attractive choice for data scientists, Java developers, users, and big data system administrators.
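To make the idea of turning Maven metadata into ebuilds more concrete, here is a minimal sketch in Python. It is not how java-ebuilder works internally. The sample POM, the function names `parse_pom` and `render_ebuild`, and the naive mapping of a Maven `artifactId` to a `dev-java/<artifactId>` atom are all assumptions made for illustration. The generated text is a heavily simplified skeleton: a real ebuild would also need fields such as SRC_URI, LICENSE, KEYWORDS, and JDK/JRE dependencies.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven POM 4.0.0 documents.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical POM used only for this illustration.
SAMPLE_POM = """\
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo-lib</artifactId>
  <version>1.2.3</version>
  <description>Demo library</description>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>28.2-jre</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
"""


def _text(node, tag, default=""):
    """Return the stripped text of a direct child element, or a default."""
    found = node.find("m:" + tag, POM_NS)
    if found is None or found.text is None:
        return default
    return found.text.strip()


def parse_pom(pom_text):
    """Extract coordinates, description, and direct dependencies from a POM."""
    root = ET.fromstring(pom_text)
    dependencies = [
        {
            "groupId": _text(dep, "groupId"),
            "artifactId": _text(dep, "artifactId"),
            "version": _text(dep, "version"),
            # Maven treats a missing <scope> as "compile".
            "scope": _text(dep, "scope", "compile"),
        }
        for dep in root.findall("m:dependencies/m:dependency", POM_NS)
    ]
    return {
        "groupId": _text(root, "groupId"),
        "artifactId": _text(root, "artifactId"),
        "version": _text(root, "version"),
        "description": _text(root, "description"),
        "dependencies": dependencies,
    }


def render_ebuild(meta):
    """Render a simplified ebuild skeleton (assumed layout, for illustration)."""
    # Keep only dependencies needed at run time; skip test-scoped ones.
    runtime = [d for d in meta["dependencies"] if d["scope"] in ("compile", "runtime")]
    # Assumption: every artifact lives in dev-java/ under its artifactId.
    atoms = "".join("\n\tdev-java/{}".format(d["artifactId"]) for d in runtime)
    lines = [
        "# Generated from {groupId}:{artifactId}:{version}".format(**meta),
        "EAPI=7",
        "",
        "inherit java-pkg-2 java-pkg-simple",
        "",
        'DESCRIPTION="{}"'.format(meta["description"]),
        'SLOT="0"',
        "",
        'RDEPEND="{}"'.format(atoms),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ebuild(parse_pom(SAMPLE_POM)))
```

The sketch reads only the direct dependencies declared in a single POM. Resolving parent POMs, transitive dependencies, and version ranges is where most of the real work in such a tool would lie.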

A preliminary tool, java-ebuilder, is available as app-portage/java-ebuilder. A proof-of-concept overlay has also been created.

Contacts	Required Skills
Benda Xu	Bash scripting and experience writing ebuilds. Basic Java and familiarity with Maven. Experience using Java on Gentoo. Basic system administration: rsyncd, web, and git server setup.