Google Summer of Code/2020/Ideas/Big Data Infrastructure by Gentoo
Big Data Infrastructure by Gentoo
Big data infrastructure is mostly built on the Java virtual machine (JVM) ecosystem, most notably in Java and Scala.
Nevertheless, Java has not been adopted smoothly into GNU/Linux distributions. Packaging Java software is considered difficult by the GNU/Linux community (e.g. Debian, Arch Linux, Fedora). At the same time, the Java community has its own set of repositories, such as Maven, which are functionally similar to the package repositories of GNU/Linux distributions.
The Gentoo Java Project has done a good job laying out the framework of the Java ecosystem in Gentoo, but thousands of useful Java packages still need to be packaged and maintained. This project will parse the metadata of Maven packages and automatically write ebuilds compatible with the Java build system used in Gentoo. We are going to set up and maintain an automatically updated Maven overlay that every Gentoo user can use. The overlay will contain at least Spark and Hadoop. We aim to make Gentoo an attractive choice for data scientists, Java developers, users, and big data system administrators.
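To make the "parse Maven metadata, write an ebuild" idea concrete, here is a minimal Python sketch. It is not how java-ebuilder is implemented. It only illustrates the general shape of the task under several assumptions: the sample POM (`org.example:demo-lib`) is made up, the `app-maven` category and the one-to-one mapping from Maven artifactId to Gentoo package name are hypothetical, and the generated text is a skeleton rather than a complete, installable ebuild. The `java-pkg-2` and `java-pkg-simple` eclasses and the Maven Central URL layout are real.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Maven POM 4.0.0 files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical POM for illustration; org.example:demo-lib is not a real artifact.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo-lib</artifactId>
  <version>1.2.3</version>
  <description>Demo library</description>
  <url>https://example.org/demo-lib</url>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>28.2-jre</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
"""


def _text(node, tag, default=""):
    """Return the stripped text of a direct child element, or a default."""
    value = node.findtext("m:" + tag, default=default, namespaces=POM_NS)
    return (value or default).strip()


def parse_pom(pom_text):
    """Extract coordinates and direct dependencies from a POM string."""
    root = ET.fromstring(pom_text.strip())
    deps = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        deps.append({
            "group_id": _text(dep, "groupId"),
            "artifact_id": _text(dep, "artifactId"),
            "version": _text(dep, "version"),
            # Maven treats a missing <scope> as "compile".
            "scope": _text(dep, "scope", "compile"),
        })
    return {
        "group_id": _text(root, "groupId"),
        "artifact_id": _text(root, "artifactId"),
        "version": _text(root, "version"),
        "description": _text(root, "description"),
        "url": _text(root, "url"),
        "dependencies": deps,
    }


def render_ebuild(meta, category="app-maven"):
    """Render an illustrative ebuild skeleton from parsed POM metadata.

    Assumption: each Maven artifactId maps directly to a package name in
    the given (hypothetical) category. Real tooling needs a proper mapping.
    """
    src_uri = "https://repo1.maven.org/maven2/{}/{}/{}/{}-{}-sources.jar".format(
        meta["group_id"].replace(".", "/"),
        meta["artifact_id"], meta["version"],
        meta["artifact_id"], meta["version"],
    )
    # Only compile/runtime scoped dependencies are needed by the package itself.
    needed = [
        category + "/" + d["artifact_id"]
        for d in meta["dependencies"]
        if d["scope"] in ("compile", "runtime")
    ]
    lines = [
        "# Illustrative skeleton, not actual java-ebuilder output",
        "EAPI=7",
        "",
        "inherit java-pkg-2 java-pkg-simple",
        "",
        'DESCRIPTION="{}"'.format(meta["description"]),
        'HOMEPAGE="{}"'.format(meta["url"]),
        'SRC_URI="{}"'.format(src_uri),
        "# LICENSE is left empty here; it must be filled in from the POM or by hand.",
        'LICENSE=""',
        'SLOT="0"',
        'KEYWORDS="~amd64"',
        "",
        'CDEPEND="{}"'.format(" ".join(needed)),
        'DEPEND="${CDEPEND} virtual/jdk"',
        'RDEPEND="${CDEPEND} virtual/jre"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ebuild(parse_pom(SAMPLE_POM)))
```

The sketch deliberately ignores the hard parts that a real tool has to handle, such as parent POMs, property substitution, dependency version ranges, and translating Maven version strings (for example `28.2-jre`) into valid Gentoo versions.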
A preliminary tool, java-ebuilder, is available as app-portage/java-ebuilder. A proof-of-concept overlay has also been created.