Google Summer of Code/2012/Ideas
Want to spend your summer contributing full-time to Gentoo, and get paid for it? Gentoo is in its 7th year in the Google Summer of Code. In the past, most of our successful students have become Gentoo developers, so your chances of becoming one are very good if you're accepted into this program.
Most ideas listed here have a contact person associated with them. Please get in touch with them earlier rather than later to develop your idea into a complete application. You can find many of them on Freenode's IRC network under the same username. If there is no contact information, please join the gentoo-soc mailing list or #gentoo-soc on the Freenode IRC network, and we will work with you to find a mentor and discuss your idea.
You don't have to apply for one of these ideas! You can come up with your own, and as long as it fits into Gentoo, we'll be happy to work with you to develop it. Remember, your project needs to have deliverables in less than 3 months of work in most cases. Be ambitious but not too ambitious ;)
Read this first
We have a custom application template that we will ask you to fill out. Here it is:
- 1 Read this first
- 2 Ideas
- 2.1 Cross Container Support
- 2.2 Dynamic documentation type generation
- 2.3 Ebuild Upstream Scanner
- 2.4 gentoo-x86 QA website
- 2.5 Gentoo@home
- 2.6 Improved binary package support
- 2.7 libbash
- 2.8 lmonade relocation and binary packages
- 2.9 OpenRC Extensions
- 2.10 Port Fedora UEFI Support
- 2.11 Porthole plug-ins and extensions
- 2.12 Recruiting Webapp Usability
- 2.13 Repoman cleanup
- 2.14 Repository of self-contained ebuild source packages
- 2.15 SELinux policy originator
- 2.16 Support for Fortran modules and libraries with multiple compilers
- 2.17 Cache sync
- 2.18 Package statistics reporting tool
- 2.19 Tags support for Portage
- 2.20 Automatically generated overlay of R packages
Ideas
Cross Container Support
It is already possible to create a fully working chroot using qemu-user and build packages quickly through it. The natural next step is to make it work like a normal container, with a similar interface to manage it. This way, builds for ARM targets can be done on faster systems, which also sidesteps issues such as Python and Perl not supporting proper cross-compilation, or widespread build systems such as waf and CMake failing completely at the task.
The project aims to provide init scripts, management tools and canned recipes to generate container-like chroots, to let developers easily import and export system images, and to integrate all of this further with crossdev.
An additional task is to support layered systems so native userspace can be used to further speed up the process (hybrid chroot).
- Knowledge of Qemu and lxc
- Good knowledge of the runtime linker
- Understanding of portage
Dynamic documentation type generation
The official Gentoo documentation is currently offered through a self-maintained XML format called GuideXML. Although XML is considered a powerful language for generating other formats, we currently limit ourselves to HTML output only. We have frequent requests for PDF and, more recently, ePub.
The purpose would be to update our infrastructure to generate multiple output formats, including ePub, for our users.
- Django CMS
Ebuild Upstream Scanner
euscan (Ebuild Upstream Scanner) is a utility that checks whether an ebuild has a new upstream version. It was designed to provide the same features as Debian's uscan and DEHS (http://dehs.alioth.debian.org/). euscan also has a web interface that aggregates the results of euscan runs on all Gentoo ebuilds (currently hosted at http://euscan.iksaif.net but will move to gentoo.org someday). There are a lot of things to do on euscan; here are some examples:
Currently euscan uses shell scripts and GNU parallel to scan the portage tree. Using Celery (a Python task queue that integrates well with Django) would allow a more flexible scan process. Scanning the tree or a package could be done with a simple Django command starting a Celery task, and the web interface could rescan old packages on demand too. While adding Celery commands, it would be great to add simple commands to create the local filesystem needed by euscan, add/sync/remove an overlay, etc. Celery is not really packaged for Gentoo, but I have some ebuilds in my overlay, so another side of the project could be to push them to the portage tree.
Version Detection Enhancements
euscan currently uses multiple heuristics to determine the upstream version, but there is a lot to do to remove false positives and to be able to scan more packages:
- create a script to gather statistics on euscan successes and failures and on the methods used.
- use metadata.xml's <upstream><remote-id> tag with the appropriate euscan site handler (pypi, pear, cpan, etc.) and make sure that all packages have this tag correctly set in the portage tree
- create a new metadata tag <upstream><watch> that works like debian/watch and make euscan use it. Create a script to import information from the associated debian/watch file.
- steal ideas from other tools (uscan, portscout) and use other sources of data (youri, distrowatch, distromatch, whohas; Equivalent-Packages)
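As an illustration of the kind of heuristic discussed above, here is a minimal sketch that picks newer versions out of an upstream directory listing. It is not euscan's actual code: the function names, the tarball naming pattern and the sample listing are assumptions made for the example, and real version schemes (suffixes such as _rc1 or -r2) would need more careful parsing.

```python
import re


def parse_version(text):
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_newer_versions(package, current, listing):
    """Return versions of `package` in `listing` that are newer than `current`.

    Only plain dotted numeric versions in '<package>-<version>.tar.*'
    file names are recognised (an assumption of this sketch).
    """
    pattern = re.compile(
        re.escape(package) + r"-(\d+(?:\.\d+)*)\.tar\.(?:gz|bz2|xz)"
    )
    found = {match.group(1) for match in pattern.finditer(listing)}
    newer = [v for v in found if parse_version(v) > parse_version(current)]
    return sorted(newer, key=parse_version)


# Hypothetical upstream listing; 'foo-doc' must not be mistaken for 'foo'.
listing = "foo-1.2.tar.gz foo-1.10.tar.bz2 foo-1.9.1.tar.xz foo-doc-2.0.tar.gz"
print(find_newer_versions("foo", "1.9", listing))
```

Comparing versions as integer tuples rather than strings is what keeps 1.10 sorted after 1.9, a common source of false positives in naive scanners.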
Euscan Web enhancements
Add an account system, and allow maintainers to register and subscribe to automatic weekly notifications. Integrate euscan with other webapps like http://portage.gentoo.org, http://znurt.org/, http://gpo.zugaina.org/ and Gentoostats, and try to create a single killer webapp. Add better overlay support and tweak the Django administration.
- Shell (optional)
- Celery (optional)
- HTML/CSS/JS (optional)
gentoo-x86 QA website
The idea is simple enough: take the QA results from various tools and present them via a searchable website. Think packages.gentoo.org, just for QA results. The implementation work required would primarily be building the website itself. The candidate could rely upon pkgcore-checks for the initial data stream (it can output its results as a pickle stream), which leaves the candidate free to focus on generating a site that provides insight into the status of current architectures, current stabilization, etc.
One additional constraint is that the underlying DB schema should be written in a fashion that allows multiple data imports to be used. While pkgcore-checks can provide data for a candidate to work with right now, the candidate should design a system that is also able to pull in other data sources (at some point repoman, for example).
Finally, an additional feature could be designing the underlying schema and website so that they can handle multiple repositories. Think about what would happen if the GNOME herd wanted their overlay to be scanned and accessible. This complicates the design a bit (specifically keeping it fast), but is likely to be desired functionality down the line.
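One possible shape for such a schema, sketched with Python's built-in sqlite3 module: results are keyed by both the repository they belong to and the tool that produced them, so a second data source or an overlay is just another row. The table and column names, and the helper functions, are assumptions for illustration and not an agreed design.

```python
import sqlite3

# Hypothetical schema: one row per QA finding, tagged with its repository
# and with the tool (data source) that reported it.
SCHEMA = """
CREATE TABLE repository (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE source     (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE result (
    id            INTEGER PRIMARY KEY,
    repository_id INTEGER NOT NULL REFERENCES repository(id),
    source_id     INTEGER NOT NULL REFERENCES source(id),
    package       TEXT NOT NULL,
    version       TEXT NOT NULL,
    check_name    TEXT NOT NULL,
    message       TEXT NOT NULL
);
CREATE INDEX result_by_package ON result (repository_id, package);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def _named_id(db, table, name):
    """Return the id of `name` in a lookup table, inserting it if needed."""
    assert table in ("repository", "source")  # never built from user input
    db.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = db.execute(f"SELECT id FROM {table} WHERE name = ?", (name,))
    return row.fetchone()[0]


def add_result(db, repository, source, package, version, check_name, message):
    """Import one finding from any tool into the shared result table."""
    db.execute(
        "INSERT INTO result (repository_id, source_id, package, version,"
        " check_name, message) VALUES (?, ?, ?, ?, ?, ?)",
        (_named_id(db, "repository", repository),
         _named_id(db, "source", source),
         package, version, check_name, message),
    )


def results_for(db, repository, package):
    """List findings for one package in one repository, oldest first."""
    return db.execute(
        "SELECT source.name, result.version, result.check_name, result.message"
        " FROM result"
        " JOIN repository ON repository.id = result.repository_id"
        " JOIN source ON source.id = result.source_id"
        " WHERE repository.name = ? AND result.package = ?"
        " ORDER BY result.id",
        (repository, package),
    ).fetchall()
```

The index on (repository_id, package) reflects the most likely query of a packages.gentoo.org-style site; whether it is enough to keep a multi-repository site fast would need measuring on real data.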
The relevant gentoo-soc discussion (with a bit more detail) is accessible in the gentoo-soc archives.
- python (data importation)
- web frameworks; this includes some HTML/CSS/JS knowledge, depending on how complex the candidate wishes to make the site
- RDBMS schema design
Gentoo@home
Gentoo packages are stabilized and architecture tested (keyworded) following a request coming from a package maintainer or a Gentoo developer. The Gentoo architecture teams then test by hand every requested package and its dependencies. For a very large fraction of Gentoo packages, this tedious manual process could be batch automated given some specifications. The idea is to build a framework for such a project, which would help users by providing a much better stabilization process and help maintainers by lightening the workflow and finding critical bugs. One idea would be to use an automated build of a tinderbox and distribute the tinderbox with stabilization scripts to the Gentoo user community, to install and test packages automatically. It could be done via volunteer computing from the Gentoo user community. Since it could turn into an involved project, it could be split into three projects.
- Sebastien Fabbro
- [Add yourself as a possible mentor]
- distributed computing
Improved binary package support
Several improvements would make Gentoo better for derived binary distros. One of them is more intelligent handling of library versions with binpkgs (and installed packages, which are a form of binpkg). For example, it's possible to build a binpkg against an old version of a library, then install it against a new version and have it be broken by default because of a shared-library version bump. It's also possible to break reverse ABI dependencies when upgrading a package, and there is currently no convenient way for package managers to detect such breakage in advance. Ideally, a package would have a way to specify its ABI dependencies in the built state instead of just which versions it can build against from source. It is possible to create an ABI dependency abstraction that is flexible enough to cover all possible kinds of ABI dependencies. Using an ABI abstraction, it will not matter whether or not there exists a specific soname to be referenced by dependencies. See bug #192319.
Another problem is saving binpkgs with different USE flags and other build settings on the same host. See bug #150031. The way forward is one or more hashes of the metadata. A third problem is the lack of binpkg support for the kernel. This could be changed by modifying the kernel eclass to support a binary USE flag that also does configuration and build, or perhaps by some kind of genkernel modification, or both. See bug #154495.
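The metadata-hash idea for storing several builds of the same package side by side could look roughly like the sketch below. The function names, the choice of SHA-256, the 12-character truncation and the file naming are all assumptions for illustration; Portage defines none of them here.

```python
import hashlib


def build_id(cpv, use_flags, settings):
    """Hash the build-relevant metadata into a short, stable identifier.

    Flags and settings are sorted first so that the same configuration
    always yields the same identifier regardless of input order.
    """
    lines = [cpv, "USE=" + " ".join(sorted(use_flags))]
    lines += [f"{key}={settings[key]}" for key in sorted(settings)]
    digest = hashlib.sha256("\n".join(lines).encode("utf-8"))
    return digest.hexdigest()[:12]


def binpkg_filename(cpv, use_flags, settings):
    """Hypothetical naming scheme: '<package>-<version>-<build id>.tbz2'."""
    name = cpv.split("/", 1)[1]
    return f"{name}-{build_id(cpv, use_flags, settings)}.tbz2"


settings = {"CHOST": "x86_64-pc-linux-gnu", "CFLAGS": "-O2 -pipe"}
print(binpkg_filename("app-editors/vim-7.3", ["X", "python"], settings))
print(binpkg_filename("app-editors/vim-7.3", ["X"], settings))
```

With a scheme like this, two builds of the same version with different USE flags get different file names and can coexist in one package directory.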
Two other minor problems:
- Compilation-related messages are shown to the user when installing a binary package. This is a bad developer habit and should be avoided somehow.
- It would be nice to have elog output stored in /var/db for later consumption and perhaps have it embedded in the xpak metadata. This would improve PackageKit support, which doesn't allow any output from package phases during install.
- Shell scripting
libbash
For the last two summers we have been developing a shared library for bash. We got pretty close last summer by parsing many ebuilds but not 100% due to unexpected problems with the grammar. Hopefully this will be the summer we finally nail the goal.
lmonade relocation and binary packages
The lmonade project is a distribution of mathematical software based on Gentoo Prefix. It can be installed without administrative rights on different Linux distributions or OS X. The main aim is to make scientific software available across platforms.
For a user not familiar with Unix and the shell, the quickest way to get a working copy of some software is to download a binary. However, binary distribution for all the different flavors of GNU/Linux and OS X is too much of a burden for most software developers. A solution is provided by Gentoo Prefix and Portage's binary package feature. Gentoo Prefix can handle basic rewriting of absolute paths encoded in binaries present in these packages, though there are still quite a few rough edges that require attention before this can be used in production.
The goal of this project is to bring all these pieces together to form an easy-to-use solution that will be useful to many software projects. The focus is on making things "just work" for both users and software projects. This task would involve:
- modifying eclasses to generate relocatable files,
- writing scripts to fix absolute paths in binaries (e.g., 1 2),
- packaging in a user friendly format and
- finding creative solutions to platform dependent quirks.
Please e-mail us for more information.
- Python and Bash
- Good understanding of Portage and packaging
OpenRC Extensions
OpenRC is the default init system in Gentoo. It provides a great deal of features while staying mostly agnostic to the underlying implementation of /sbin/init.
The project aims to be a constructive criticism of the systemd approach: it would provide the few interesting features not already implemented by OpenRC as standalone modules, so that integrators do not need to bend their system layouts to accommodate the init system.
Desired extensions include:
- A mechanism by which init scripts can configure OpenRC to detect runtime failures, log them and respond to them. The key response we want to enable is to give regular init scripts respawn functionality like we have in /etc/inittab
- OOM-killer protection via /proc/*/oom_adj
- The ability to perform some sort of maintenance action on a timer (e.g. restart)
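To make the respawn behaviour in the first item concrete, here is a minimal supervisor loop. It is only a sketch of the logic in Python: an actual OpenRC extension would be written in C or shell, and the function name, the retry limit and the logging hook are assumptions for the example.

```python
import subprocess
import sys
import time


def supervise(command, max_respawns=3, delay=1.0, log=print):
    """Run `command`, restarting it after a failure, and log each event.

    The command is started once and respawned at most `max_respawns`
    times. Returns the list of exit statuses that were observed.
    """
    exit_codes = []
    while True:
        code = subprocess.call(command)
        exit_codes.append(code)
        if code == 0:
            log(f"{command[0]}: exited cleanly")
            return exit_codes
        if len(exit_codes) > max_respawns:
            log(f"{command[0]}: failed {len(exit_codes)} times, giving up")
            return exit_codes
        log(f"{command[0]}: exit status {code}, respawning in {delay}s")
        time.sleep(delay)


if __name__ == "__main__":
    # A stand-in "daemon" that always fails, to exercise the respawn path.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_respawns=2, delay=0.1))
```

The retry limit matters: unlike a bare respawn entry in /etc/inittab, a service that fails immediately every time should eventually be reported as failed rather than restarted forever.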
- Knowledge of C, bash and sysvinit
- Knowledge of systemd, upstart, launchd and similar systems
- Understanding of the init process
Port Fedora UEFI Support
Computer manufacturers are adopting UEFI as a BIOS replacement on amd64 systems, but Gentoo is currently unable to boot on such systems using GRUB 0.97. Intel wrote patches for UEFI support that were adopted by Fedora's GRUB fork. Porting those patches from Fedora's GRUB fork to sys-boot/grub is necessary if sys-boot/grub is to remain a viable bootloader in Gentoo.
There are two existing issues in sys-boot/grub that must be addressed in conjunction with this port. The first is that sys-boot/grub does not compile correctly with GCC 4.6, which is bug #360513. The second is that sys-boot/grub's grub-probe utility relies on /dev/root, which newer versions of udev remove. A proper port must compile properly with GCC 4.6 and must not depend on /dev/root.
In addition, sys-boot/grub is GPLv2 licensed, so these improvements must not involve the use of any GPLv3 code.
- Understanding of Gentoo Linux boot process
- Knowledge of C and x86 assembly
- Knowledge of Operating Systems
- Knowledge of QEMU (or access to UEFI capable hardware)
Porthole plug-ins and extensions
Porthole is a GTK+-based frontend to Portage. This project would enable Porthole to improve its ability to manage remote computers, gather and report package statistics, and more. The work would encompass creating:
- A basic Python-based control interface API for linking to remote computers and groups of computers, to gather information about installed packages and to install or update them using portage and/or pkgcore as remote backends. One possible means of making those connections is dev-python/pyro.
- An extension of the work started in the public_api branch of portage, to include running emerge via the public API.
- A simple CLI for gathering and reporting update info through the control interface from the first item.
- A Porthole plug-in for connecting to the control interface, displaying the info in Porthole's views and dispatching desired actions to the remotes via the control interface.
- gtk+, pygtk
Recruiting Webapp Usability
The webapp for Gentoo recruitment at https://recruiting.gentoo.org was written two summers ago. It has developed a backlog for improved usability and it also needs updating for Rails 3.0.
Repoman cleanup
Repoman (the Portage QA tool) could really use some attention. Before elaborating more on how to best solve this I need to check with zmedico.
Repository of self-contained ebuild source packages
This proposal is similar in scope to the Cache sync proposal, except that it will focus on implementing support for repositories that host self-contained ebuild source packages that are analogous to source RPMs (SRPMs). The repository layout will be similar to existing PORTAGE_BINHOST repositories (like those hosted at tinderbox.dev.gentoo.org), and will include a metadata index file which is similar to $PKGDIR/Packages. Each source package hosted in the repository will contain a single ebuild, its metadata, and all files it requires from the portage tree (including all inherited eclasses and any additional files such as patches from the files directory). A zip file will be a suitable container for one of these source packages.
A synchronization script will generate source packages from a source portage tree (or overlay), and it will avoid unnecessary regeneration of a given source package in cases when no relevant files have changed timestamps or been added/removed. This script will be run as often as the repository maintainer wants to synchronize with the source tree.
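A minimal sketch of the "avoid unnecessary regeneration" part, assuming a zip container as suggested above: the script fingerprints the names, timestamps and sizes of the files that go into a source package and rebuilds the archive only when that fingerprint changes. The function names and the in-memory `state` dictionary are assumptions; a real script would persist the state, for example in the metadata index.

```python
import hashlib
import os
import tempfile
import zipfile


def fingerprint(files):
    """Hash the names, mtimes and sizes of the files in a source package."""
    digest = hashlib.sha256()
    for path in sorted(files):
        stat = os.stat(path)
        digest.update(f"{path}\0{stat.st_mtime_ns}\0{stat.st_size}\n".encode())
    return digest.hexdigest()


def sync_package(files, archive, state):
    """Regenerate `archive` only if its inputs changed; return True if rebuilt.

    `state` maps archive paths to the fingerprint of their last build.
    Adding or removing a file changes the fingerprint as well.
    """
    current = fingerprint(files)
    if state.get(archive) == current and os.path.exists(archive):
        return False
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as package:
        for path in sorted(files):
            # Flat layout for brevity; a real package would keep subdirectories.
            package.write(path, arcname=os.path.basename(path))
    state[archive] = current
    return True


def demo():
    """Build, re-run without changes, then touch the ebuild and re-run."""
    with tempfile.TemporaryDirectory() as tmp:
        ebuild = os.path.join(tmp, "foo-1.0.ebuild")
        eclass = os.path.join(tmp, "eutils.eclass")
        for path in (ebuild, eclass):
            with open(path, "w") as handle:
                handle.write("# placeholder content\n")
        archive = os.path.join(tmp, "foo-1.0.zip")
        state = {}
        first = sync_package([ebuild, eclass], archive, state)
        second = sync_package([ebuild, eclass], archive, state)
        stat = os.stat(ebuild)
        os.utime(ebuild, ns=(stat.st_atime_ns, stat.st_mtime_ns + 5 * 10**9))
        third = sync_package([ebuild, eclass], archive, state)
        return [first, second, third]


print(demo())
```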
The synchronization script will have an option to preserve source packages that no longer exist in the source tree, perhaps updating such packages if they happen to contain outdated versions of eclasses. The ability to retain packages that no longer exist in the source tree may be useful for some cases of the stable tree idea which was proposed in GLEP 19.
In order to ensure that repository updates do not interfere with clients, it may be desirable to publish the repository as a series of snapshots that are contained in directories which are named corresponding to snapshot date/time. This will ensure that a previous snapshot is still accessible when a newer snapshot becomes available. So, clients will never have any trouble downloading a specific source package that corresponds to a metadata index which was downloaded earlier and used for a dependency calculation. Without some kind of snapshot mechanism like this (similar to RCU), a race condition would exist such that dependency calculations would be somewhat unreliable, and downloaded source packages might have different checksums and dependencies from those listed in a metadata index that was downloaded only a minute earlier.
- Python (portage/pkgcore) or C++ (paludis)
SELinux policy originator
Gentoo Hardened is maturing its SELinux support rapidly. In SELinux, policies are written in a higher-level abstract format (dictated by the reference policy) and converted to SELinux-specific rules (like allow, dontaudit, type transitions, etc.). When troubleshooting rights, however, it is a very daunting task to find out why a particular rule is set (in other words, to find which higher-level rule causes the SELinux rule to exist).
The rules are converted using the M4 macro language from constructs like "corenet_tcp_bind_http_port" to rules like:
- allow $1 http_port_t:tcp_socket name_bind
- allow $1 self:capability net_bind_service
It is the latter that end users can easily see (with tools such as sesearch) but it is not easy to find that "allow openvpn_t http_port_t:tcp_socket name_bind" comes from "corenet_tcp_bind_http_port(openvpn_t)" defined in the openvpn.te definition.
In this idea, we would like to find a way to register where these lines come from, to improve debugging and troubleshooting.
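The bookkeeping this idea asks for can be illustrated with a toy expander that remembers, for every generated rule, which interface call produced it and in which file. This is only a sketch: real policies are expanded by M4, not by Python, and the `INTERFACES` table, the function name and the simple `$N` substitution are assumptions; the two rule templates are the ones quoted above.

```python
# Toy stand-in for the reference policy's interface definitions.
INTERFACES = {
    "corenet_tcp_bind_http_port": [
        "allow $1 http_port_t:tcp_socket name_bind",
        "allow $1 self:capability net_bind_service",
    ],
}


def expand(call, source_file, interfaces=INTERFACES):
    """Expand one interface call and record the origin of each rule.

    Returns a dict mapping every generated rule to a
    (interface call, source file) pair.
    """
    name, _, arg_text = call.partition("(")
    args = [arg.strip() for arg in arg_text.rstrip(")").split(",")]
    origins = {}
    for template in interfaces[name.strip()]:
        rule = template
        # Substitute the highest-numbered parameters first so that
        # "$1" never clobbers the start of "$10".
        for index in range(len(args), 0, -1):
            rule = rule.replace(f"${index}", args[index - 1])
        origins[rule] = (call, source_file)
    return origins


origins = expand("corenet_tcp_bind_http_port(openvpn_t)", "openvpn.te")
print(origins["allow openvpn_t http_port_t:tcp_socket name_bind"])
```

A lookup table of this shape is what would let a tool answer "where does this sesearch result come from?" without re-reading the whole policy source.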
Support for Fortran modules and libraries with multiple compilers
Fortran modules and libraries are highly compiler dependent. Even a minor version change in gfortran renders them incompatible. However, we are often forced to work with multiple compilers at the same time (e.g. for scientific software development). The project aims to create a Fortran framework that would allow installing concurrent versions of Fortran binaries in a PMS (Package Manager Specification)-aware fashion, along with a configuration module for eselect to rule them all.
See also: Linux problems you never considered handling - Fortran90 modules for multiple compilers (Donnie Berkholz's blog post)
- Shell scripting
Cache sync
The portage tree and all its overlays keep growing. Right now the official portage tree alone occupies more than 600 MB on a regular filesystem. However, the package manager does not need the whole tree of full ebuilds, patches and manifests to perform most of its work. The idea would be to sync a smaller database or cache containing only the information needed for global package manager operations, and then fetch the required package only when needed. This would considerably speed up tree synchronization and reduce the space occupied by the portage tree (see the "Repository of Self-Contained Ebuild Source Packages" idea for an alternative approach). Currently the cache system in portage is also really slow, and so is the search feature. The project could be inspired by the Debian or RPM systems, but with the usability and choices offered by Gentoo, and would probably include:
- designing and implementing an automatic cache builder whose output is produced for a given repository/overlay
- making portage/paludis/pkgcore do a delta-sync with a local cache and fetch only the files required for installation when requested
- Python (portage/pkgcore) or C++ (paludis)
Package statistics reporting tool
This project is a user-end program that uploads anonymous information about the packages installed on a user's machine to a database that package maintainers and developers have access to. Last year's effort is called Gentoostats.
This post from Planet Gentoo, titled 'Gentoo: A critical look at the QA process', suggests that it is difficult for maintainers to decide whether or not to mark a package as stable. The main issue is that if a package is not marked as stable then users won't use it, and it's very difficult for maintainers to test the package properly on their own. This blurs the line between ~arch and arch. If stable packages are left in ~arch for long periods of time, users will begin to use ~arch as if it were stable more often, which defeats the purpose of ~arch in the first place. If maintainers push packages into arch because they work on their own machines, and the packages then break for users because of the high number of different system setups, this makes Gentoo look unreliable and further blurs ~arch and arch.
Here are some reasons why this project would help Gentoo:
- Currently, developers can't see when users are happy with a package, only when they are not.
- Finding incompatibilities between specific package versions
- General user interest in specific packages to help trim down unused packages from portage.
- A programming language and GUI toolkit
- Understanding of databases and SQL.
- Understanding of portage
Tags support for Portage
Gentoo uses categories now. A package can only be in a single category, which is very limiting because things generally don't fit perfectly into one place without other possibilities. Tags could make it a lot easier for users to find the packages they're looking for by doing Boolean searches like: kde AND mail. This project would add support for tags to Portage and would allow for backwards compatibility by treating categories as tags.
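A Boolean AND search over tags, with categories folded in as tags for backwards compatibility, can be sketched in a few lines. The package-to-tag mapping and the function names below are made up for the example; how tags would actually be stored in Portage is exactly what the project would have to design.

```python
# Hypothetical tag data; in a real implementation this would come from
# package metadata rather than a hard-coded dictionary.
TAGS = {
    "kde-base/kmail": {"kde", "mail"},
    "mail-client/thunderbird": {"mail", "gtk"},
    "kde-base/konsole": {"kde", "terminal"},
}


def with_category_tags(tags_by_package):
    """Treat each package's category as one more tag (backwards compatibility)."""
    return {
        package: tags | {package.split("/", 1)[0]}
        for package, tags in tags_by_package.items()
    }


def search_all(tags_by_package, *wanted):
    """Return the packages that carry every requested tag (Boolean AND)."""
    wanted = set(wanted)
    return sorted(
        package
        for package, tags in tags_by_package.items()
        if wanted <= tags
    )


print(search_all(TAGS, "kde", "mail"))
print(search_all(with_category_tags(TAGS), "mail-client"))
```

Representing each package's tags as a set makes AND a subset test; OR and NOT would map to set intersection and difference in the same way.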
Automatically generated overlay of R packages
The R scientific language has a vast ecosystem of about 5000 packages and 40000 individual versions. It was attempted in the past to make Portage manage these packages directly. The last project was somewhat successful but suffered from some drawbacks. Some of them, like the lack of manifests, are a result of the particular kind of implementation which was chosen.
This project is a completely different implementation, consisting of a set of scripts which pull all the necessary information from CRAN, BIOC and other mirrors and create a central overlay. This overlay will be hosted on Gentoo infrastructure and can be added by Gentoo users using Layman. As far as the regular user is concerned, this overlay is no different from other overlays where ebuilds are written manually. The generated packages and eclasses must be compatible with all current package managers used in Gentoo.
In order for this project to be successful, it will need not only to provide the system which generates the overlay, but also to solve issues like fixing errors in R package metadata (quite common), R and system dependency verification, distfiles mirroring, manifest generation, automatic incremental overlay updates, proper logging, and a few others. The end result of this project must be something that the Gentoo infrastructure team will agree to install on Gentoo infrastructure. Discussion with them will be necessary well before the completion of the project to confirm it is going in the right direction. The demonstration of the system actually working will need to be done on GSoC dedicated hardware. Proper documentation of what the system does and how will be considered crucial.
It is not necessary to have strong R skills for this project. The little R knowledge which is actually needed can be picked up in a few hours of reading.
It is recommended you talk to one of the contacts below before you make an official application.
- Systems and networking