Hadoop packages for Debian
Description of the project: Debian currently does not include Hadoop packages. Reliable Hadoop packages in Debian would make installation easier.
Confirmed Mentor:
How to contact the mentor:
Confirmed co-mentors:
Deliverables of the project: Debian build scripts for building Hadoop packages (including related projects such as Spark and Mahout), integrated with either the Debian Java Packaging team or the Apache Bigtop project. Wherever possible, the packages should use .jar files from existing Debian packages instead of bundling copies of those .jar files in every package. The deliverables must improve on the existing Bigtop packaging efforts.
Desirable skills: The student should have prior experience with packaging both Debian software and Java software, as these two packaging approaches need to be aligned carefully. It is therefore beneficial to have worked with Apache Bigtop before, since that is where the current packaging efforts are located.
What the student will learn: It will be essential to reconcile the needs of different stakeholders. Hadoop developers want to move forward quickly: Hadoop changes, new dependencies are introduced, and users are expected to build Hadoop from source. Maven automatically downloads and uses the relevant .jar files, but it does not use the distribution's package manager. Hadoop users want the latest version, but also backwards compatibility and upgrade paths. Distributors want reliable builds and upgrades, and want to avoid duplicated work, for example on security updates. Many Java applications ship duplicate copies of their dependencies' .jar files; for a clean Linux distribution, this is not desirable. Instead, distribution-supplied packages should be used, and Java dependencies need to be translated into Debian dependencies. A possible technical direction would therefore be to automate this dependency translation, or to integrate Maven with Debian package management. This project will likely not require major algorithmic skills, but it will require communication skills and planning. A major part of the learning will be on the system administration side: how to manage software so that installation, administration, and upgrades are easy to handle at large scale.
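Illustration of the dependency translation idea: the following is a minimal sketch, not an existing tool, of how the dependencies declared in a Maven POM could be translated into Debian package names. The mapping table, the function name translate_dependencies and the example POM are all assumptions made for illustration; a real solution would need a maintained, versioned mapping and would have to decide how to handle dependencies that are not packaged in Debian at all.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical, hand-maintained mapping from Maven coordinates
# (groupId, artifactId) to Debian binary package names.
# Illustrative only; a real table would be far larger and version-aware.
MAVEN_TO_DEBIAN = {
    ("com.google.guava", "guava"): "libguava-java",
    ("org.apache.commons", "commons-lang3"): "libcommons-lang3-java",
    ("org.slf4j", "slf4j-api"): "libslf4j-java",
}


def translate_dependencies(pom_xml):
    """Return (debian_packages, unmapped_coordinates) for a POM string.

    Only the top-level <dependencies> section is read; test-scoped
    dependencies are skipped because they are not needed at runtime.
    """
    root = ET.fromstring(pom_xml)
    packages = set()
    unmapped = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        scope = dep.findtext("m:scope", default="compile", namespaces=POM_NS).strip()
        if scope == "test":
            continue
        debian_name = MAVEN_TO_DEBIAN.get((group, artifact))
        if debian_name is None:
            # No Debian package known: report it instead of guessing.
            unmapped.append(f"{group}:{artifact}")
        else:
            packages.add(debian_name)
    return sorted(packages), unmapped


# Made-up POM fragment used only to exercise the sketch.
EXAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>11.0.2</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.5</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>not-packaged</artifactId>
      <version>1.0</version>
    </dependency>
  </dependencies>
</project>"""

if __name__ == "__main__":
    packages, unmapped = translate_dependencies(EXAMPLE_POM)
    print("Depends: " + ", ".join(packages))
    print("Not packaged in Debian (per the table): " + ", ".join(unmapped))
```

The list of unmapped coordinates is the interesting output for a packager: each entry is a dependency that would otherwise be downloaded by Maven at build time and that either needs its own Debian package or an entry in the mapping.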