#language en
-----
= Recursively building Java dependencies from source =
-----
== Name ==
'''Ashutosh Agarwal''' (''Radsaggi'')
== Contact ==
* Email: ashutosh.ee12@iitp.ac.in, radsaggi-thecoder@yahoo.in
* IRC: ''radsaggi''
== Background ==
I am a sophomore undergraduate student at the Indian Institute of Technology, Patna, studying Computer Science and Engineering. <<BR>>
I am proficient in Java, with over 7 years of working experience, and developed a couple of Java applications back in high school (all my previous projects are posted on [[https://github.com/Radsaggi|github|target="_blank"]]). I have working knowledge of Apache Ant and Apache Maven, and am studying Jenkins for the purpose of this project. I have been using git for over a year now and am well acquainted with svn. I am a Linux fan and have been working on Linux as my primary OS for more than 4 years now. <<BR>>
I make a strong applicant because I can think outside the box, question the dogma in everything, try the untested, and have a keen inclination towards working on any project that I can imagine, even if I cannot yet see a way to accomplish it. I work steadily towards my goal, learning and testing, sometimes falling, and finally reach the feat I had imagined. I am a fast learner. I have an inner compulsion to find out how things work and why they exist, and I strive to relate my existing knowledge to the subject at hand so that I can grasp it well. <<BR>>
Other language skills include C, Scala, PHP, HTML, CSS.
----
== Project title ==
'''''RECURSIVELY BUILDING JAVA DEPENDENCIES FROM SOURCE'''''
== Project details ==
''Many Java projects use a combination of JAR file dependencies from other Java projects. In some cases, third party JARs are also used to provide custom tools for the build process (e.g. custom ant tasks or maven plugins). Can the entire hierarchy of dependencies and build tools and their transitive dependencies all be built from published source code? If not, we cannot be certain that a particular dependency JAR is clean, free of malicious code, easy to fix or adapt for future Java versions. This project aims to develop automated mechanisms for cataloguing the portfolio of Java libraries on sites like github and the Maven Central Repository, creating a database of dependencies, mirroring their source repositories, removing binary JARs from their source trees and trying to build them using symlinks to JARs found in Debian or built by the same recursive process. This project may be partially automated using a tool like Jenkins. Data for some of the dependencies can be harvested from Maven pom.xml files.''
== Synopsis ==
The target of the project is to build a tool that can find source code and properly build from scratch the dependencies of any given project and publish them to a local repository. If the source cannot be located or the dependency is available only in binary format, it should emit a warning and make a log of such cases.
== Benefits to Debian ==
The tool can be helpful in the following ways:
* ''Segregation of truly free software from software that does not provide 100% source'': This category could include software that provides incomplete or no source code. In this case the tool would fail to compile the source into usable JARs and would terminate or emit a warning. This would give developers a reasonable basis for choosing library dependencies.
* ''Reporting mismatch of licenses'': We can determine mismatch in licenses and terms in the dependency hierarchy.
* ''Determining abandonware'': As pointed out by Daniel Sir, we could find out whether the published JARs are too old, to determine the possibility that their development has been discontinued.
----
== Deliverables ==
* '''''The Data Set''''':
* ''A database schema'' tracking each piece of software
* ''A local Maven repository'' only containing JARs that we have built locally from some source
* ''A set of Git/Svn repositories'' to mirror the upstream repositories of projects that need to be tweaked
* '''''The Tool Set''' a command line tool to build a given artifact and its dependencies recursively with the following components'' :
* ''Source Discoverer'' to work out all the possible ways to get the source repository for an artifact or identify the existence of source tars/jars.
* ''Build Discoverer'' to recognize the build system of the artifact.
* ''Dependency Discoverer'' to determine the dependencies of the package.
* ''Jenkins Config'' Jenkins would be used to build the JARs.
* ''Unit Tests'' will be executed explicitly, to make sure everything is built and working as expected.
* ''User Interface'' A user interface would be needed for inserting data sources wherever they could not be determined by the tool.
* '''''The Reports''''':
* For what percentage of projects could we determine the license? Could we spot any license mismatch in the dependency hierarchy?
* Which tools/APIs in the chain of dependencies don't provide any source code? Are they optional tools (such as code quality analysis tools) that we can skip in the build process (e.g. by producing a mock version of the tool or plugin)?
* Which non-free/sourceless JARs are most widely depended upon by other projects in the free Java eco-system? Can we make a list of the top 10 or 20?
* Abandonware: Can we detect JARs that haven't been updated for an extended period of time, or those with no activity in the source repository?
== Implementation Details ==
* '''''The Data Set''''':
* ''A database schema'', perhaps based on an SQLite file residing on the local system, tracking the following for each piece of software:
* The binary artifact
* The source repository location (Git or svn URL)
* Source tarball/JAR location and availability
* Dependency relationships (including versions)
* License information
* ''A local Maven repository'' only containing JARs that we have built locally from some source. This will be different from ${HOME}/.m2/repository
* ''A set of Git/Svn repositories'' to mirror the upstream repositories of projects that need to be tweaked
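As a rough illustration of the fields listed above, the record kept for one artifact could be modelled as in the following minimal Java sketch. The class name, field names and the "group:artifact:version" coordinate format are hypothetical choices made for this example and are not part of any agreed schema.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory mirror of one row of the tracking database. */
class ArtifactRecord {
    // Maven-style coordinates identifying the binary artifact.
    final String groupId;
    final String artifactId;
    final String version;
    // Git or svn URL of the upstream source; null while undiscovered.
    String sourceRepositoryUrl;
    // Location of a source tarball/JAR; null if none is known.
    String sourceArchiveUrl;
    // License as declared upstream; null if it could not be determined.
    String license;
    // Coordinates ("group:artifact:version") of the direct dependencies.
    final List<String> dependencies = new ArrayList<>();

    ArtifactRecord(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    /** Coordinates in the same format used for the dependency list. */
    String coordinates() {
        return groupId + ":" + artifactId + ":" + version;
    }

    /** True when at least one route to the source has been recorded. */
    boolean hasSource() {
        return sourceRepositoryUrl != null || sourceArchiveUrl != null;
    }
}
```

Records for which `hasSource()` is false would be the ones the tool logs with a warning.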
* '''''The Tool Set'''''<<BR>>
''The command line tool would be Java based. It can be installed as a JAR invoked through a shell script (something along the lines of the Maven executable).'' <<BR>>
''My idea is to first create a dependency graph (including only dependencies not previously built) and then determine the traversal order algorithmically, building from the lowest-level dependencies upwards. There are two alternatives for circular dependencies: (1) try to build them together, or (2) build either one against a prepackaged binary of the other, build the second against that freshly built binary, and then check the second against its prepackaged version. For each dependency in the graph, try to locate the source and build it, and continue this iteration over the entire graph. Parallel execution might be considered to speed things up.'' <<BR>>
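The "lower to higher" traversal could be sketched with Kahn's topological sort, which is one possible choice and not a decided design. The sketch below assumes the graph is given as a map from each artifact to the artifacts it needs. The names `BuildOrderPlanner`, `Plan`, `buildOrder` and `blockedByCycle` are hypothetical. Artifacts that sit on a circular dependency (or depend on one) are only reported here; applying alternative (1) or (2) to them is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class BuildOrderPlanner {
    /** Artifacts in buildable order, plus those held up by a cycle. */
    static class Plan {
        final List<String> buildOrder = new ArrayList<>();
        final Set<String> blockedByCycle = new TreeSet<>();
    }

    /** dependsOn maps each artifact to the artifacts it needs. */
    static Plan plan(Map<String, List<String>> dependsOn) {
        // Count unbuilt dependencies per artifact and index the reverse edges.
        Map<String, Integer> pending = new TreeMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            pending.putIfAbsent(e.getKey(), 0);
            for (String dep : e.getValue()) {
                pending.putIfAbsent(dep, 0);
                pending.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Start with the artifacts that need nothing else (the lowest level).
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        Plan plan = new Plan();
        while (!ready.isEmpty()) {
            String built = ready.remove();
            plan.buildOrder.add(built);
            // Each dependent now has one fewer unbuilt dependency.
            for (String d : dependents.getOrDefault(built, List.of())) {
                if (pending.merge(d, -1, Integer::sum) == 0) {
                    ready.add(d);
                }
            }
        }
        // Anything never released is on a cycle or depends on one.
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() > 0) {
                plan.blockedByCycle.add(e.getKey());
            }
        }
        return plan;
    }
}
```

Artifacts that become ready at the same time do not depend on each other, so they are natural candidates for the parallel execution mentioned above.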
''The components of the tool are described below'':
* ''Source Discoverer'' The source code is sometimes available in the Maven repository and sometimes referenced in the pom.xml file. We will need to mirror the remote repository into our local git and create a branch or, in the case of a source tar/jar, extract it into a local repo. Sources need to be resolved only for dependencies: the original package we are trying to build would be provided to the tool as part of a local repo or as a link to a remote repo/tar/jar.
* ''Build Discoverer'' to determine the existence of pom.xml/build.xml and try to parse it.
. ''[Advanced]'' For projects that have no recognizable build system, the program would need to try to generate a suitable build.xml/pom.xml and store it in a local clone of the repository.
* ''Dependency Discoverer'' to test the artifact's source tar/jar or repository for binary artifacts or other dependencies listed in pom.xml or build.xml. I propose to use Apache Ivy for this purpose. It has been built for Apache Ant, works well for Maven and can be customized easily. We need to make a record of all dependencies in our local database. A record of existing binaries would also be added to the local database so that we can symlink them from a trusted source when building.
* ''Jenkins Config'' Jenkins would be used to build the JARs. A tool would build the Jenkins job config file for the artifact, pointing Jenkins to the local git repository and creating the build.
* ''Unit Tests'' JUnit will be supported. My suggestion would be to let the build tool handle this and capture any errors or failed tests that occur along the way.
* If the project is a Maven or Ivy project, then there are likely to be attempts to find dependencies during the build process. Running under Jenkins, these tools should be tweaked in such a way that they only look in the local repository and use dependencies that we have already built. Thus we will need to override the pom.xml file to point it to our local maven repository.
* ''User Interface'' Sometimes, the system would be unable to proceed (e.g. because there are no clues about source locations in a given pom.xml). A user interface would need to be constructed to show a list of artifacts with exceptions and allow the user to manually locate the source and supply the URLs. The system would then continue iterating with this new data. Since the entire tool is CLI based, this should pose little difficulty. The interface could optionally be based on Swing (I have good experience of working with Swing).
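To illustrate the kind of clues the Source Discoverer and Dependency Discoverer would look for in a pom.xml, here is a minimal sketch that reads the `<scm><connection>` element and the direct `<dependencies>` using only the JDK's XML parser. It is a simplification for illustration and does not use Apache Ivy as proposed above. The class name `PomInspector` and its fields are hypothetical, and parent POMs, properties and dependency management are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class PomInspector {
    /** Value of <scm><connection>, or null when the POM declares none. */
    final String scmConnection;
    /** Direct dependencies as "groupId:artifactId:version". */
    final List<String> dependencies;

    private PomInspector(String scmConnection, List<String> dependencies) {
        this.scmConnection = scmConnection;
        this.dependencies = dependencies;
    }

    static PomInspector parse(String pomXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)));
            Element project = doc.getDocumentElement();
            Element scm = child(project, "scm");
            String connection = scm == null ? null : text(scm, "connection");
            List<String> found = new ArrayList<>();
            Element deps = child(project, "dependencies");
            if (deps != null) {
                NodeList nodes = deps.getChildNodes();
                for (int i = 0; i < nodes.getLength(); i++) {
                    Node n = nodes.item(i);
                    if (n instanceof Element && "dependency".equals(n.getNodeName())) {
                        Element d = (Element) n;
                        // A <version> inherited from a parent POM appears as "null" here.
                        found.add(text(d, "groupId") + ":" + text(d, "artifactId")
                                + ":" + text(d, "version"));
                    }
                }
            }
            return new PomInspector(connection, found);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse POM", e);
        }
    }

    /** First direct child element with the given tag name, or null. */
    private static Element child(Element parent, String name) {
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Trimmed text of a direct child element, or null if absent. */
    private static String text(Element parent, String name) {
        Element c = child(parent, name);
        return c == null ? null : c.getTextContent().trim();
    }
}
```

A null `scmConnection` corresponds to the case where the tool cannot proceed on its own and the artifact would be listed in the user interface for manual input.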
* '''''The Reports'''''<<BR>>
''All the reports can be generated by any separate tool that examines the log files and the database set up. This should merely comprise querying the database and aggregating the result set into useful data chunks.''
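As an example of such aggregation, the "top 10 or 20 sourceless JARs" report could be computed roughly as follows. This sketch assumes the dependency relationships and the set of sourceless artifacts have already been read from the database into memory. The class name `SourcelessReport` and the method `topSourceless` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SourcelessReport {
    /**
     * Ranks sourceless artifacts by how many projects depend on them.
     * dependsOn maps each project to its direct dependencies.
     */
    static List<String> topSourceless(Map<String, List<String>> dependsOn,
                                      Set<String> sourceless, int limit) {
        // Count each dependent project once per sourceless artifact.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> deps : dependsOn.values()) {
            for (String dep : new TreeSet<>(deps)) {
                if (sourceless.contains(dep)) {
                    counts.merge(dep, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; the stable sort keeps ties in alphabetical order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (top.size() == limit) {
                break;
            }
            top.add(e.getKey() + " (" + e.getValue() + ")");
        }
        return top;
    }
}
```

The same pattern (filter, count, sort, truncate) would apply to the license and abandonware reports, with a different filter in each case.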
== Project schedule ==
Unit Tests will be written with each component and are not explicitly mentioned here.
* '''''Phase I '''''
* Set up Database, Local Repositories
. Code Source Discoverer
. ''21st April, 2014 - 11th May, 2014''
* Code Basic Build Discoverer
. Code Dependency Discoverer
. ''12th May, 2014 - 1st June, 2014''
* Code Jenkins Configuration Creator
. ''2nd June, 2014 - 22nd June, 2014''
* '''''Mid Term Evaluation'''''
* '''''Phase II'''''
* Code Advanced Build Discoverer
. ''23rd June, 2014 - 13th July, 2014''
.
* Code Unit Test Executor
. Code User Interface
. ''14th July, 2014 - 31st July, 2014''
.
* Code Reports Generator
. ''1st August, 2014 - 17th August, 2014''
* '''''End Term Evaluation'''''
----
== Exams and other commitments ==
No examinations or other commitments between 19th May, 2014 and 22nd August, 2014. <<BR>>
However, I will be rejoining college on 23rd July, 2014. My ''contact'' hours may reduce slightly, but my ''effort'' hours will remain more or less the same.
== Other summer plans ==
''None''
== Why Debian? ==
Debian is one of the premier names in the Linux world. Known for its stability and customizability, Debian is the one distro for all. It would be my pleasure to work under the mentorship of such a well-established organisation.
== Other GSoC projects ==
I am applying with two other organizations for GSoC.
----------