- Vipin Nair
IRC Nick: 'swvist'
I am a master's student in computer science and applications at the National Institute of Technology, Calicut, with an undergraduate degree in mathematics. I am a long-time FOSS advocate (I even got my laptop autographed by RMS) and have been using Debian Sid for a long time. I see this project as an opportunity to contribute code. I also intend to stick around after the GSOC project and stay involved with the community.
I have been coding in Python for quite some time and have good experience in web development. Pattern recognition, machine learning and artificial intelligence are my fields of interest. I have developed a neural network based application for handwritten character recognition, which is available on GitHub. I am used to version control systems for maintaining my projects and have used git extensively.
Why am I the best person for this project?
- I have the relevant experience and understand the project requirements.
- I have gone through the debexpo source code and familiarized myself with it.
- I have a preliminary implementation road map.
- This project falls under my areas of interest and I see it as a personal milestone.
- I am looking forward to contributing back to the GNU/Linux distribution I have been using for the past few years.
- Semantic Package Review Interface for mentors.debian.net
- The project aims to speed up the process of getting new packages into Debian by using machine learning techniques to match Debian packages uploaded by volunteers to prospective sponsors in the interface provided by mentors.debian.net.
- Every package uploaded to Debian by volunteers is scrutinized by experienced developers before it makes its way into the distribution. Getting a package reviewed and sponsored is often time-consuming because of limited manpower. The goal of this project is to minimize the time required to get a package into Debian by streamlining the review process: packages would be recommended to prospective sponsors by analyzing the existing data. Meta information extracted from each uploaded package will be matched against prospective sponsors, based either on the interests they register on m.d.n or on an analysis of their history: packages sponsored in the past, packages they maintain and other available information. The project can be divided into three distinct phases.
- Meta Data extraction from uploaded packages
- From any uploaded package, we need to be able to extract relevant information which will be used to identify potential sponsors. Data can be obtained from the package and the project documentation. Fields such as debtags, dependencies, description and section will be used to generate the package meta data. Existing Debian QA tools like Lintian could also be used to gather data. A Python module (python-debian) for handling Debian file formats can be integrated with Debexpo and used to extract meta data from the uploaded package. If the existing tool is found to be lacking, additional code will be written to handle the extraction, and any addition to the python-debian module that is relevant to that project will be sent upstream for inclusion.
- Identifying sponsors' preferences
- A Debian Developer could have preferences when (s)he chooses to sponsor a package. These preferences could be collected explicitly from the prospective sponsor through the m.d.n interface, and implicitly by analyzing existing data such as past uploads and packages maintained, along with derived data such as section, source language and package dependencies. The main focus would be on identifying trends, for example a developer who usually sponsors packages that depend on a package (s)he maintains. Identifying such trends can help immensely in generating recommendations. UDD and the data collected by the Debian Team Activity Metrics project could be used as additional data sources to strengthen the process.
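The extraction step could be sketched as follows. This is a minimal illustration that uses only the Python standard library to read one control-file paragraph; a real implementation would more likely rely on python-debian as described above. The function names (`parse_control_paragraph`, `split_relations`, `extract_metadata`), the output keys and the sample package are all hypothetical.

```python
import re


def parse_control_paragraph(text):
    """Parse one RFC 822-style control paragraph into a dict of fields."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t":
            # Continuation line: append it to the field being read.
            if current is not None:
                fields[current] += "\n" + line.strip()
            continue
        name, _, value = line.partition(":")
        current = name.strip()
        fields[current] = value.strip()
    return fields


def split_relations(value):
    """Return bare package names, dropping version and arch restrictions."""
    names = []
    for clause in value.split(","):
        for alternative in clause.split("|"):
            name = re.sub(r"[\(\[<].*", "", alternative).strip()
            if name:
                names.append(name)
    return names


def extract_metadata(control_text):
    """Build a small meta data record (hypothetical layout) for one package."""
    fields = parse_control_paragraph(control_text)
    description = fields.get("Description", "")
    return {
        "package": fields.get("Package") or fields.get("Source"),
        "section": fields.get("Section"),
        "depends": split_relations(fields.get("Depends", "")),
        "summary": description.split("\n", 1)[0],
    }


# Made-up sample paragraph, for illustration only.
SAMPLE = """\
Package: hello-ocr
Section: graphics
Depends: python (>= 2.6), python-numpy | python-numeric,
 libc6
Description: handwriting recognition demo
 Longer description text.
"""

metadata = extract_metadata(SAMPLE)
```

Alternatives in a dependency field (`a | b`) are flattened into separate names here, on the assumption that either one says something about the package's area; debtags and Lintian output would need their own data sources.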
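As a rough sketch of the implicit side of this phase, a sponsor profile could be derived from past sponsorships. The record layout, the function name `build_sponsor_profile` and the sample data are assumptions made for illustration; real input would come from sources such as UDD.

```python
from collections import Counter


def build_sponsor_profile(sponsored, maintained):
    """Summarize a sponsor's history.

    sponsored:  list of dicts with "section" and "depends" keys (hypothetical)
    maintained: set of package names the sponsor maintains
    """
    maintained = set(maintained)
    sections = Counter(p["section"] for p in sponsored if p.get("section"))
    total = sum(sections.values())
    # Share of past sponsorships that fall in each section.
    section_weights = {s: n / total for s, n in sections.items()} if total else {}
    # Fraction of sponsored packages depending on something the sponsor maintains.
    dep_hits = sum(1 for p in sponsored if maintained & set(p.get("depends", [])))
    dep_affinity = dep_hits / len(sponsored) if sponsored else 0.0
    return {
        "sections": section_weights,
        "maintained": maintained,
        "dep_affinity": dep_affinity,
    }


# Made-up history, for illustration only.
history = [
    {"section": "python", "depends": ["python-numpy"]},
    {"section": "python", "depends": ["libc6"]},
    {"section": "graphics", "depends": ["python-numpy", "libc6"]},
    {"section": "science", "depends": []},
]
profile = build_sponsor_profile(history, {"python-numpy"})
```

A high `dep_affinity` would point to the trend mentioned above, a developer who tends to sponsor packages that depend on something (s)he maintains.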
- Matching packages and prospective sponsors
- The most important task of this project would be to match uploaded packages to prospective sponsors. Multiple supervised and unsupervised learning techniques could be used to generate package recommendations, and the one that performs best will be used. Packages or prospective sponsors could be grouped using different (fuzzy) clustering algorithms, or a basic association analysis technique could be applied over the set of sponsors and packages.
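Before any learning technique is chosen, a simple baseline is useful for comparison. The sketch below ranks sponsors by Jaccard similarity between a package's feature set and each sponsor's interest set. It is not one of the clustering or association analysis methods named above, only a reference point, and the feature naming scheme (`section:`, `dep:`, `tag:`), the function names and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def package_features(meta):
    """Flatten a meta data record into a set of prefixed feature strings."""
    features = set()
    if meta.get("section"):
        features.add("section:" + meta["section"])
    features.update("dep:" + d for d in meta.get("depends", []))
    features.update("tag:" + t for t in meta.get("debtags", []))
    return features


def recommend_sponsors(meta, sponsor_interests, top_n=3):
    """Return up to top_n (sponsor, score) pairs, best match first."""
    features = package_features(meta)
    scored = [
        (jaccard(features, interests), name)
        for name, interests in sponsor_interests.items()
    ]
    # Drop sponsors with no overlap, then sort by score and name.
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top_n]]


# Made-up package and sponsors, for illustration only.
package = {
    "section": "graphics",
    "depends": ["python-numpy", "libc6"],
    "debtags": ["implemented-in::python"],
}
sponsors = {
    "alice": {"section:graphics", "dep:python-numpy"},
    "bob": {"section:net", "dep:libc6"},
    "carol": {"section:games"},
}
ranking = recommend_sponsors(package, sponsors)
```

Any clustering or association analysis approach tried later can be judged against this ranking using the same sponsor feedback.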
Benefits To Debian
- This project could speed up the inclusion of new packages into Debian by recommending uploaded packages to prospective sponsors, reducing the time a sponsor takes to review a package and upload it to the Debian repository. I am also willing to work on this project post GSOC, including the creation of an integrated code browser and review interface, which should speed up the entire process further.
- Algorithm to extract meta data from uploaded packages.
- Algorithm to identify developer preferences by analyzing data from different sources.
- Algorithm to recommend uploaded packages to prospective sponsors.
- Web interface to display the data from the above algorithms (meta data display UI, sponsor preferences UI and recommended packages UI).
- Understand Debian packaging system. Getting acquainted with the Debexpo system. Research on different methodologies that could be followed for the above mentioned problem. Identifying different data sources that could be input to various algorithms.
Week 00-02: Phase one work.
- Read up on the Debian packaging system and the format of a Debian package. Analyze data from Debian packages and identify relevant information. Extract data using the python-debian module. Work around any limitation of the python-debian module by writing additional code. Document the code and other work done.
Week 02-04(5): Phase two work
- Identifying data sources that could provide relevant information. Sources include UDD and data from Team Activity Metrics project. Extraction of information from available data sources that could help identify sponsor preferences. Develop web interface and data storage mechanism to accept sponsor preferences explicitly. Develop web interface to present extracted meta data information of uploaded package. Perform trend analysis on the extracted sponsor data. Generate Data set for next phase. Document the code and other work done.
Week 0(4)5-09: Phase Three work Begins
- Generate data sets for use in algorithms. Implement various machine learning algorithms to match sponsors' preferences to package meta data, including various clustering algorithms and association analysis. Test the results obtained to identify the best approach. Develop UI to present recommendations of different algorithms. Collect user feedback for different algorithms and apply tweaks based on it. Document code and other work done.
- Buffer period to complete any pending tasks. Write proper documentation. Write tests if applicable and fix bugs. Optimize/Refactor existing code. If time permits, proceed with post GSOC work.
- I am willing to work on this project post GSOC. Additional modules such as a unified code browser and review interface were proposed, which could speed up the package review process further. I am willing to work on these as well and can dedicate time to this project every week.
- I have been a full-time Debian (sid) user for quite some time and this would be a wonderful opportunity to contribute back to Debian. I am also fascinated by the way the Debian community works and would love to work with it, interact with it and be a part of it. I do plan to stick around post GSOC and contribute to Debian.
Exams and other commitments
Other summer plans
- None. I will consider GSOC as a full time internship and work accordingly.
Are you applying for other projects in SoC?
- Yes. I am very much interested in working for Debian and I have applied for one more project under the Debian organization.