Name Aarsh Shah
Contact/Email: aarshkshah1992@gmail.com
Background:
- → I am a third year computer science undergrad with a deep interest in Information Retrieval , Natural Language Processing ,Graph Algorithms and Probability Theory and have also taken courses at university for these subjects.
- → The programming languages I am quite comfortable with include C,C++ and Python . I've done quite some interesting projects in all three languages. Also,picking up a new language pretty quickly shouldn't really be a problem. The following is a list of some interesting hobby projects I've done :-
- a.) Implemented an automatic text summarization program which could summarize a document to 40 % of it's original size without much loss of information.
- b.) Implemented the Ant Colony Optimization method ,a swarm intelligence algorithm by Michael Dorgio which can give good approximations of the Traveling Salesman Problem in polynomial time .
c.) Implemented a statistical analysis based IRC chat analyser which based on the analysis of an IRC chat log, could identify the speakers of some sentences if the names of the speakers were deleted. It relied heavily on the content and pattern of the sentence and previous discussions in the log. Got the second prize in India at a NLP programming contest for this (I know it does not matter much though. )
- d.) Have been working on implementing state of the art weighting algorithms in Xapian for quite some time now,have already implemented the classical Tf-Idf weighting scheme which has already been merged in Xapian. Have pull requests pending for a couple of Divergence from Randomness schemes ,will get merged soon, hopefully this summer itself.
- → Am quite comfortable with open source development now that I've been working on open source for 4 months. Have learnt a lot about good coding conventions, writing good tests using test suites, the importance of documentation, version control systems such as git and about the spirit of open source, feedback, collaboration and community.
- → Am comfortable with gcc, gdb , valgrind, autotools and cmake . Also, learning any new tool quickly shouldn't really be much of a problem.
What makes me the best person to work on this project:
→ The project is about improving Debian's search results on http://search.debian.org/ and http://lists.debian.org/search.html.
- → All these searches are powered by Xapian and Omega and as someone who has a decent experience of hacking on Xapian, I think I'll really be able to justify the project.
- → Also,IR has been my personal interest and hobby for quite some time now and I am quite comfortable with many IR concepts which will definitely help me in improving our search results.
- → Even though I agree that I have yet to learn a lot, I also have quite a good experience of hacking on open source codebases and that experience also makes the best man for the job. I've already got a couple of patches merged for Xapian and a couple are waiting to be merged once some issues are settled.
- → I have read and rearead Amati's research paper's and his thesis on the Divergence from Randomness framework which is a state of the art IR weighting framework known to outperform most known weighting schemes including the ones currently implemented in Xapian . So, I think I am quite strong on the theory side and an intricate understanding of all the issues involved there will help me write a good implementation for Xapian ,which will definitely make Debian's current search a lot more powerful than it already is. Also, I plan to introduce many changes and features to Debian search such as spell check, resistance to keyword spamming and query expansion based on relevance feedback using extremely efficient algorithms which are absent right now and will definitely make the Debian search experience a lot better for users.
Patches I've written:
Links to some DFR code I've written which will make Debian search a lot more powerful:
- → Implementation of the DPH and PL2 weighting schemes of the DFR framework which are known to outperform BM25,the current weighting scheme used in Debian search.
Links to some IR code I've written for Xapian:
- → The Tf-Idf weighting scheme.
→ Adding test coverage for the ?TradWeight weighting scheme.
Project title Improving Debian's Search Results and the Search Interface
Mentor: Olly Betts
Co-Mentor: Dan Colish
Project details:
→ Debian's search on http://search.debian.org/ and http://lists.debian.org/search.html is powered by Xapian (and it's indexing and CGI front Omega), an open source C++ Information retrieval library which is widely used both in the commercial and the academic world. This search is the main gateway used by the massive Debian community for obtaining any kind of information about Debian.
- → A major component of this project is to drastically improve Debian's search results by empowering Xapian with the start of the art weighting algorithms of the Divergence From Randomness framework written by Gianni Amati. DFR is not a weighting scheme in itself but rather a framework capable of generating a plethora of weighting schemes, each with different capabilities which can adapted to the different places where Debian uses search depending on the needs of the corresponding database and the user community.
- → Debian search currently lacks the query expansion feature and so when a user runs a search, he does not get suggestions for keywords which can help improve his search results. Another major component of this project is to implement query expansion in Debian search so that when a user runs a search, depending on what he has searched for and the documents he marks as relevant , he will get suggestions for keywords to add to his query which will get him a diversified range of more relevant results. The entire interface will be interactive.
- → Xapian currently does not have support for performing query expansion using state of the art weighting schemes like BM25 and the DFR schemes and instead uses a hard coded formula for query expansion which does not perform well compared to other weighting schemes for query expansion. So, an other subcomponent of this project is to add support for query expansion using BM25 and multiple DFR weighting schemes which will greatly improve the quality of query expansion. Also,the code that is used for weighting documents can't be used for query expansion because the formula for most weighting schemes is different for both purposes. Hence,it needs to be implemented separately.
- → Debian search also currently lacks the feature of spelling suggestions and so when the user makes a spelling mistake while searching, he does not get the required search results and does also not get spelling suggestions .This project also includes implementing spelling suggestions in Debian search.
- → Debian search currently suffers from the keyword spamming problem and so a page with multiple occurrences of a keyword will dominate the search results irrespective of whether it is quite relevant to the search or not and may even rank higher than a relevant document with fewer occurrences of the keyword. DFR schemes like DPH are known to be robust to keyword spamming and so,they will also be implemented which will improve the quality of Debian's search results.
Optional/ Post-Gsoc project implementation:
- In this section I've enlisted an optional idea which I'll work upon either during the summer itself if I finish my project in advance or after GSOC if time does not permit me to do so during the summer.
- → Additional weights can be given to pages (apart from those given by the weighting scheme) based on metrics such as the number of web page hits it gets as the popularity of a page is also a good measure of how relevant it is.
- → This can be done by modifying the weight of a page so as to include scaled versions of each metric along with scaling factors and tuning the scaling factors depending on the importance of each metric.
- → This will also improve Debian's search results and make them dynamic because when users start finding a particular page to be relevant and visit it, the page starts getting additional weight and gets promoted in the ranking.
- → However,this project will require access to the Debian search logs.
Synopsis: The project is about improving Debian's search results and search interface by implementing state of the art and parameter free DFR weighting schemes in Xapian which are known to outperform the current weighting schemes available under Xapian and then deploying them in Debian search. Also, query expansion feature will also be implemented in Debian search and it will be greatly improved by implementing novel query expansion schemes in Xapian which are not currently implemented. Moreover, spelling suggestions feature and resistance to keyword spamming will also be implemented which will greatly improve the search experience for Debian users. As an optional project, the popularity of a page will be estimated by various metrics which will then be used to add to the weight of a of a page thereby making the search results more relevant and dynamic.
Benefits to Debian:
- → Improved search experience because of spelling suggestions on making spelling mistakes:
- a.) For example. on searching for “Omega” , this is what I get on search.debian.org:
http://search.debian.org/cgi-bin/omega?P=omega&HITSPERPAGE=10&DB=en.
- b) However,when I make a spelling mistake and type “Omga”, not only do I not get any results, but I also don't get any suggestions:
http://search.debian.org/cgi-bin/omega?P=omga&HITSPERPAGE=10&DB=en&xP=Zdebian&xFILTERS=--A&xDB=en
- c.) However,on searching for “omga” on Xapian search, I get the following page with the correct spelling suggestion:
- → Improved search experience because of keywords suggestion for query expansion and giving users the ability to mark documents as relevant which also aids query expansion:
- a.) For example, no suggestions here:-
http://search.debian.org/cgi-bin/omega?P=omega&HITSPERPAGE=10&DB=en
- b.) Not only do I get keyword suggestions which I can add to my query via the check boxes,but also the ability to mark a result as relevant by clicking on the check boxes next to them:-
- After marking certain results as relevant and clicking on search .I get keyword suggestions which are based on the documents I marked as relevant.
- → Improved search results because of implementing weighting schemes that are resistance to keyword spamming
- a.) For example,when I search for “xapian” on the Debian mailing list , most of my top results are dominated by commit messages about xapian because “xapian” occurs in them many times. There maybe more relevant documents which don't appear as the top results because of the low frequency of keywords in them:-
http://lists.debian.org/cgi-bin/search?P=xapian&DEFAULTOP=or&sort=relevance&HITSPERPAGE=10
- → No need to tune parameters of weighting schemes as most DFR schemes are parameter free. This will take a huge load off the Debian search developers.
- → It has been proven that DFR schemes ,whether parameter free or not, outperform most of the existing schemes including all the ones currently implemented in Xapian in terms of accuracy. Thus ,implementing DFR in Xapian will definitely improve the search results in Debian. The results can be found in this paper:
- → Implementing query expansion using state of the art weighting schemes in Xapian will help Debian search users get greatly improved suggestions than the ones currently possible with Xapian.
Deliverables:
- → Named DFR schemes in Xapian and deployment of the most suitable one on Debian search based on testing.
- → Keyword suggestions for query expansion in Debian search.
- → Pluggable BM25 and DFR Query expansion schemes in Xapian and deployment of the best scheme on Debian based on testing.
- → Spelling suggestions will be provided in Debian search.
- →The parameter free and resistant to keyword spamming DPH scheme will be deployed on the mailing list as the problem seems to be highly obvious there.
- → [Optional] A postingsource class to add weight to a document based on it's popularity based on metrics obtained from analysis of logs(if I get access to them).
Project schedule:
Community Bonding Period (May 27 – June 16):
- → Get to know the members of the community better.
- → Understand the current Xapian and Omega code used by Debian search and identify regions where new code will have to be written to deploy implementations added to Xapian.
- → Merge DPH and PL2 code into Xapian.
DPH and PL2 integrated into Xapian.
- (Exams from 25th May to 7th June.)
Note:- Implementation of a weighting scheme includes corresponding tests and documentation.
Week 1 (June 17 – June 23) :
→ Implement the DFR ?BeB2. weighting scheme.
→ Implement the DFR ?IfB2 weighting scheme.
Week 2 (June 24 – June 30) :
→ Implement the DFR ?InL2 weighting scheme.
→ Implement DFR ?IneB2_2 weighting scheme.
Week 3 (July 1 – July 7) :
→ Implement the DFR ?IneB2_e weighting scheme.
- → Implement the DFR DLH weighting scheme.
?IneB2_e and DLH ready for merging.
Week 4(July 8 – July 14) :
- → Integrate all DFR weighting schemes written in Xapian in Omega by writing the parsing code.
- → Update Omega's documentation to reflect the newly available weighting schemes,
Code and documentation for DFR schemes ready to be merged in Omega.
Week 5 (July 15 – July 21):
- → Update Omega code used by Debian to reflect the DFR schemes.
- → Start intensive testing of the performance (relevance and speed) of the DFR schemes implemented in Xapian by deploying them on search.debian.org turn by turn to decide the best one to use .
Omega code used by Debian upgraded.
Week 6 (July 22 – July 28):
- → Finish testing of DFR schemes on search.debian.org and decide the best one to use.
- → Deploy selected DFR scheme on search.debian.org .
Improved weighting scheme deployed on Debian search.
Mid-Term evaluation.
Week 7 (July 29 – August 4):
- → Integrate spell checking feature of Xapian on search.debian.org and mailing list search.
- → Deploy the DPH weighting scheme on the Debian mailing list search (for resistance to keyword . spamming.)
Spell checking and resistance to keyword spamming implemented.
Week 8 (August 5 – August 11) :
- → Implement additional statistics required for query expansion with DFR schemes.
- → Design and implement base class for query expansion.
Base class for query expansion implemented.
Week 9 (August 12 – August 18):
- → Implement query expansion using Bo1 model with tests and documentation.
- → Start implementation of Bo2 model of query expansion.
Bo1 model of query expansion implemented.
Week 10 (August 19 – August 25) :
- → Finish implementing query expansion with Bo2 along with tests and documentation.
- → Start implementing KL model of query expansion with tests and documentation.
Bo2 model of query expansion implemented.
Week 11 (August 26 – September 1) :
- → Finish implementation of KL model query of query expansion with tests and documentation.
- → Implement BM25 model of query expansion with tests and documentation.
- → Merge the entire query expansion framework.
KL and BM25 models of query expansion implemented. The entire query expansion framework merged.
Week 12(September 2 -September 9):
- → Implement support for query expansion schemes in Omega.
Query Expansion framework integrated in Omega.
Week 13 (September 9 – September 15) :
- → Update Omega code used by Debian to the one that supports query expansion.
- → Discuss with mentor the methodology for testing query expansion on Debian.
Omega code used by Debian updated to include query expansion schemes.
Week 14 (September 16 – September 22) :
- → Carry out tests to find the most suitable query expansion scheme for Debian.
- → Deploy most suitable query expansion scheme on Debian search .
Most suitable scheme for Query Expansion deployed on Debian.
Final Evaluation.
Exams and other commitments: As stated in my project schedule, I only have exams during the community bonding period from 25th May to 7th June. Other than that ,I have no exams or any other commitments during the entire GSOC timeline.
Other summer plans: NO, I have no plans other than GSOC and don't intend to make any. If I get selected, I'll be one of the lucky few who gets to work for Debian and so, writing code for Debian will be my main priority during the summer and I will be able to devote 40-45 hours a week to it. (7-8 hours on weekdays and 4 hours on weekends.)
Why Debian?: I have been writing code for Xapian for some time now and have added algorithms there that can make search more powerful and can definitely improve search results. I still have a lot of ideas and algorithms that I want to implement based on research in IR that has been done in the past ten-fifteen years or so and the best thing that I can get out of it would be to see those implementations actually work and give better results. So,when I came to know that Debian uses Xapian for search and given the fact that Debian has a massive database of information where the cutting edge weighting and query expansion algorithms can actually make a difference and improve access to information for the entire Debian developer and user community,it was the most natural choice of organization to work for.Also,what's better than contributing in my own small way to the “universal operating system” ?
Are you applying for other projects in SoC? I am applying for OGDF but in case I get into both , I really and sincerely want to work with Debian.