Beginner Python assignment
In this project you are going to index a set of documents in an open-source Python search engine
called tinysearch, devise a set of test queries, and evaluate the system on those queries.
Indexing the Documents
• Download the search engine and the corpus from OWL->Resources->Assignment2_Files
(file name: tinysearch.zip) to your machine and index the documents.
o Note that this corpus contains dumped Wikipedia documents and it is a few years old.
o There are instructions concerning these steps further down in this document.
This corpus contains some of the Wikipedia documents that can be used for mirroring, personal use,
informal backups, off-line use or database queries. All text content is licensed under the GNU
Free Documentation License (GFDL).
Topics and Questions
In this part of the assignment, you are going to think of an application domain (i.e. a subject) that
is of interest to you. For example, you could choose health, politics, sport, geography, music (any
kind), etc. Now, create twenty queries in your chosen domain. For example:
Q1: species and dogs
Q2: Akita dogs
Note: These are just examples chosen by a student who was interested in dogs. You can choose
any subject, provided that:
• it is covered by the documents you are using;
• you can think of some quite difficult queries on your chosen topic.
Test the performance of the provided search engine using TF-IDF by applying the following steps:
1. Run the queries, as prepared above, through the system and collect the first ten files (or
so) returned for each.
2. Compute precision and recall at the following levels of n (where n is the number of
documents considered): n=5, n=10.
3. To do this, for each query you need to look at (for example) the first ten results (i.e. files)
returned and see for each file whether it is Relevant or Not Relevant. A file is relevant if
it contains the answer to your query. It does not matter where in the file the answer
occurs as long as it is present somewhere.
Note that this is not as easy as it sounds since there will be occasions when you are not sure. You
need to make a note of the rationale for making your final decision in cases of doubt.
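As an illustration of the precision calculation, here is a minimal Python sketch. It is not part of tinysearch; the function name precision_at_n and the sample judgments are hypothetical, and it assumes you have recorded your Relevant / Not Relevant decisions for one query as a list of booleans in rank order. It divides by n even if fewer than n files were returned, which is the usual convention.

```python
def precision_at_n(judgments, n):
    """Return the fraction of the first n results that were judged Relevant.

    judgments: list of booleans in rank order (True = Relevant).
    """
    top = judgments[:n]
    # sum() counts the True values; dividing by n treats missing results as misses.
    return sum(top) / n


# Hypothetical judgments for the first ten files returned for one query.
judgments = [True, False, True, True, False, False, True, False, False, False]

p5 = precision_at_n(judgments, 5)    # 3 relevant in the first 5
p10 = precision_at_n(judgments, 10)  # 4 relevant in the first 10
print(f"P@5 = {p5:.2f}, P@10 = {p10:.2f}")
```

With these sample judgments, three of the first five files and four of the first ten are relevant, so precision is 3/5 at n=5 and 4/10 at n=10.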
Computing recall poses a problem in that we need to know for each query all the correct answers
in the collection. Strictly, we cannot know that without inspecting every document in the
collection. At TREC they use a pooling method as discussed in the lecture. To get around the
problem here, simply check the first n documents (n = 20) returned for each query. Count the
number of correct responses there and assume that these are all the correct responses in the
collection. Then use this information to compute recall at n=5 and n=10 as above.
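The recall shortcut described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name recall_at_n and the sample ranks are hypothetical, the judgments are booleans for the first 20 files in rank order, and returning 0.0 when no relevant file is found in the top 20 is a choice made here (recall is strictly undefined in that case).

```python
def recall_at_n(judgments_top20, n, pool_size=20):
    """Return recall at n, treating the relevant files in the top pool_size
    results as the complete set of correct answers in the collection.

    judgments_top20: list of booleans in rank order (True = Relevant).
    """
    total_relevant = sum(judgments_top20[:pool_size])
    if total_relevant == 0:
        # Assumption: report 0.0 when the pool contains no relevant files.
        return 0.0
    return sum(judgments_top20[:n]) / total_relevant


# Hypothetical example: relevant files found at these ranks in the top 20.
relevant_ranks = {1, 3, 4, 7, 12, 15}
judgments = [rank in relevant_ranks for rank in range(1, 21)]

r5 = recall_at_n(judgments, 5)    # 3 of the 6 pooled relevant files
r10 = recall_at_n(judgments, 10)  # 4 of the 6 pooled relevant files
print(f"R@5 = {r5:.2f}, R@10 = {r10:.2f}")
```

Here six relevant files are found in the top 20, so recall is 3/6 at n=5 and 4/6 at n=10.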
Write up your results in a short report USING THE TEMPLATE SUPPLIED with the following
headings exactly as shown in the template:
1. Cover page: includes your (formal) name, ID and the program you are currently enrolled in.
2. Topic and Queries
• What topic you chose; how the queries were devised.
3. Indexing the Documents
• How was this done?
• What problems were encountered (if any) and how were they solved?
4. TF-IDF Performance
• Method - short text outlining what you did.
• Results - a table summarizing the numerical results as above (review assignment
2 - appendix 1).
• Discussion - a short description of what the results show (was TF-IDF always
better, always worse or sometimes better/worse?), any interesting problem
cases, any technical problems encountered and so on.
Report Appendix 1
• Include the queries you used for your TF-IDF evaluation and the IDs of the right answers found
for each (if any).
• Example (this is just a sample):
Num   Query         IDs of Answers
1     hot chicken   5003
20    chicken       0