In the previous two development assignments, we created a process to generate an inverted index from the CS 3308 corpus. In the index part 1 assignment, we created a process that scans a document corpus and creates an inverted index. In the index part 2 assignment, we extended the index process by incorporating functionality to:
In this assignment, we will be using the inverted index that was created by the process developed as part of the index part 2 assignment. In unit 4, we learned about the weighting of terms to improve the relevance of documents found when searching the inverted index. First, we learned how to calculate the inverse document frequency and, from it, the tf-idf weight of term t in document d (tf-idf_{t,d}). Using these weights, we were able to calculate both document and query vectors and compare them using the formula for cosine similarity.
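As a reminder of how these pieces fit together, here is a minimal Python sketch of the weighting and comparison steps. It is not the provided example code. The function names (`tf_idf_vector`, `cosine_similarity`), the dictionary-based vectors, and the choice of raw term frequency with a base-10 logarithm for idf are all assumptions made for illustration; your implementation may reasonably use a different variant.

```python
import math


def tf_idf_vector(term_counts, doc_freqs, num_docs):
    """Build a sparse tf-idf vector as a dict of term -> weight.

    Assumed variant: weight = tf * log10(N / df), with raw term frequency.
    term_counts: term -> occurrences in this document (or query)
    doc_freqs:   term -> number of documents containing the term
    num_docs:    total number of documents in the corpus (N)
    """
    vector = {}
    for term, tf in term_counts.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # term never appears in the corpus, so it has no weight
        vector[term] = tf * math.log10(num_docs / df)
    return vector


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)
```

Note that a term appearing in every document receives an idf of zero under this variant, so it contributes nothing to the similarity score.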
In the development assignment for this unit, we will implement these concepts to create a search engine. Your assignment is to create a search engine that allows the user to enter a query of terms, which will be processed as a ‘bag of words’ query.
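To illustrate what a ‘bag of words’ query means in practice, the sketch below turns a query string into an unordered collection of term counts. The function name `query_to_bag` and the simple tokenization rule (lowercase runs of letters and digits) are assumptions for illustration only. In your search engine, query terms should receive the same preprocessing that your index process applied to the documents, which this sketch does not attempt.

```python
import re
from collections import Counter


def query_to_bag(query):
    """Convert a query string into a bag of words (term -> count).

    Word order is discarded; only which terms occur, and how often, is kept.
    Tokenization here is a simplifying assumption: lowercase alphanumeric runs.
    """
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return Counter(tokens)


# Two queries with the same terms in a different order produce the same bag.
same_bag = query_to_bag("inverted index search") == query_to_bag("search index inverted")
```

The resulting counts can serve as the term frequencies from which a query vector is built.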
Your search engine must meet the following requirements:
The Source Code document contains code for a search engine that meets many of these requirements and is provided for you as an example. This code does NOT meet all of the requirements of this assignment. Further, there are key areas of the code that are missing. You are welcome to use this example code as a baseline; however, you must complete any missing functionality as required by the assignment.