It must prompt the user to enter a query as a ‘bag of words’ where multiple terms can be entered separated by a space



In the previous two development assignments, we have created a process to generate an inverted index based upon the CS 3308 corpus. In the index part 1 assignment we created a process that scans a document corpus and creates an inverted index. In the index part 2 assignment we extended the functionality of the index process by incorporating functionality to:

  • Ignore (not include in the index) a list of stop words
  • Edit tokens as follows:
    • Terms under 2 characters in length are not indexed
    • Terms that contain only numbers are not indexed
    • Terms that begin with a punctuation character are not indexed
  • Integrate the Porter Stemmer code into our index
  • Calculate the tf-idf_{t,d} value for each unique combination of document and term
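
The token rules listed above can be sketched in a few lines of Python. This is only an illustration, not the code from the index part 2 assignment: the names `STOP_WORDS`, `keep_term`, and `filter_tokens` are hypothetical, the stop-word set is a tiny made-up sample, and the `stem` parameter stands in for the Porter Stemmer (it defaults to doing nothing).

```python
import string

# Hypothetical sample; the assignment's actual stop-word list is not shown here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}


def keep_term(term, stop_words=STOP_WORDS):
    """Return True if a token passes the index part 2 filtering rules."""
    if term in stop_words:
        return False  # stop words are ignored
    if len(term) < 2:
        return False  # terms under 2 characters are not indexed
    if term.isdigit():
        return False  # terms that contain only numbers are not indexed
    if term[0] in string.punctuation:
        return False  # terms that begin with punctuation are not indexed
    return True


def filter_tokens(text, stem=lambda t: t):
    """Lowercase, split on whitespace, filter, then stem each surviving token.

    `stem` is a placeholder for the Porter Stemmer; by default it is a no-op.
    """
    return [stem(t) for t in text.lower().split() if keep_term(t)]
```

For example, `filter_tokens("The 2008 home mortgage -rate")` keeps only `home` and `mortgage`. Lowercasing and splitting on whitespace are assumptions made for this sketch; the tokenizer in your own index code may differ.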

In this assignment, we will be using the inverted index that was created by the process developed as part of the index part 2 assignment. In unit 4 we learned about the weighting of terms to improve the relevance of documents found when searching the inverted index. First we learned about calculating the inverse document frequency and, from this, the tf-idf_{t,d} weight. Using the tf-idf_{t,d} weight we were able to calculate both document and query vectors and evaluate them using the formula for cosine similarity.
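
For reference, the standard textbook forms of these quantities are given below. Here N is the number of documents in the corpus, df_t is the number of documents that contain term t, and tf_{t,d} is the number of times t occurs in document d. The base of the logarithm and the exact term-frequency variant are assumptions here; check them against the Unit 4 material.

```latex
\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}
\qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t

\cos(\vec{q}, \vec{d})
  = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
  = \frac{\sum_{t} w_{t,q} \, w_{t,d}}
         {\sqrt{\sum_{t} w_{t,q}^{2}} \; \sqrt{\sum_{t} w_{t,d}^{2}}}
```

In the second formula, w_{t,q} and w_{t,d} are the tf-idf weights of term t in the query and in the document.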

In the development assignment for this unit, we will implement these concepts to create a search engine. Your assignment will be to create a search engine that will allow the user to enter a query of terms that will be processed as a ‘bag of words’ query.

Your search engine must meet the following requirements:

  • It must prompt the user to enter a query as a ‘bag of words’ where multiple terms can be entered separated by a space
  • For each query term entered, your process must determine the tf-idf_{t,d} weight as described in Unit 4
  • Using the query terms, your process must search for each document that contains each of the query terms
  • For each document that contains all of the search terms, your process must calculate the cosine similarity between the query and the document
  • The list of cosine similarity scores must be sorted in descending order from the most similar to the least similar
  • Finally your search process must print out the top 20 documents (or as many as are returned by the search if there are fewer than 20) listing the following statistics for each:
    • The document file name
    • The cosine similarity score for the document
    • The total number of items that were retrieved as candidates (you will only print out the top 20 documents)
    • Example output is provided for a search on the terms ‘home mortgage’
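
The requirements above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the provided Source Code document and not a complete solution: the index is held in plain in-memory dictionaries instead of the index built in part 2, the weights use raw term frequency times log10(N/df), and the function names (`build_vectors`, `search`, `format_results`) are hypothetical. Stop-word removal and stemming of the query are left out for brevity, although a real query should be processed the same way as the indexed documents.

```python
import math
from collections import Counter


def build_vectors(docs):
    """docs: {file_name: text}. Returns ({file_name: {term: weight}}, {term: df})."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # Iterating a Counter yields each distinct term once, so this counts documents.
    doc_freq = Counter(t for c in counts.values() for t in c)
    n_docs = len(docs)
    vectors = {
        name: {t: tf * math.log10(n_docs / doc_freq[t]) for t, tf in c.items()}
        for name, c in counts.items()
    }
    return vectors, doc_freq


def search(query, vectors, doc_freq, top_k=20):
    """Return (top_k results as (file_name, score) pairs, total candidate count)."""
    n_docs = len(vectors)
    q_counts = Counter(query.lower().split())  # bag of words, space separated
    if not q_counts:
        return [], 0
    # tf-idf weight of each query term; unknown terms get weight 0.
    q_vec = {
        t: tf * (math.log10(n_docs / doc_freq[t]) if doc_freq.get(t) else 0.0)
        for t, tf in q_counts.items()
    }
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    results = []
    for name, d_vec in vectors.items():
        # Only documents that contain ALL query terms are candidates.
        if not all(t in d_vec for t in q_vec):
            continue
        dot = sum(w * d_vec[t] for t, w in q_vec.items())
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        results.append((name, score))
    results.sort(key=lambda r: r[1], reverse=True)  # most similar first
    return results[:top_k], len(results)


def format_results(results, total):
    """Build the report lines: file name, cosine score, and candidate total."""
    lines = [f"{name}\t{score:.4f}" for name, score in results]
    lines.append(f"Total candidate documents retrieved: {total}")
    return "\n".join(lines)
```

In an interactive program the query would come from something like `input("Enter search terms: ")`, and the string returned by `format_results` would be printed. Note that a term appearing in every document gets a weight of zero under this weighting, so it does not contribute to the score.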

The Source Code document, which contains code for a search engine that meets many of these requirements, is provided for you as an example. This code does NOT meet all of the requirements of this assignment. Further, there are key areas of the code that are missing. You are welcome to use this example code as a baseline; however, you must complete any missing functionality as required by the assignment.
