HW5 - Search Engine#
Overview#
Learning objective: Implement specialized data types with Python classes for tf-idf information retrieval.
Files and Tests#
search_engine.py
is the file containing the SearchEngine class as well as a command-line interface for using the SearchEngine.
document.py
is the file containing the Document class, which represents a single document in the SearchEngine.
hw5_test.py
is the file for you to put your own tests in. The Run button executes this program.
cse163_utils.py
is a helper file that has code to help you test your code.
run_tests.py
is a helper file that will run all of the "unit tests" in a way that is more transparent and debuggable.
You must submit your work as your teacher requires. Ask questions in class.
Submit the following Python files along with one or more "test corpus sub-directories":
search_engine.py
document.py
hw5_test.py
main.py
<my_test_corpus>
You should create subdirectories with test files in them and use these subdirectories in your tests. Since your tests need to run on the teacher's machine, be sure to submit your new subdirectories.
Please do NOT submit the following subdirectories. You may assume they exist on the teacher's machine:
small_text
test
test_corpus
Submit Subdirectories
Try to break your search engine.
Create test files that target the types of tests you want to run. Their content should be simple enough to verify by hand.
Time Management
This assessment is very information-dense. Plan ahead by taking notes of what you need to do and where you will need to do it. It will probably help to get a global view of the entire assessment by reading ALL of the components before starting any coding; we expect students to spend at least 40 minutes reading, synthesizing, and planning before writing any code.
Video on TF-IDF
Watch the video by Hunter Schafer.
Deliverables#
You are to create two different classes, Document and SearchEngine, that together implement a search engine.
You will also create a main.py file that provides console-based user interaction. It prompts the user for a directory to use as the corpus of documents. The user may then enter any number of queries, which are answered by the SearchEngine. The results of each query are printed to the console. If the user enters an empty string, the program ends.
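As a rough, hedged sketch of that interaction loop (it assumes the SearchEngine class described below lives in search_engine.py and mirrors the prompts shown under Main Output):

from search_engine import SearchEngine


def main():
    # Build the engine once from the directory the user names.
    directory = input('Please enter the name of a directory: ')
    print('Building Search Engine...')
    engine = SearchEngine(directory)
    # Keep prompting until the user enters an empty string.
    query = input('Enter a search term to query (Enter=Quit): ')
    while query != '':
        print(f"Displaying results for '{query}':")
        results = engine.search(query)
        if not results:
            print('No results :(')
        else:
            # Show only the first 10 paths, numbered from 1.
            for i, path in enumerate(results[:10], start=1):
                print(f'{i}. {path}')
        query = input('Enter a search term to query (Enter=Quit): ')
    print('Thank you for searching.')


if __name__ == '__main__':
    main()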
Document will have an initializer and at least one public method:
doc = Document(relative_path): example code that creates a new Document initialized to an actual file.
term_frequency(term): returns the TF value for the term for this document.
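The internals are up to you, but here is a minimal sketch of the Document shape. It assumes the normalize_token helper described under Normalizing Tokens and a fractional definition of term frequency; check the spec and video for the exact definition.

from cse163_utils import normalize_token


class Document:
    '''Represents a single text file and its term frequencies.'''

    def __init__(self, relative_path):
        '''Reads the file at relative_path and caches its word counts
        so they are computed only once.'''
        self._path = relative_path
        self._word_counts = {}
        self._total_words = 0
        with open(relative_path) as f:
            tokens = f.read().split()
        for token in tokens:
            word = normalize_token(token)
            if word:  # skip tokens that normalize to nothing
                self._word_counts[word] = self._word_counts.get(word, 0) + 1
                self._total_words += 1

    def term_frequency(self, term):
        '''Returns the term frequency of term in this document
        (0 if the term does not appear).'''
        # Note: an empty file would make this divide by zero; that is an
        # edge case you will want to think about and test.
        word = normalize_token(term)
        return self._word_counts.get(word, 0) / self._total_words

    def get_path(self):
        '''Returns the path this Document was built from.'''
        return self._path

    def get_words(self):
        '''Returns a list of the unique (normalized) words in this document.'''
        return list(self._word_counts.keys())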
SearchEngine will have an initializer and only one public method:
se = SearchEngine(relative_path): example code that creates a new SearchEngine initialized to a corpus of files found in the directory relative_path.
search(query): returns a list of document paths, relative to the current working directory. The list is sorted by relevance (TF-IDF).
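Likewise, a hedged sketch of the SearchEngine shape, again assuming normalize_token, the Document class above, and an ln-based IDF (confirm the exact formula against the spec and video):

import math
import os

from cse163_utils import normalize_token
from document import Document


class SearchEngine:
    '''Represents a corpus of Documents and answers TF-IDF ranked queries.'''

    def __init__(self, relative_path):
        '''Builds a Document for every file in relative_path and an inverted
        index from each word to the Documents that contain it.'''
        self._index = {}  # word -> list of Documents containing that word
        self._num_docs = 0
        # assumes the directory contains only text files, like the provided corpora
        for name in os.listdir(relative_path):
            doc = Document(os.path.join(relative_path, name))
            self._num_docs += 1
            for word in doc.get_words():
                self._index.setdefault(word, []).append(doc)

    def _calculate_idf(self, term):
        '''Returns the inverse document frequency of term (0 if it appears in
        no document). The ln-based formula here is an assumption; use the one
        given in the spec.'''
        word = normalize_token(term)
        if word not in self._index:
            return 0
        return math.log(self._num_docs / len(self._index[word]))

    def search(self, query):
        '''Returns document paths sorted from most to least relevant
        (summed TF-IDF over all terms in the query).'''
        scores = {}  # path -> summed TF-IDF score
        for term in query.split():
            word = normalize_token(term)
            idf = self._calculate_idf(word)
            for doc in self._index.get(word, []):
                score = doc.term_frequency(word) * idf
                scores[doc.get_path()] = scores.get(doc.get_path(), 0) + score
        # Highest score first; an empty dict gives an empty list (no results).
        return sorted(scores, key=scores.get, reverse=True)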
Main Output#
Here is an example run of main. Note that your program should output only the first 10 document paths even when the SearchEngine returns a much longer list of results.
Please enter the name of a directory: small_text
Building Search Engine...
Enter a search term to query (Enter=Quit): Elvis
Displaying results for 'Elvis':
1. small_text/Viacom - Wikipedia.txt
2. small_text/Federal Bureau of Investigation - Wikipedia.txt
3. small_text/Black or White - Wikipedia.txt
4. small_text/List of songs recorded by The Jackson 5 - Wikipedia.txt
5. small_text/Michael Jackson - Wikipedia.txt
6. small_text/Game of Thrones - Wikipedia.txt
Enter a search term to query (Enter=Quit): Santa Claus
Displaying results for 'Santa Claus':
1. small_text/Jackson 5 Christmas Album - Wikipedia.txt
2. small_text/ABC (The Jackson 5 album) - Wikipedia.txt
3. small_text/Mountain View, California - Wikipedia.txt
4. small_text/List of songs recorded by The Jackson 5 - Wikipedia.txt
5. small_text/Mandalay Bay - Wikipedia.txt
6. small_text/Michael Jackson - Wikipedia.txt
7. small_text/Federal Bureau of Investigation - Wikipedia.txt
8. small_text/Ubisoft - Wikipedia.txt
9. small_text/Viacom - Wikipedia.txt
10. small_text/Traditionalist conservatism - Wikipedia.txt
Enter a search term to query (Enter=Quit): notinanydocument
Displaying results for 'notinanydocument':
No results :(
Enter a search term to query (Enter=Quit): Horrible Bosses in Schools
Displaying results for 'Horrible Bosses in Schools':
1. small_text/Mountain View, California - Wikipedia.txt
2. small_text/Hal Abelson - Wikipedia.txt
3. small_text/Living with Michael Jackson - Wikipedia.txt
4. small_text/Jeb Bush - Wikipedia.txt
5. small_text/UNCF - Wikipedia.txt
6. small_text/VH1 - Wikipedia.txt
7. small_text/Hans-Hermann Hoppe - Wikipedia.txt
8. small_text/William McKinley - Wikipedia.txt
9. small_text/Steve Forbes - Wikipedia.txt
10. small_text/Nintendo - Wikipedia.txt
Enter a search term to query (Enter=Quit):
Thank you for searching.
Context#
A search engine is an algorithm that takes a query and retrieves the most relevant documents for that query. In order to identify the most relevant documents, our search engine will use term frequency-inverse document frequency (tf-idf), a statistic for determining the relevance of a term to each document in a corpus of many documents.
The tf-idf statistic consists of two components: term frequency and inverse document frequency. Term frequency counts the number of times that a term appears in a document (such as a single Wikipedia page). If we were to use only the term frequency to determine the relevance of a term to each document, our search results might not be helpful, since most documents contain many common words such as "the" or "a". In order to down-weight these common terms, the document frequency measures how many documents in the corpus a term appears in. The tf-idf statistic takes a term and a document and returns the term frequency divided by the document frequency.
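The exact formula your program must use is defined by the spec and the video, but as a hedged illustration of the two components, here is a toy calculation using one common formulation (a fractional term frequency and an ln-based inverse document frequency):

import math

# A toy corpus of three 'documents', each just a list of words.
corpus = [['the', 'cat', 'sat'],
          ['the', 'dog', 'sat', 'sat'],
          ['the', 'dog']]


def term_frequency(term, doc):
    # Fraction of the document's words that are this term.
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, corpus):
    # One common form: ln(total documents / documents containing the term).
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_with_term)


# 'the' appears in every document, so its IDF is ln(3/3) = 0 and its tf-idf
# is 0 everywhere. 'dog' appears in 2 of the 3 documents, so it carries weight:
# tf-idf('dog', doc 1) = (1/4) * ln(3/2), roughly 0.10.
print(term_frequency('dog', corpus[1]) * inverse_document_frequency('dog', corpus))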
In this assessment, weāll implement two classes: a Document
class to represent individual web pages and a SearchEngine
class that aggregates the corpus of all Document
objects. The SearchEngine
builds on the Document
, so be sure to fully complete the Document
class before moving onto the SearchEngine
.
Normalizing Tokens#
In this project we need to remove punctuation so that all tokens that are semantically identical are recognized as "exactly" identical. normalize_token is a helper method found in the cse163_utils.py file. For example:
normalize_token('Hello!') == 'hello'
Furthermore, the handout talks about reading HTML files, but you'll discover that in this project there are no HTML files. Instead, I've already stripped out all the <tags> and <script> elements for you. See the HTML Stripping section below for a discussion of that.
The process of removing punctuation is called normalization. A quick and easy way to remove punctuation is to use a Regular Expression (see the re module documentation for more). Regular Expressions are used all over Computer Science and you would benefit from learning them. But they can get complicated, and they are out of scope for this class and project. So, you get the code for free. You're welcome! Just use the code that is found in cse163_utils.py.
from cse163_utils import normalize_token
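A quick sanity check of the helper (assuming it lowercases and strips punctuation, as described above):

from cse163_utils import normalize_token

# Both tokens should normalize to the same string.
print(normalize_token('Hello!'))                               # expected: 'hello'
print(normalize_token('Hello!') == normalize_token('hello'))   # expected: True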
Writing Tests#
You need to write your own tests. The tests should be simple enough that you can calculate the expected values by hand. Do not compute the expected values by running your own code and assuming its output is correct.
Your tests need to be commented so that I can understand them. Describe what you're testing.
Develop "enough" tests that you feel confident your code is working correctly. Your tests cannot be a copy of the provided Unit Tests, although you can examine those to see what kinds of things you might test. I'm not claiming that the Unit Tests in the test directory are easy to read, and I'm not saying those tests exemplify good style. They are tests that UW provided, and I used them "out of the box."
IMPORTANT: hw5_test.py needs to follow the Main-Pattern.
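If you are unsure what the Main-Pattern looks like, the usual shape is a set of test functions called from a main function that is guarded by the __name__ check. The test function names below are just placeholders:

def test_document():
    # your Document tests go here (placeholder name)
    pass


def test_search_engine():
    # your SearchEngine tests go here (placeholder name)
    pass


def main():
    test_document()
    test_search_engine()


if __name__ == '__main__':
    main()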
Test Methods#
Write many test methods. For example: do NOT write a single method for all of Document. You should have at least one test method for each public method on the object.
For example, Document has: get_path, get_words, and term_frequency.
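For instance, a single focused test method for term_frequency might look like the sketch below. The corpus file and expected values are hypothetical; use a file whose counts you have verified by hand. It also assumes the fractional TF definition and that assert_equals comes from the course helper file.

from cse163_utils import assert_equals
from document import Document


def test_term_frequency():
    # Hypothetical file my_test_corpus/doc1.txt containing exactly:
    #   the cat and the dog
    doc = Document('my_test_corpus/doc1.txt')
    # 'the' is 2 of the 5 words, so the expected TF is 2/5.
    assert_equals(2 / 5, doc.term_frequency('the'))
    # A word that never appears should have a TF of 0.
    assert_equals(0, doc.term_frequency('zebra'))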
And there should be a test method for every "important" private method on the object. For example, SearchEngine has _calculate_idf.
I've provided you with one fat Unit Test in Replit. It actually runs a bunch of tests in the test directory using the files in the test_corpus directory. This is not the preferred kind of Unit Test to have. It is provided to you as a Safety Check only. You need to write your own tests. The provided tests do not test every case!
You can get details of the Unit Test failures by updating .replit to run the file run_tests.py.
These tests are a gift to help you get extra verification that your code is working. More tests will be run during grading.
Do Not Change…#
Do not add, change or remove anything from the files in these directories. Furthermore, do not add files to these directories:
test
test_corpus
small_text
Do not change cse163_utils.py. Do NOT add any methods to this file either!
Normalize Paths#
Due to how os.path.join works across operating systems, search results need to be normalized. Use the following pattern in your tests to ensure that your expected paths equal your actual paths no matter where your tests run. If you do not use this pattern, your tests might fail on my machine.
# assert_equals is also provided by the course helper file
from cse163_utils import assert_equals, normalize_paths
from search_engine import SearchEngine


def test_search():
    '''
    This method is an Example Test showing how to normalize paths.
    Do not include this test as one of your own!
    '''
    engine = SearchEngine('directory')
    expected = ['directory/doc1.txt', 'directory/doc2.txt']
    actual = engine.search('my query')
    # Please normalize the paths!!
    expected = normalize_paths(expected)
    actual = normalize_paths(actual)
    assert_equals(expected, actual)
Important Tips#
Testing Output#
Use python run_tests.py to run the suite of tests "manually" so that you can get more details on any failures. Note that if your code fails one of these provided Unit Tests, then your own Unit Tests are not thorough enough. Update them!!
Performance#
Performance is an important part of this project. You should implement your project in ways that limit the processing. Don't REprocess anything. You should cache values/data in the Document and SearchEngine classes to avoid expensive recalculations. Of course, a search will still require you to gather TF-IDF values across all the documents that contain any term in the query.
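As a hedged illustration of the difference, compare re-reading the file on every call with counting once up front (the class and function names here are made up):

# Slow: re-reads and re-counts the whole file on every call.
def slow_term_frequency(path, term):
    words = open(path).read().split()
    return words.count(term) / len(words)


# Fast: read and count once in the initializer; later calls are dict lookups.
class CachedDocument:
    def __init__(self, path):
        words = open(path).read().split()
        self._total = len(words)
        self._counts = {}
        for word in words:
            self._counts[word] = self._counts.get(word, 0) + 1

    def term_frequency(self, term):
        return self._counts.get(term, 0) / self._total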
When you test whether a collection contains a value, be sure the check is a fast operation. For example, the following is slow:
# our example needs a long list
my_list = [ n**2 for n in range(200000)]
x = -1
if x in my_list:
    print('Well that "in" operation was slow!')
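The same membership test is fast when the data is stored in a set (or as the keys of a dictionary), because lookups are hash-based rather than a linear scan:

# the same data in a set makes the 'in' check a fast hash lookup
my_set = { n**2 for n in range(200000)}
x = -1
if x not in my_set:
    print('That "in" operation was fast!')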
You should manually try out your project with the small_text directory. Do this by running main.py.
If your implementation fails to process this directory in a reasonable amount of time, you have a flaw. You will not pass the project if you cannot successfully do a search on the small_text directory.
Normalize#
The Unit Tests in Replit require that you normalize your terms in the Document method term_frequency.
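In other words (using a hypothetical file path), the following two calls should return the same value, because term_frequency should normalize its argument before looking it up:

from document import Document

doc = Document('test_corpus/doc1.txt')   # hypothetical path
print(doc.term_frequency('DOG!') == doc.term_frequency('dog'))   # expected: True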
Test Document Thoroughly#
Consider edge cases. Can you think of any?
Test term_frequency on at least two different words and documents where you can easily calculate the values manually.
Search with UnNormalized! terms.
Sorting Lists for comparison#
When you use assert_equals on two lists, the assert requires that the lists be in the same order. If you don't care about the order of the lists, one way to verify that they are identical is to first sort the lists. Note that sometimes the order is important.
# This test doesn't care about the order of the items.
# This could be true for Document::get_words
assert_equals(sorted(list1), sorted(list2))
Commenting#
Emphasis will be put on two things:
WHY comments
Class doc-strings
Some reminders:
You need to thoroughly comment document.py, main.py and search_engine.py.
Every function needs a doc-string comment.
Every class needs a doc-string.
You should add some comments to hw5_test.py.
HTML Stripping#
In prior years, students were given HTML files rather than text files. This section is a discussion about stripping HTML out of the documents, which has already been done for you.
In Replit, you are provided with two folders of documents. When you run main.py, you should start off by crawling the files in test_corpus or a folder you created with your own individual set of files. Eventually, but not at the start, you will be able to quickly and effectively search small_text. Give it a shot before you submit your project!
The small_text folder is a subset of Wikipedia articles with the HTML stripped away to reveal only the content. Stripping out the HTML tags has some pros and cons.
Pros of Stripping:
The developer can proceed without the distraction of how the code interferes with the solution.
The indexing solution is not "confused" with needless code (the keywords and the non-English tag names).
The size of the content is reduced.
Cons of Stripping:
The indexing solution actually works quite nicely on the HTML directly (with a few exceptions).
The text has seemingly random spacing and an unusual occurrence of the ^ character.
Viewing the content is unappealing because all the nice formatting is lost. :-(
If you want to use the HTML version (small_wiki), you may download those files here.
Note: The curious developer may be interested to know that small_text was created by using BeautifulSoup to strip the HTML files of their HTML.
The name of the folder small_text may be a misnomer because it is 2.5 MB in size. That's arguably large. However, there is a much larger corpus you can choose to use:
Access Note
The Google shares require that you use your NSD Credentials.
Due to their size, these Large Wikipedia folders should NOT be uploaded to Replit. Instead, you are encouraged to download to your local computer and run HW5 locally. If your code is written efficiently, it will work if given a little time. If your code is not written efficiently, patience will not suffice. :-)