diff --git a/README.md b/README.md
index b1d5431c1e05574a8bb2ae428388e6970dea1b1a..e5218038ab4cdbd5c1378e76d4ed26f951e24284 100644
--- a/README.md
+++ b/README.md
@@ -15,37 +15,37 @@ Reza Sajjadi

### How it Works

### Crawler
Our crawler works as follows: URLs to crawl, URLs that have already been crawled, and disallowed domains are stored in HashSets. We use HashSets because they enforce uniqueness (we do not want to crawl the same URL multiple times) and support add, remove, and contains operations in O(1) time. Seed URLs are added either via command-line arguments or from a specified file, and are loaded into the HashSet. From there, until we run out of URLs to crawl or hit our page-crawl limit, we crawl each page one at a time.

The crawl algorithm works as follows: the next URL is removed from the HashSet, and we check whether its domain is known to allow crawling. If the domain has not been crawled before, we check the website's robots.txt file to see which pages are allowed to be crawled and save those preferences so we can respect them later. If the domain has been crawled before, we simply check whether it is in the allowed-domains HashSet. If it is not allowed, the URL is skipped. If it is allowed, we update the wait delay and write to a file as much data as allowed per website (default 1KB). After a page is finished being crawled (and we are below the crawled-page limit), the crawler fetches the next page and the process repeats.

-#### Checker
+### Checker
Our checker uses two different methods to assign confidence points. The confidence score is 0 if the checker did not find any problem and 100 if it is completely skeptical. The first method uses a state machine and the second uses an n-grams-inspired implementation. The final score is a weighted sum of the scores produced by these two methods.

-##### Typo corrector
+#### Typo corrector
The typo corrector computes the distance of a word to the dictionary (supposedly provided by the crawler). Here we used the small dictionary from homework 2 and replace a misspelled word with the closest word, subject to some conditions (a minimal sketch of this distance computation appears at the end of this Checker section).

-##### State Machine Checker
+#### State Machine Checker
For the state machine to work well, we first need to update the grammar and the role of each word manually; this step will be automated to some extent in the next milestones. Tokens are provided in `CheckerCorrector/SQLite/mydatabase.db`, and the basic graph is provided by `CheckerCorrector/DirectedGraph/BasicGraph.java`. A sentence first goes through the typo checker and is updated if needed (which also affects the confidence score). The sentence is then tokenized, and using the provided graph we check whether it follows the correct format; for each miss on any edge of the graph, a penalty is added to the confidence score.

-##### n-Grams checker
+#### n-Grams checker
This checker uses the crawled data and produces a score by summing the n-gram probabilities of the phrases in a sentence. To store the crawled data, we hash each phrase with SHA-256 and store the result in `CheckerCorrector/SQLite/hash_database.db`.
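To make the hashing step concrete, here is a minimal sketch in Java, assuming a hypothetical `ngrams(hash TEXT, count INTEGER)` table inside `hash_database.db`; the table name, schema, and helper names are illustrative assumptions, not the project's actual code. An SQLite JDBC driver (e.g. org.xerial's sqlite-jdbc) must be on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NGramLookup {

    /** Hex-encoded SHA-256 digest of a normalized (lowercased) phrase. */
    static String sha256Hex(String phrase) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(phrase.toLowerCase().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    /** Look up how often a phrase occurred in the crawled data (0 if unseen). */
    static long lookupCount(Connection db, String phrase) throws Exception {
        // Assumed schema: ngrams(hash TEXT PRIMARY KEY, count INTEGER)
        try (PreparedStatement st =
                db.prepareStatement("SELECT count FROM ngrams WHERE hash = ?")) {
            st.setString(1, sha256Hex(phrase));
            try (ResultSet rs = st.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:sqlite:CheckerCorrector/SQLite/hash_database.db")) {
            System.out.println(lookupCount(db, "the quick brown"));
        }
    }
}
```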
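And here is the distance sketch promised in the Typo corrector section above, using standard Levenshtein edit distance. The dictionary file name `dictionary.txt` and the `maxDist` threshold are illustrative assumptions; the project's actual replacement conditions may differ.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TypoCorrector {

    /** Classic Levenshtein edit distance via dynamic programming. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Return the closest dictionary word, or the word itself if nothing is within maxDist. */
    static String correct(String word, List<String> dictionary, int maxDist) {
        String best = word;
        int bestDist = maxDist + 1;
        for (String candidate : dictionary) {
            int dist = editDistance(word, candidate);
            if (dist < bestDist) {
                bestDist = dist;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        // "dictionary.txt" stands in for the small homework 2 dictionary.
        List<String> dict = Files.readAllLines(Paths.get("dictionary.txt"));
        System.out.println(correct("helo", dict, 2)); // e.g. "hello" if present
    }
}
```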
#### Proof of Effectiveness
To show the effectiveness of our tool, we used ChatGPT to write a script that uses a third-party Python library to score the exact same sentences. These can be found in the `CheckerCorrector/samples/` directory. Both the JSON generated by our tool and the JSON generated by the third-party library are available.

-#### Corrector
+### Corrector
Our corrector uses three methods to correct a sentence. The first is the typo corrector, which fixes typos. The second is the state machine, which tries to fix the structure of the words in the sentence based on their roles. The third is our similarity method, which follows an idea similar to n-grams; it tries to fix verb tenses and also helps the state machine give better suggestions for replacements or insertions.

-##### Type Corrector
+#### Typo Corrector
The typo corrector is the same as in the checker.

-##### State Machine Checker
+#### State Machine Corrector
The state machine suggests possible corrections to the structure of the sentence. Corrections to the typo corrector's output are based on the most similar path through the state machine. To find a similar path, we first run a DFS on the graph starting from the first token and store all possible paths. Then, based on the most similar path, we decide whether to change, delete, or add a token, and suggest a token with a similar role. The correction at this point is limited by the complexity of the state machine, but it will be improved in the next milestone.

-##### Similarity method
+#### Similarity method
This method also runs a DFS on the crawled data and stores in a file all the words (nodes) found within a depth of two. Instead of running the DFS on tokens, we run it on the words themselves. This is mostly used to detect verb tenses, prepositions, or any other composition that can be observed. In other words, we look for words that occur close to the given word in our crawled data. To do so, the crawled data must be preprocessed and the neighbors of each word stored in the database.

-#### Learning Proceess and Database Updating
+### Learning Process and Database Updating
For the checker/corrector to work properly, it needs to process the crawled data. This processing consists of different stages. For n-grams, the process is to compute the hash of each new phrase and add up the number of its occurrences. The other update concerns the tokens: the tool needs to somehow learn the role of each word in a sentence. The role can be given to it directly, which happens when crawled data from a dictionary is provided, or the tool can learn it from raw data alone. Raw data can cause some misprocessing, since the tool then learns the tokens by itself. That is why we provide an option where the user can decide, through a GUI, whether the role the tool predicted for a token is valid.

### Implemented Features