noteLabel = new JLabel("Rejection of suggestions with * may result in a new set of suggestions regardless of other choices.");
noteLabel = new JLabel("Suggestions with ** have the highest priority and then *. Rejection of a suggestion with higher priority neutralize suggestions with lower priority.");
@@ -18,6 +18,9 @@ Our crawler works as follows: URLs to crawl, URLs that have been crawl, and diss
#### Checker
Our checker uses two different methods to assign confidence points. A confidence point is 0 if the checker found no problem and 100 if it is completely skeptical. The first method uses a state machine and the second uses an n-gram-inspired implementation. The final score is a weighted sum of the scores produced by these two methods.
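A minimal sketch of the weighted sum, assuming both methods report a score between 0 and 100. The class name, method name, and the single-weight parameterization are hypothetical and not taken from the project:

```java
// Hypothetical helper: the names and the weighting scheme are illustrative only.
final class ConfidenceScore {
    /**
     * Combines two scores in [0, 100]. stateMachineWeight is the share given to
     * the state machine score; the n-gram score receives the remainder.
     */
    static double combine(double stateMachineScore, double nGramScore, double stateMachineWeight) {
        if (stateMachineWeight < 0.0 || stateMachineWeight > 1.0) {
            throw new IllegalArgumentException("weight must be in [0, 1]");
        }
        double combined = stateMachineWeight * stateMachineScore
                + (1.0 - stateMachineWeight) * nGramScore;
        // Clamp so the result stays on the 0 (no problem) to 100 (fully skeptical) scale.
        return Math.max(0.0, Math.min(100.0, combined));
    }
}
```

With equal weights, `ConfidenceScore.combine(40.0, 80.0, 0.5)` yields 60.0.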
##### Typo corrector
The typo corrector calculates the distance between a word and the entries of a dictionary (which is supposed to be provided by the crawler). For now we use the small dictionary from homework2, and replace a misspelled word with the closest dictionary word when certain conditions are met.
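The idea can be sketched as follows. The distance metric is not named above, so this example assumes Levenshtein edit distance, and it stands in for the unspecified replacement condition with a hypothetical `maxDistance` threshold:

```java
import java.util.List;

// Illustrative sketch; class, method, and parameter names are assumptions.
final class TypoCorrectorSketch {
    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest dictionary word, or the input if nothing is close enough. */
    static String correct(String word, List<String> dictionary, int maxDistance) {
        String best = word;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int d = editDistance(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return bestDistance <= maxDistance ? best : word;
    }
}
```

For example, `TypoCorrectorSketch.correct("homewrok", List.of("homework", "house"), 2)` returns `"homework"`, while a word that is far from every dictionary entry is left unchanged.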
##### State Machine Checker
For the state machine to work well, we first need to update the grammar and the role of each word manually. This step will be automated to some extent in the next milestones. Tokens are provided in `CheckerCorrector/SQLite/mydatabase.db`, and the basic graph is provided by `CheckerCorrector/DirectedGraph/BasicGraph.java`. A sentence first goes through the typo checker and is updated if needed (this also affects the confidence score). The sentence is then tokenized, and the graph is used to check whether the sentence follows the correct format. For each miss on any edge of the graph, a penalty is added to the confidence score.
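A minimal sketch of the edge check, assuming the sentence has already been reduced to a list of role tokens and the graph is an adjacency map from a role to the roles allowed to follow it. The role names, the `penaltyPerMiss` parameter, and the cap at 100 are assumptions; this is not the code in `BasicGraph.java`:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch; not the project's actual graph representation.
final class StateMachineCheckSketch {
    /** Adds penaltyPerMiss for every consecutive role pair that has no edge in the graph. */
    static int penalty(List<String> roles, Map<String, Set<String>> graph, int penaltyPerMiss) {
        int total = 0;
        for (int i = 0; i + 1 < roles.size(); i++) {
            Set<String> allowedNext = graph.getOrDefault(roles.get(i), Set.of());
            if (!allowedNext.contains(roles.get(i + 1))) {
                total += penaltyPerMiss;
            }
        }
        // Keep the result on the 0 to 100 confidence scale.
        return Math.min(100, total);
    }
}
```

With a graph that allows `ARTICLE -> NOUN -> VERB`, the role sequence `ARTICLE, NOUN, VERB` incurs no penalty, while `ARTICLE, VERB, NOUN` misses two edges.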
@@ -29,9 +32,19 @@ This checker used the crawled data and gave a score by summing up all the n_gram
To show the effectiveness of our tool, we used ChatGPT to write a script that uses a third-party Python library to compute the same score for each sentence. The samples can be found in the `CheckerCorrector/samples/` directory. The JSON files generated by both our tool and the third-party library are available there.
#### Corrector
The current corrector uses the typo corrector from the checker together with the state machine to suggest possible corrections. Corrections are based on the most similar path through the state machine. To find a similar path, we first run a DFS on the graph starting from the first token and store all possible paths. Then, based on the most similar path, we decide whether to change, delete, or add a token, and suggest a token with a similar role.
Our corrector uses three methods to correct a sentence. The first is the typo corrector, which fixes typos. The second is the state machine, which tries to fix the structure of the sentence based on the roles of its words. The third is our similarity method, which is similar in spirit to n-grams. It tries to fix verb tenses and also helps the state machine give better suggestions for replacements and insertions.
##### Typo Corrector
The typo corrector is the same one used by the checker.
##### State Machine Checker
The state machine suggests possible corrections to the structure of the sentence. Corrections are based on the most similar path through the state machine. To find a similar path, we first run a DFS on the graph starting from the first token and store all possible paths. Then, based on the most similar path, we decide whether to change, delete, or add a token, and suggest a token with a similar role.
The correction at this point is limited by the complexity of the state machine, but it will be improved for the next milestone.
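The path search can be sketched as below. The sketch assumes a non-empty list of role tokens, bounds the DFS with a hypothetical `maxLength` so that cycles in the graph terminate, and measures similarity with an edit distance over role sequences. The similarity measure used by the tool is not specified here, so that choice is an assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch; names and the similarity measure are assumptions.
final class PathSearchSketch {
    /** Collects every path of at most maxLength nodes that starts at the given node. */
    static List<List<String>> allPaths(Map<String, List<String>> graph, String start, int maxLength) {
        List<List<String>> paths = new ArrayList<>();
        dfs(graph, start, maxLength, new ArrayList<>(), paths);
        return paths;
    }

    private static void dfs(Map<String, List<String>> graph, String node, int maxLength,
                            List<String> current, List<List<String>> paths) {
        current.add(node);
        paths.add(new ArrayList<>(current));
        if (current.size() < maxLength) {
            for (String next : graph.getOrDefault(node, List.of())) {
                dfs(graph, next, maxLength, current, paths);
            }
        }
        current.remove(current.size() - 1); // backtrack
    }

    /** Edit distance between two role sequences (insert, delete, substitute). */
    static int distance(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.size()][b.size()];
    }

    /** Returns the stored path closest to the observed roles; roles must be non-empty. */
    static List<String> mostSimilarPath(Map<String, List<String>> graph, List<String> roles, int maxLength) {
        List<String> best = List.of();
        int bestDistance = Integer.MAX_VALUE;
        for (List<String> path : allPaths(graph, roles.get(0), maxLength)) {
            int dist = distance(roles, path);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = path;
            }
        }
        return best;
    }
}
```

Comparing the observed roles with the returned path position by position then indicates where a token would need to be changed, deleted, or added.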
##### Similarity method
This method also runs a DFS, this time on the crawled data, and stores all possible words (nodes) up to a depth of two in a file. Instead of running the DFS on tokens, it runs on the words themselves. It is mostly used to detect verb tenses, prepositions, or any other word combination that appears in the data. In other words, we look for words that have occurred close to the given word in our crawled data. To do so, the crawled data must be preprocessed and the neighbors of each word stored in the database.
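One possible reading of the preprocessing step is sketched below: for every word, collect the words that follow it within a window of two positions. The in-memory map stands in for the file or database storage, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of a neighbor index; not the project's storage format.
final class NeighborIndexSketch {
    /** Maps each word to the set of words seen within `depth` positions after it. */
    static Map<String, Set<String>> build(List<String> words, int depth) {
        Map<String, Set<String>> neighbors = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            Set<String> following = neighbors.computeIfAbsent(words.get(i), k -> new TreeSet<>());
            for (int j = i + 1; j <= i + depth && j < words.size(); j++) {
                following.add(words.get(j));
            }
        }
        return neighbors;
    }
}
```

For the words `she has been working` with a depth of two, the entry for `she` contains `has` and `been`, which is the kind of evidence that could support a tense suggestion.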
#### Learning Process and Database Updating
For the checker/corrector to work properly, it needs to process the crawled data. This processing consists of several stages. For n-grams, the tool breaks the data into phrases, computes the hash of each new phrase, and adds up the number of its occurrences. The other update concerns tokens: the tool needs to learn the role of each word in a sentence. The role can be given to it directly, which happens when crawled data from a dictionary is provided, or it can be learned from raw data. Learning from raw data can cause some misprocessing, since the tool infers the token by itself. That is why we provide an option where the user can decide, through a GUI, whether the role the tool predicted for a token is valid.
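A minimal sketch of the n-gram counting stage, assuming the crawled text has already been split into words. `String.hashCode()` is used here only as a stand-in, since the hash function the tool uses is not specified, and the in-memory map stands in for the database:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; the hash function and storage are assumptions.
final class NGramCounterSketch {
    /** Counts occurrences of every n-word phrase, keyed by the hash of the phrase. */
    static Map<Integer, Integer> count(List<String> words, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            String phrase = String.join(" ", words.subList(i, i + n));
            // Insert 1 for a new phrase, otherwise add 1 to the stored count.
            counts.merge(phrase.hashCode(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Keying by hash keeps the stored key small, at the cost of possible collisions between different phrases.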
@@ -82,6 +95,7 @@ Michael Harkess
### Known Issues/Bugs/Limitation
#### Checker/Corrector
- The biggest limitation of our tool is the amount of data that is stored in our databases and the amount of self-training and user feedback that has been given to our tool.
- The tool cannot detect the language; it assumes the user supplies the desired language by specifying the target.
- The corrector is mostly effective for typo correction. There are some hyperparameters within the code that can be tuned to control how aggressively the corrector rewrites its input, but it still has known problems with suggesting replacements.
- Both the corrector and the checker are limited by the size of the crawled data and by how much the user has corrected them.