- The module also provides a debug mode, enabled via a flag, that outputs the Trie as uncompressed JSON, allowing easy inspection and troubleshooting.
- The crawler also calculates and logs metrics, such as the data-processing rate and the number of links found, giving users real-time feedback on the crawl's progress.
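This kind of feedback can be produced with a simple counter and timer. The sketch below is illustrative only; the class and method names (`CrawlMetrics`, `recordLink`, `linksPerSecond`) are assumptions, not the module's actual API:

```java
// Illustrative sketch of crawler rate metrics; names are hypothetical.
public class CrawlMetrics {
    private final long startMillis = System.currentTimeMillis();
    private long linksFound = 0;

    // Called once per link discovered by the crawler.
    public void recordLink() {
        linksFound++;
    }

    public long linksFound() {
        return linksFound;
    }

    // Links processed per second since the crawl started.
    public double linksPerSecond() {
        double elapsedSec = (System.currentTimeMillis() - startMillis) / 1000.0;
        return elapsedSec > 0 ? linksFound / elapsedSec : 0.0;
    }
}
```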
**7. Scalability Considerations:**
- The module is designed to be modular, so it can scale without major code changes and can readily integrate additional techniques to optimize the crawler.
By focusing on these design decisions, the Crawler module efficiently collects and processes web-based textual data, providing a good starting point for subsequent language analysis tasks in our Language Correction project.
### Checker
Building upon the data collected by the Crawler, the Checker module evaluates snippets of text to determine the conformity of their language structures with standard usage. It assigns confidence scores to sentences and phrases, highlighting those that deviate significantly from common usage patterns. This system not only flags potentially incorrect language but also provides a quantitative measure of suspicion, thereby assisting in prioritizing corrections.
For the Checker component of the Language Correction project, several key design decisions focus on efficient data handling and precise language analysis, which play crucial roles in the system's ability to accurately identify and score language discrepancies. Here's a high-level description of these aspects:
**1. Integration with a Trie Structure:**
- The Checker relies on a Trie structure for storing and querying language data, which is crucial for efficient language processing. The Trie is used to store a serialized form of language data that includes common usage patterns, which the Checker uses to evaluate and suggest corrections.
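As a minimal sketch of such a structure, a word-level Trie can record how often each phrase (a path of words) occurs. The class and method names here are assumptions for illustration, not the project's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Word-level trie sketch: each path of words is a phrase,
// and each node counts how many phrases ended there.
public class PhraseTrie {
    private final Map<String, PhraseTrie> children = new HashMap<>();
    private long count = 0; // times a phrase ended at this node

    // Record one occurrence of the phrase formed by `words`.
    public void insert(String[] words) {
        PhraseTrie node = this;
        for (String w : words) {
            node = node.children.computeIfAbsent(w, k -> new PhraseTrie());
        }
        node.count++;
    }

    // How many times this exact phrase was inserted (0 if never seen).
    public long countOf(String[] words) {
        PhraseTrie node = this;
        for (String w : words) {
            node = node.children.get(w);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```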
- The use of data compression allows the Checker to handle large amounts of language data efficiently by reducing the space needed for storage and the time required for data transfer. This is particularly important for applications that need to scale to large datasets or operate within limited storage capacities.
- Decompression is handled on-the-fly as the Checker loads the Trie data. This method ensures that the processing overhead is minimized and that the data remains compact until needed.
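One way to realize this compression step is with the JDK's built-in GZIP streams. This is a sketch under that assumption; the actual module may use a different codec, and the class name `TrieCompression` is hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class TrieCompression {

    // Compress serialized Trie JSON before writing it to storage.
    public static byte[] compress(String json) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decompress on the fly as the Checker loads the Trie.
    public static String decompress(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```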
**3. Integration with TextProcessor:**
- The Checker component closely integrates with the TextProcessor class to preprocess text data, which is crucial for ensuring the accuracy of language analysis. The TextProcessor handles the extraction of sentences and phrases from the input text, which are the primary units of analysis for the Checker.
- This separation of concerns allows the Checker to focus solely on the language evaluation aspect, relying on TextProcessor to provide clean and structured data.
**4. Phrase Extraction Using N-Grams:**
- Phrase extraction is implemented using an n-gram model, which is a common technique in natural language processing. By examining contiguous sequences of words (n-grams) within the text, the system can effectively analyze common and unusual language patterns.
- This method allows for the extraction of phrases of variable lengths (controlled by `minN` and `maxN` parameters), offering flexibility in the granularity of language analysis. It is particularly useful in identifying non-standard language usage that may not be evident when analyzing larger text blocks or individual words.
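A sketch of this extraction step, using `minN`/`maxN` bounds and a Set to keep each phrase unique (the class and method names are illustrative assumptions):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public final class PhraseExtractor {

    // Collect all unique n-grams with n between minN and maxN from a token list.
    public static Set<String> extractPhrases(String[] tokens, int minN, int maxN) {
        Set<String> phrases = new LinkedHashSet<>(); // the Set drops duplicate phrases
        for (int n = minN; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.length; start++) {
                phrases.add(String.join(" ",
                        Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return phrases;
    }
}
```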
**5. Perplexity Calculation:**
- Perplexity is a measure used to quantify how well a probability model predicts a sample. In the context of the Checker, it's used to assess the likelihood of phrases based on their frequency and arrangement in the learned language model (stored and accessed via a Trie structure).
- A lower perplexity indicates that a phrase is more typical or expected in the language model, whereas higher values suggest rarity or unusual usage. Phrases with extremely high perplexity scores are flagged as potentially incorrect or suspicious, which are then highlighted to the user.
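For a phrase of N words with model probabilities p_1, …, p_N, perplexity is exp(-(1/N) Σ log p_i), so low-probability (rare) words drive the score up. A minimal sketch of that computation, assuming per-word probabilities have already been read from the Trie (the class name is hypothetical):

```java
public final class Perplexity {

    // Perplexity of a phrase given per-word probabilities from the language model:
    // PP = exp(-(1/N) * sum(log p_i)). Rare words (small p_i) push the score up.
    public static double perplexity(double[] wordProbs) {
        double logSum = 0.0;
        for (double p : wordProbs) {
            logSum += Math.log(p);
        }
        return Math.exp(-logSum / wordProbs.length);
    }
}
```

For example, a two-word phrase where each word has probability 0.25 has perplexity exactly 4.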
**6. Efficient Data Structures:**
- The use of a HashMap to store the scores of sentences and phrases ensures quick retrieval and update operations, which are essential for real-time language processing applications. This choice of data structure supports efficient key-value associations, which is ideal for mapping text units to their respective perplexity scores.
- The use of a Set in the extraction of phrases helps eliminate duplicates, ensuring that each unique phrase is only analyzed once, thereby improving the efficiency of the system.
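Combining the two structures, the scoring pass can map each unique phrase to its perplexity in a HashMap and keep only those above a suspicion threshold. This is a sketch; the names (`ScoreTable`, `flagSuspicious`) and the threshold-based flagging rule are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

public final class ScoreTable {

    // Map each phrase to its perplexity score and keep the suspicious ones.
    // HashMap gives O(1) average insert/lookup per phrase.
    public static Map<String, Double> flagSuspicious(
            Iterable<String> phrases,
            ToDoubleFunction<String> scorer,
            double threshold) {
        Map<String, Double> flagged = new HashMap<>();
        for (String phrase : phrases) {
            double score = scorer.applyAsDouble(phrase);
            if (score > threshold) {
                flagged.put(phrase, score);
            }
        }
        return flagged;
    }
}
```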
**7. Scalability and Performance Considerations:**
- The system is designed to handle large volumes of text efficiently. The modular design allows for easy scaling, where the Checker can process larger datasets or be extended to include more complex analytical features without significant redesign.
- The performance is also optimized through the use of efficient data compression and decompression techniques, which reduce the memory footprint of stored language models and speed up data transfer and processing.
Added perplexity and probability calculation methods for TrieNode
### Leon Long
Worked on Crawler Module: implemented data compression and storage.
Worked on GUI Feature: implemented UI to wrap everything together, allow for user inputs (URL processing, file system inputs, module selection), and real-time feedback to users (part of another feature).
### Manuel Segimón
Implemented GUI for corrector
### Tejas Singh
Worked on base functionality of crawler: implemented Jsoup, basic data structures (such as the URL queue), and CLI (for use with files).
Worked on real-time feedback: added system for timing methods and computing processing rate, added/cleaned up status outputs.
Worked on social media integration: added social media option to CLI, created new method for parsing Reddit user pages.