@@ -14,6 +14,35 @@ This Language Correction project not only aims to improve the quality of written
## Design and Implementation
### TrieNode
For the TrieNode component, significant design decisions were made to support both the Checker and the Corrector components effectively. These decisions cover managing language data, offering efficient and precise computations for language analysis, and supporting serialization and deserialization. Here’s a detailed overview of these aspects:
**1. Data Structure Choice:**
- The use of a Trie (prefix tree) structure is central to the TrieNode's design. This data structure is particularly well-suited for tasks that involve a large set of strings and need efficient query operations — typical in language processing tasks such as autocomplete, spelling checks, and now, perplexity calculations.
- Each node in the Trie represents a word or part of a word, making it an efficient way to store the language model where each path can represent a different phrase or sentence.
**2. Handling Phrase Insertion and Lookup:**
- The `insert` method allows for the addition of phrases into the Trie. Each word in a phrase navigates through the Trie, creating new nodes where necessary. This method increments counts to record the frequency of each phrase's occurrence, which is crucial for probability calculations.
- The `probability` method computes the likelihood of phrases based on the occurrences of words and their sequences in the Trie. This is fundamental for evaluating how typical or atypical a phrase is within the stored language data.
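The insertion and lookup behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual `TrieNode`: the class name `WordTrie`, the `String...` signatures, and the choice to estimate probability as `count(phrase) / count(prefix)` are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical word-level trie; names and signatures are illustrative only.
final class WordTrie {
    private final Map<String, WordTrie> children = new HashMap<>();
    private long count; // number of inserted phrases that passed through this node

    /** Adds one phrase, creating nodes where necessary and incrementing counts on the path. */
    void insert(String... words) {
        WordTrie node = this;
        node.count++;
        for (String word : words) {
            node = node.children.computeIfAbsent(word, w -> new WordTrie());
            node.count++;
        }
    }

    /** Estimates P(last word | preceding words) as count(phrase) / count(prefix). */
    double probability(String... words) {
        WordTrie parent = null;
        WordTrie node = this;
        for (String word : words) {
            parent = node;
            node = node.children.get(word);
            if (node == null) {
                return 0.0; // phrase never seen
            }
        }
        if (parent == null || parent.count == 0) {
            return 0.0; // empty phrase or empty trie
        }
        return (double) node.count / parent.count;
    }
}
```

With the phrases "the cat", "the dog", and "a cat" inserted, `probability("the", "cat")` is the share of phrases starting with "the" that continue with "cat".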
**3. Perplexity Computation:**
- Perplexity is used as a measure of how well a probability distribution predicts a sample. In the context of the TrieNode, perplexity calculations are used to assess the fluency and commonality of phrases in the language model.
- The method combines the probabilities of the individual words forming a phrase. It works with their logarithms, so the product becomes a sum and very small probability values do not underflow, thereby giving a measure of the phrase's commonality or rarity.
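As a rough sketch of the computation, the snippet below uses the textbook definition of perplexity, `exp(-(1/N) * sum(log p_i))`, over the per-word probabilities of a phrase. The helper name `Perplexity.of` and the handling of unseen words are assumptions; the project's method may normalise or smooth differently.

```java
// Hypothetical helper; uses the standard perplexity definition, not necessarily the project's exact formula.
final class Perplexity {
    private Perplexity() {}

    /** Returns exp(-(1/N) * sum(log p_i)) for the N word probabilities of a phrase. */
    static double of(double... wordProbabilities) {
        if (wordProbabilities.length == 0) {
            throw new IllegalArgumentException("phrase must contain at least one word");
        }
        double logSum = 0.0;
        for (double p : wordProbabilities) {
            if (p <= 0.0) {
                // Assumption: an unseen word makes the phrase maximally surprising.
                return Double.POSITIVE_INFINITY;
            }
            logSum += Math.log(p); // summing logs avoids underflow from multiplying tiny values
        }
        return Math.exp(-logSum / wordProbabilities.length);
    }
}
```

Lower values indicate a more typical phrase: two words that each have probability 0.5 give a perplexity of about 2.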
**4. Serialization and Deserialization:**
- Serialization involves converting the Trie structure into a byte array format, which can be saved or transmitted. The TrieNode uses a custom serialization format that stores each node's count and the structure in a compact form. This functionality is crucial for efficiently loading and saving the state of the language model.
- Deserialization reconstructs the Trie from its serialized form. This process is key when initializing the TrieNode with previously saved data, ensuring that the Trie is accurately rebuilt from its compact representation.
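The project's byte layout is not documented here, so the following is only one plausible shape for such a format: a pre-order walk that writes each node's count, its number of children, and then each child's key followed by the child itself. The class `SerializableTrie` and its field names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical node type; the layout below is an assumed format, not the project's.
final class SerializableTrie {
    final Map<String, SerializableTrie> children = new TreeMap<>();
    long count;

    /** Serializes this subtree in pre-order: count, child count, then (key, child) pairs. */
    byte[] toBytes() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private void write(DataOutputStream out) throws IOException {
        out.writeLong(count);
        out.writeInt(children.size());
        for (Map.Entry<String, SerializableTrie> entry : children.entrySet()) {
            out.writeUTF(entry.getKey());
            entry.getValue().write(out);
        }
    }

    /** Rebuilds a trie from bytes produced by toBytes(). */
    static SerializableTrie fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static SerializableTrie read(DataInputStream in) throws IOException {
        SerializableTrie node = new SerializableTrie();
        node.count = in.readLong();
        int childCount = in.readInt();
        for (int i = 0; i < childCount; i++) {
            String key = in.readUTF();
            node.children.put(key, read(in));
        }
        return node;
    }
}
```

Because every node records how many children follow, the reader needs no end markers and can rebuild the structure in a single pass.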
**5. Efficiency Considerations:**
- Both serialization and deserialization processes are optimized for performance. The custom methods allow for direct control over the format and efficiency of these operations, ensuring that the Trie can handle large datasets without excessive memory use or processing time.
- Special care is taken to manage memory during these processes, such as using a `ByteArrayOutputStream` and buffer handling during decompression to minimize memory overhead.
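To illustrate the buffer handling mentioned above, here is a small sketch that streams compressed bytes through a fixed-size buffer into a `ByteArrayOutputStream`. GZIP is an assumption chosen for the example; the text does not say which compression scheme the project uses, and `TrieCompression` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; GZIP stands in for whatever codec the project actually uses.
final class TrieCompression {
    private TrieCompression() {}

    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // read only after the gzip stream is closed and flushed
    }

    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192]; // one reusable chunk instead of a buffer per read
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Reusing a single buffer keeps the extra memory needed during decompression constant, regardless of how large the serialized Trie is.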
**6. Cloneability:**
- Implementing `Cloneable` allows the TrieNode to be duplicated. This is beneficial in scenarios where an isolated copy of the Trie is needed for testing or when operating in a context where modifications to a Trie should not affect the original, such as in concurrent processing scenarios.
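For the copy to be truly isolated, the clone has to be deep: each child node must be cloned as well, otherwise the copy and the original would share subtrees. A minimal sketch, assuming a hypothetical `CloneableTrie` with a child map and a count:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical node type used only to illustrate a deep clone.
final class CloneableTrie implements Cloneable {
    Map<String, CloneableTrie> children = new HashMap<>();
    long count;

    @Override
    public CloneableTrie clone() {
        try {
            CloneableTrie copy = (CloneableTrie) super.clone(); // shallow copy: shares the child map
            copy.children = new HashMap<>();                     // replace it with an independent map
            for (Map.Entry<String, CloneableTrie> entry : children.entrySet()) {
                copy.children.put(entry.getKey(), entry.getValue().clone());
            }
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError("Cloneable is implemented", e);
        }
    }
}
```

After cloning, changing counts or adding children on the copy leaves the original untouched.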
The TrieNode component is intricately designed to efficiently handle, store, and process language data critical for the operation of both the Checker and the Corrector components. By leveraging a Trie structure, it facilitates rapid insertions, searches, and analysis of phrases. The custom serialization and deserialization methods ensure that the Trie can be efficiently saved and loaded, maintaining its integrity across different states of the application. This design not only supports the current functionalities but also provides a robust foundation for future enhancements and scalability of the Language Correction project.
### Crawler
The Crawler module serves as a foundation for a language correction tool, focusing on collecting web-based textual data and storing it efficiently for further processing. It creates a corpus of language usage, which can later be analyzed for linguistic patterns in various languages. Below is a high-level description of the implementation of the data structures and algorithms:
@@ -107,7 +136,7 @@ The Corrector component combines sophisticated algorithms with efficient data ha
## Feature Implementation
### Real Time Status/Statistics - 10%
### Provide real-time status and statistics feedback for the crawler - 10%
This feature is part of the crawler, as it provides real-time, accurate information regarding its state and behavior. In particular, it informs the user of the current page being processed, the number of links found per page, the crawler’s rate of processing (in bytes/sec), and the size of the metadata (both compressed and uncompressed).
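As an illustration of how the processing rate could be derived, the sketch below accumulates processed bytes and divides by elapsed time. The class `CrawlStats`, its method names, and the status line format are hypothetical and not taken from the crawler's code.

```java
import java.util.Locale;

// Hypothetical statistics tracker; names and output format are illustrative only.
final class CrawlStats {
    private final long startNanos;
    private long bytesProcessed;
    private int pagesProcessed;

    CrawlStats(long startNanos) {
        this.startNanos = startNanos;
    }

    /** Records one finished page of the given size. */
    void recordPage(long pageBytes) {
        bytesProcessed += pageBytes;
        pagesProcessed++;
    }

    int pagesProcessed() {
        return pagesProcessed;
    }

    /** Average processing rate since start, in bytes per second. */
    double bytesPerSecond(long nowNanos) {
        long elapsedNanos = nowNanos - startNanos;
        if (elapsedNanos <= 0) {
            return 0.0;
        }
        return bytesProcessed / (elapsedNanos / 1_000_000_000.0);
    }

    /** Formats a one-line status message for the page currently being processed. */
    String statusLine(String currentUrl, int linksFound, long nowNanos) {
        return String.format(Locale.ROOT, "page=%s links=%d rate=%.1f B/s",
                currentUrl, linksFound, bytesPerSecond(nowNanos));
    }
}
```

In a running crawler the timestamps would typically come from `System.nanoTime()`; passing them in explicitly keeps the calculation easy to test.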
@@ -140,7 +169,13 @@ A filtering mechanism is employed in the `correct` method, where only sentences
The combination of trie structures for perplexity calculations, priority queues for maintaining top scores, backtracking for generating sentence permutations, and hashing mechanisms for change tracking provides a robust system. This system efficiently ranks corrections of a given text by their likelihood and deviation from the original, offering users meaningful and contextually appropriate alternatives.
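The priority-queue part of this design can be sketched as a bounded heap that keeps only the best few candidates. The example assumes that a lower perplexity means a better correction; `TopCorrections` and `Candidate` are hypothetical names, not the Corrector's real types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical bounded top-k collector; assumes lower perplexity is better.
final class TopCorrections {
    static final class Candidate {
        final String sentence;
        final double perplexity;

        Candidate(String sentence, double perplexity) {
            this.sentence = sentence;
            this.perplexity = perplexity;
        }
    }

    private final int capacity;
    // Worst candidate (highest perplexity) sits at the head so it can be evicted cheaply.
    private final PriorityQueue<Candidate> worstFirst = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.perplexity).reversed());

    TopCorrections(int capacity) {
        this.capacity = capacity;
    }

    void offer(String sentence, double perplexity) {
        worstFirst.add(new Candidate(sentence, perplexity));
        if (worstFirst.size() > capacity) {
            worstFirst.poll(); // drop the current worst candidate
        }
    }

    /** Returns the retained sentences, best (lowest perplexity) first. */
    List<String> bestFirst() {
        List<Candidate> sorted = new ArrayList<>(worstFirst);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.perplexity));
        List<String> result = new ArrayList<>();
        for (Candidate c : sorted) {
            result.add(c.sentence);
        }
        return result;
    }
}
```

Keeping the heap at a fixed size means memory stays bounded even when backtracking generates a very large number of sentence permutations.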
### Social Media Implementation - 15%
### Graphical User Interface that highlights suspicious textual elements in a given text - 15%
The `MainApp` class provides a Graphical User Interface built with Java Swing, ensuring cross-platform compatibility and a user-friendly interface. `BorderLayout` allows for a clean layout with input fields, a ‘run’ button, and display areas. `JComboBox` presents a drop-down menu and dynamically alters the ‘run’ button depending on the user’s choice of module. This keeps the interface’s visual appearance clean while allowing user selections. Inputs are managed through a `JTextField`, which accepts URLs, paths to files, and texts. The corresponding module output is shown in a `JTextArea`, complemented by a `JScrollPane` for easy viewing of large outputs from the program.
Furthermore, the GUI for the checker highlights significant parts of the text using the `Highlighter` and `HighlightPainter` interfaces, enhancing usability by visually distinguishing potentially suspicious text in the analysis output. In addition, intensive processing tasks are executed in a separate thread, keeping the interface responsive. On startup, the application checks for the necessary user-specific configuration and prompts users for initial setup if needed, ensuring the tool is ready for proper operation.
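A minimal sketch of how Swing highlighting can be applied is shown below. It locates every occurrence of a suspicious phrase and marks it in a `JTextArea` through the standard `Highlighter` API. The class `SuspiciousHighlighter`, the plain substring search, and the yellow colour are assumptions for illustration; how `MainApp` decides what is suspicious is not shown here.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;
import javax.swing.JTextArea;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.Highlighter;

// Hypothetical helper; MainApp's real highlighting logic may differ.
final class SuspiciousHighlighter {
    private SuspiciousHighlighter() {}

    /** Returns the [start, end) offsets of every non-overlapping occurrence of phrase in text. */
    static List<int[]> findSpans(String text, String phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.isEmpty()) {
            return spans;
        }
        int from = 0;
        while ((from = text.indexOf(phrase, from)) >= 0) {
            spans.add(new int[] {from, from + phrase.length()});
            from += phrase.length();
        }
        return spans;
    }

    /** Highlights every occurrence of phrase in the text area. Call on the Swing event thread. */
    static void highlight(JTextArea area, String phrase) {
        Highlighter highlighter = area.getHighlighter();
        Highlighter.HighlightPainter painter =
                new DefaultHighlighter.DefaultHighlightPainter(Color.YELLOW);
        for (int[] span : findSpans(area.getText(), phrase)) {
            try {
                highlighter.addHighlight(span[0], span[1], painter);
            } catch (BadLocationException e) {
                throw new IllegalStateException("span outside document", e);
            }
        }
    }
}
```

Separating the offset search from the painting keeps the slow part free of Swing calls, so it can run on a worker thread while only `highlight` touches the component.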
### Extend your crawler to crawling social media posts of some large network (reddit) - 15%
Continuing with features of the crawler, we expanded its capabilities to be able to access Reddit, one of the largest social media sites/forums online.
@@ -148,11 +183,11 @@ Our implementation involves passing a flag (--social) and a Reddit username into
With regard to processing these user pages, the posts with which the user has interacted are found by identifying specific “shreddit” tags that are accessible from the web data. This is accomplished in a brute-force manner by looping through the entirety of the page. The links to these posts are then added to the URL queue and processed using the standard method. In the case where the user has not interacted with any posts (resulting in 0 links being added to the queue), the Reddit homepage is added instead as a seed URL.
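The brute-force scan and the homepage fallback could look roughly like the sketch below. The tag prefix and attribute name are passed in as parameters because the exact markup the crawler matches is not stated here; the values used in the usage note (`shreddit-` and `permalink`) are assumptions, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical scanner; tag and attribute names are supplied by the caller.
final class PostLinkScanner {
    private PostLinkScanner() {}

    /** Walks the whole page and collects the given attribute from every tag starting with tagPrefix. */
    static List<String> extract(String html, String tagPrefix, String attribute) {
        List<String> links = new ArrayList<>();
        String open = "<" + tagPrefix;
        String needle = attribute + "=\"";
        int pos = 0;
        while ((pos = html.indexOf(open, pos)) >= 0) {
            int tagEnd = html.indexOf('>', pos);
            if (tagEnd < 0) {
                break; // unterminated tag at end of page
            }
            String tag = html.substring(pos, tagEnd);
            int attr = tag.indexOf(needle);
            if (attr >= 0) {
                int valueStart = attr + needle.length();
                int valueEnd = tag.indexOf('"', valueStart);
                if (valueEnd > valueStart) {
                    links.add(tag.substring(valueStart, valueEnd));
                }
            }
            pos = tagEnd + 1;
        }
        return links;
    }

    /** Falls back to a single seed URL when no post links were found. */
    static List<String> seedsOrFallback(List<String> links, String fallbackUrl) {
        return links.isEmpty() ? Collections.singletonList(fallbackUrl) : links;
    }
}
```

For example, `extract(page, "shreddit-", "permalink")` followed by `seedsOrFallback(links, "https://www.reddit.com/")` yields either the discovered post links or the homepage as the only seed. A simple substring scan like this does not handle attribute values containing `>` and is not a substitute for a real HTML parser.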
### GUI for Text Analysis - 10%
### System works for multiple languages in which none of the team members have fluency - 15% for each (45% total)
The system was designed to be language-agnostic, allowing for the analysis and correction of text in multiple languages. This was achieved by using a Trie structure to store language data, which can be loaded from serialized files. The Trie structure is used to calculate perplexity scores for sentences, which are then used to rank potential corrections. The system was tested with text in German, Italian, and Portuguese, languages that none of the team members are fluent in. The system was able to analyze and correct text in these languages, demonstrating its language-agnostic capabilities.
Text corpora for German, Italian, and Portuguese were obtained from the Leipzig Corpora Collection. These corpora are used to train the Trie structure whenever the language is switched, allowing users to test how the system analyzes and corrects text in these languages.