Language Correction - Group 7
Team Members: Manuel Segimón, Moi Bensadon, Leon Long, Tejas Singh
Problem and Project Description
In today’s digital age, the accuracy and appropriateness of language used in online content are paramount for clear communication and professional presentation. As the internet continues to expand, web pages, social media posts, and other forms of digital text are often written quickly and without thorough review, leading to a proliferation of grammatical errors and awkward phrasing. This affects not only the readability of the content but also the credibility of the information presented.
The Language Correction project aims to address these issues by developing a sophisticated tool that evaluates language usage over the web. The core objective of this tool is to enhance the quality of digital content by identifying and correcting improper language structures.
The challenge lies in developing a system that is both efficient and effective in real-time analysis and correction of language on a wide scale. The tool must minimize false positives (incorrectly flagged correct usage) and false negatives (overlooked errors), ensuring high reliability and user trust. Additionally, the project explores further enhancements such as integration with social media platforms, development of a graphical user interface, and extension to multiple languages, broadening the tool’s applicability and accessibility.
This Language Correction project not only aims to improve the quality of written content but also serves as a step towards more sophisticated natural language processing applications in various online and digital platforms.
Design and Implementation
Crawler
The Crawler module serves as a foundation for a language correction tool, focusing on collecting web-based textual data and storing it efficiently for further processing. It creates a corpus of language usage, which can later be analyzed for linguistic patterns in various languages. Below is a high-level description of the implementation of the data structures and algorithms:
1. Efficient URL Management:
- The module uses a queue to manage the URLs that need to be crawled, ensuring systematic FIFO processing.
- To prevent re-processing URLs, the Crawler uses a set to check in O(1) average time whether a URL has already been processed (see the sketch after this list).
2. Robust Web Interaction:
- The module uses Jsoup to handle HTTP interactions, checking the status code of each fetched URL to ensure content is retrieved successfully. This makes the system robust to fetch errors.
3. Data Compression Techniques:
- The Crawler employs compression techniques to reduce the space required for storage and thus speeds up data transfer when loading and saving the language model data.
- Furthermore, decompression occurs only when data needs to be processed, minimizing memory usage and processing time.
4. N-Gram Extraction and Usage Analysis:
- The module uses a trie to store and query n-grams efficiently as the Trie data structure allows for fast insertions and searches, making it suitable for handling large amounts of language data.
- The module breaks the texts into n-grams which can be analyzed across different sources and languages, allowing for further analysis by other modules in the project.
5. Dynamic Data Serialization:
- The Trie is handled through a serialization method that converts it into a byte array for compression.
- The serialized model can also be saved to and loaded from the file system, allowing for greater flexibility.
6. Performance Monitoring and Debugging:
- The module also includes a debug mode, enabled via a flag, that writes the Trie as uncompressed JSON for easy inspection and troubleshooting.
- The crawler also calculates and logs metrics such as the rate of data processing and the number of links found per page, giving the user useful feedback and improving the overall experience.
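To make the above concrete, here is a minimal sketch of the crawl loop described in items 1 and 2, using illustrative names; the n-gram/Trie insertion step is left as a comment since its exact interface is internal to the module.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

// Minimal sketch of the crawl loop: a FIFO queue of URLs, a visited set for
// O(1) de-duplication, and Jsoup for fetching pages. Names are illustrative.
public class CrawlerSketch {
    private final Queue<String> urlQueue = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();

    public void crawl(String seedUrl, int maxPages) {
        urlQueue.add(seedUrl);
        int processed = 0;
        while (!urlQueue.isEmpty() && processed < maxPages) {
            String url = urlQueue.poll();
            if (!visited.add(url)) continue;              // skip already-processed URLs
            try {
                Document doc = Jsoup.connect(url).get();  // Jsoup throws on bad status codes
                String text = doc.body().text();
                // n-gram extraction and Trie insertion would happen here
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) urlQueue.add(next);
                }
                processed++;
            } catch (Exception e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }
}
```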
Checker
The Checker module evaluates snippets of text to determine whether their language structures conform to standard usage. It assigns confidence scores to sentences and phrases, highlighting those that deviate significantly from common usage patterns. This system not only flags potentially incorrect language but also provides a quantitative measure of suspicion, helping prioritize corrections.
High level overview:
1. Integration with a Trie Structure:
- The Checker relies on a Trie structure for storing and querying language data, which is crucial for efficient language processing. The Trie is used to store a serialized form of language data that includes common usage patterns, which the Checker uses to evaluate and suggest corrections.
- The serialized data is loaded from a file (`metadata.ser`), ensuring that the Checker operates with the most updated language model available, which enhances the accuracy and relevance of the corrections.
2. Data Compression and Decompression:
- The use of data compression allows the Checker to handle large amounts of language data efficiently by reducing the space needed for storage and the time required for data transfer. This is particularly important for applications that need to scale to large datasets or operate within limited storage capacities.
- Decompression is handled on-the-fly as the Checker loads the Trie data. This method ensures that the processing overhead is minimized and that the data remains compact until needed.
3. Phrase Extraction Using N-Grams:
- Phrase extraction is implemented using an n-gram model, which is a common technique in natural language processing. By examining contiguous sequences of words (n-grams) within the text, the system can effectively analyze common and unusual language patterns.
- This method allows for the extraction of phrases of variable lengths (controlled by the `minN` and `maxN` parameters), offering flexibility in the granularity of language analysis. It is particularly useful in identifying non-standard language usage that may not be evident when analyzing larger text blocks or individual words.
4. Perplexity Calculation:
- Perplexity is a measure used to quantify how well a probability model predicts a sample. In the context of the Checker, it's used to assess the likelihood of phrases based on their frequency and arrangement in the learned language model (stored and accessed via a Trie structure).
- A lower perplexity indicates that a phrase is more typical or expected in the language model, whereas higher values suggest rarity or unusual usage. Phrases with extremely high perplexity scores are flagged as potentially incorrect or suspicious and highlighted to the user (a sketch of the extraction and scoring steps follows this list).
5. Efficient Data Structures:
- The use of a HashMap to store the scores of sentences and phrases ensures quick retrieval and update operations, which are essential for real-time language processing applications. This choice of data structure supports efficient key-value associations, which is ideal for mapping text units to their respective perplexity scores.
- The use of a Set in the extraction of phrases helps eliminate duplicates, ensuring that each unique phrase is only analyzed once, thereby improving the efficiency of the system.
6. Scalability and Performance Considerations:
- The system is designed to handle large volumes of text efficiently. The modular design allows for easy scaling, where the Checker can process larger datasets or be extended to include more complex analytical features without significant redesign.
- The performance is also optimized through the use of efficient data compression and decompression techniques, which reduce the memory footprint of stored language models and speed up data transfer and processing.
7. Output in JSON Format:
- Outputting results in JSON format aligns with modern data handling practices, offering high compatibility with web interfaces and ease of integration with other software tools. JSON is lightweight, easy to parse, and widely used, making it an excellent choice for both internal diagnostics and external reporting.
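To illustrate items 3 and 4, the following sketch extracts variable-length n-grams and computes a perplexity score. The probability lookup is a simplified stand-in (a plain map of word probabilities) for the Trie query the Checker actually performs; all names here are illustrative.

```java
import java.util.*;

// Sketch: variable-length n-gram extraction plus a perplexity score.
// The Map<String, Double> model stands in for the Trie lookup described above.
public class CheckerSketch {

    // Collect all n-grams with minN <= n <= maxN; a Set removes duplicate phrases.
    static Set<String> extractPhrases(String text, int minN, int maxN) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> phrases = new LinkedHashSet<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                phrases.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return phrases;
    }

    // Perplexity = exp(-(1/N) * sum(log p(w_i))); lower values mean a more typical phrase.
    static double perplexity(String phrase, Map<String, Double> model) {
        String[] words = phrase.split(" ");
        double logSum = 0.0;
        for (String w : words) {
            double p = model.getOrDefault(w, 1e-6);  // simple smoothing for unseen words
            logSum += Math.log(p);
        }
        return Math.exp(-logSum / words.length);
    }
}
```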
These design decisions underscore a commitment to robustness, efficiency, and accuracy in the Checker component of the Language Correction project. By leveraging advanced data structures, efficient algorithms, and strategic modular integration, the system effectively addresses the challenges of real-time, large-scale language correction and analysis.
Corrector
The final component, the Corrector, offers suggestions for rectifying identified errors, aiming to replace incorrect or awkward phrases with more appropriate alternatives. This functionality is crucial for users seeking to enhance the quality of their written content directly based on the system’s analysis.
For the Corrector component of the Language Correction project, several pivotal design decisions ensure the tool not only efficiently identifies but also corrects incorrect or unusual language usage. Here's a high-level overview of these decisions:
1. Integration with a Trie Structure:
- Like the Checker, the Corrector relies on a Trie structure for storing and querying language data, which is essential for efficient language processing. The Trie stores a serialized form of language data that includes common usage patterns, which the Corrector uses to evaluate and suggest corrections.
2. Data Decompression:
- Like the Checker, the Corrector leverages data compression to handle large amounts of language data efficiently. By compressing the serialized language model, the Corrector reduces its memory footprint and improves performance, decompressing the data only when it is needed.
3. Sentence Reconstruction Algorithm:
- The Corrector includes an algorithm for generating potential sentence reconstructions. It uses a backtracking approach to explore various combinations and sequences of words, searching for the arrangement with the lowest perplexity score, which indicates the most grammatically sound structure (see the sketch after this list).
- This method allows the Corrector not only to identify problems but also to actively propose more acceptable forms of a sentence, enhancing both the utility and the interactive nature of the tool.
4. Perplexity-based Scoring System:
- Each generated sentence configuration is scored based on its perplexity, a measure provided by the loaded Trie. This scoring system evaluates how common or rare a sentence structure is within the context of standard usage patterns.
- Sentences with lower perplexity scores are considered more typical and therefore preferable. This scoring mechanism is central to deciding the best correction to suggest, ensuring that the recommendations are not only grammatically correct but also contextually appropriate.
5. Efficient Iterative Processing:
- The Corrector processes input text iteratively, considering each sentence independently. This design choice allows the Corrector to scale effectively, handling texts of any length by breaking them down into manageable units.
- This approach also enables the Corrector to provide real-time corrections in a piecewise manner, which is ideal for interactive applications where users might need immediate feedback on specific sections of text.
6. Handling of Non-alphanumeric Characters:
- Prior to correction, sentences are stripped of non-alphanumeric characters, focusing the correction process strictly on the words and their arrangement. This simplifies the processing and ensures that punctuation does not skew the analysis.
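As a rough illustration of the sentence reconstruction step (item 3), the sketch below enumerates candidate sentences by including or excluding each word via backtracking and keeps the candidate with the lowest score. The scoring function is a placeholder for the Trie-based perplexity described above, and all names are illustrative.

```java
import java.util.*;

// Sketch: backtracking over include/exclude decisions for each word, keeping the
// candidate with the lowest score. scoreSentence(...) is a stand-in for the
// Trie-based perplexity calculation.
public class CorrectorSketch {

    static List<String> generateSentences(String[] words, int minLength) {
        List<String> results = new ArrayList<>();
        backtrack(words, 0, new ArrayList<>(), results, minLength);
        return results;
    }

    private static void backtrack(String[] words, int idx, List<String> current,
                                  List<String> results, int minLength) {
        if (idx == words.length) {
            if (current.size() >= minLength) results.add(String.join(" ", current));
            return;
        }
        current.add(words[idx]);                       // include the current word
        backtrack(words, idx + 1, current, results, minLength);
        current.remove(current.size() - 1);            // backtrack
        backtrack(words, idx + 1, current, results, minLength);  // exclude the word
    }

    static String bestCorrection(String[] words, int minLength) {
        String best = null;
        double bestScore = Double.MAX_VALUE;
        for (String candidate : generateSentences(words, minLength)) {
            double score = scoreSentence(candidate);   // lower perplexity is better
            if (score < bestScore) { bestScore = score; best = candidate; }
        }
        return best;
    }

    private static double scoreSentence(String s) {
        return s.length();  // placeholder; the real tool queries the Trie for perplexity
    }
}
```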
The Corrector component combines sophisticated algorithms with efficient data handling to provide a robust solution for correcting language errors in digital text. By leveraging Trie structures for data storage, backtracking algorithms for sentence reconstruction, and perplexity scores for evaluation, the Corrector offers a comprehensive tool that enhances written communication across digital platforms. The implementation ensures that the tool is both efficient in terms of performance and effective in delivering high-quality corrections.
Feature Implementation
Real Time Status/Statistics - 10%
This feature is part of the crawler, as it provides real-time, accurate information regarding its state and behavior. In particular, it informs the user of the current page being processed, the amount of links found per page, the crawler’s rate of processing (in bytes/sec), and the size of the metadata (both compressed and uncompressed).
Its implementation was straightforward, as much of the infrastructure that supports this information was part of the crawler's core functionality. Specifically, finding the current page being processed can be done at the same time that its URL is retrieved from the URL queue, hence there is no additional overhead. Similarly, the number of links found on a page is merely a function of the number of URLs being added to the queue (thus it only involves incrementing a variable in parallel). The size of the metadata can be found via the size() method being invoked on the byte arrays that are used to store the compressed/uncompressed word usage data.
Hence, the only portion that required additional logic was computing the rate of processing (using nanoTime() at the beginning and end of the processing function), as it required timing the page processing/compression steps as well as accessing the size of the web data at hand. Additionally, we had to change the output from System.out to a callback that streams output to the text area instead of to the console.
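A small sketch of this timing logic, with `processPage` as a hypothetical placeholder for the fetch/compress step and the rate reported through a callback rather than System.out:

```java
import java.util.function.Consumer;

// Sketch: timing a page-processing step with System.nanoTime() and reporting the
// rate through a callback instead of the console. processPage is a placeholder.
public class RateMonitorSketch {
    static byte[] processPage(String url) { return new byte[1024]; } // placeholder

    public static void report(String url, Consumer<String> statusCallback) {
        long start = System.nanoTime();
        byte[] pageBytes = processPage(url);                         // step being timed
        double seconds = (System.nanoTime() - start) / 1e9;
        double bytesPerSecond = pageBytes.length / Math.max(seconds, 1e-9);
        statusCallback.accept(String.format("Processed %s at %.1f bytes/sec", url, bytesPerSecond));
    }
}
```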
Provide a list of reasonable corrections to a suspicious text, ranked in order of how different they are from the original text - 15%
The feature for providing a list of reasonable corrections to a suspicious text, ranked in order of how different they are from the original text, was implemented using various data structures and algorithms in the Java programming language. Here's an in-depth look at the approach:
Key Components of the Implementation
Data Structures
- PriorityQueue: Utilized to maintain a list of generated sentences along with their respective perplexity scores, sorted in descending order to easily fetch the best candidates.
- List and ArrayList: These are used to store generated sentence combinations and to maintain the list of corrected sentences before final filtering.
- Map and HashMap: Employed to map sentences to their respective change count relative to the original sentence, which aids in sorting them by the number of changes.
Algorithms
- Backtracking: This recursive strategy is pivotal in the `generateSentences` method for generating potential sentence permutations by selectively including/excluding words. It ensures that each combination adheres to a specific length constraint, reflecting a substantial yet reasonable modification from the original sentence.
- Sorting and Ranking: Once potential corrections are scored, they are sorted based on their perplexity scores. Subsequently, in `printSentencesInOrderOfChanges`, sentences are ranked based on the count of changes needed to transform the original sentence into the corrected one.
Detailed Breakdown of Key Methods
printSentencesInOrderOfChanges
This method is responsible for ordering corrected sentences by the number of changes from the original sentence. It compares each word of the original sentence with the corrected sentence and tallies differences to compute a 'change score'. It employs a HashMap to associate each sentence with its change score and then sorts these entries by their values (change scores).
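A minimal sketch of this change-count idea, comparing the two sentences word by word and ordering candidates by fewest changes (names are illustrative, not the project's exact implementation):

```java
import java.util.*;

// Sketch: count word-level differences between the original sentence and each
// candidate correction, then order candidates by fewest changes.
public class ChangeRankingSketch {

    static int changeCount(String original, String corrected) {
        String[] a = original.split("\\s+");
        String[] b = corrected.split("\\s+");
        int changes = Math.abs(a.length - b.length);          // added/removed words
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            if (!a[i].equalsIgnoreCase(b[i])) changes++;       // substituted words
        }
        return changes;
    }

    static List<String> orderByChanges(String original, List<String> candidates) {
        Map<String, Integer> scores = new HashMap<>();
        for (String c : candidates) scores.put(c, changeCount(original, c));
        List<String> ordered = new ArrayList<>(candidates);
        ordered.sort(Comparator.comparing(scores::get));       // fewest changes first
        return ordered;
    }
}
```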
Removing Sentences in the 0.5 Percentile
A filtering mechanism is employed in the `correct` method, where only sentences with scores not exceeding 1.5 times the score of the best sentence are retained. This effectively discards less likely corrections and focuses on the most linguistically probable options. This step ensures that the final suggestions are not only close to the original in structure but also reasonable in terms of language use.
The combination of trie structures for perplexity calculations, priority queues for maintaining top scores, backtracking for generating sentence permutations, and hashing mechanisms for change tracking provides a robust system. This system efficiently ranks corrections of a given text by their likelihood and deviation from the original, offering users meaningful and contextually appropriate alternatives.
Social Media Implementation - 15%
Continuing with features of the crawler, we expanded its capabilities to be able to access Reddit, one of the largest social media sites/forums online.
Our implementation involves passing a flag (--social) and a Reddit username into the command line interface. The name is then appended onto a complete URL and the page is added to the URL queue.
With regards to processing these user pages, the posts with which the user has interacted are found by identifying specific “shreddit” tags in the fetched web data. This is done in a brute-force manner by looping through the entire page. The links to these posts are then added to the URL queue and processed using the standard method. In the case where the user has not interacted with any posts (resulting in 0 links being added to the queue), the Reddit homepage is added instead as a seed URL.
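As a hedged sketch of this approach, the snippet below scans all elements of a fetched user page for tag names containing “shreddit” and collects post links, falling back to the Reddit homepage when none are found. The profile URL and attribute names are assumptions about Reddit’s markup rather than the project’s exact selectors.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

// Sketch: brute-force scan of a Reddit user page for "shreddit" elements and
// collection of post links. The URL pattern and attribute names are assumptions.
public class RedditSketch {
    public static List<String> findPostLinks(String username) throws Exception {
        Document doc = Jsoup.connect("https://www.reddit.com/user/" + username).get();
        List<String> links = new ArrayList<>();
        for (Element el : doc.getAllElements()) {              // loop over the whole page
            if (el.tagName().contains("shreddit")) {
                String href = el.absUrl("permalink").isEmpty()
                        ? el.absUrl("href") : el.absUrl("permalink");
                if (!href.isEmpty()) links.add(href);
            }
        }
        if (links.isEmpty()) links.add("https://www.reddit.com/"); // fall back to homepage
        return links;
    }
}
```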
GUI for Text Analysis - 10%
The `MainApp` class provides a Graphical User Interface built with Java Swing, ensuring cross-platform compatibility and a user-friendly experience. A `BorderLayout` keeps the layout clean, with input fields, a ‘run’ button, and display areas. A `JComboBox` presents a drop-down menu and dynamically alters the ‘run’ button depending on the user’s choice of module, keeping the interface visually clean while still allowing user selections. Inputs are managed through a `JTextField`, which accepts URLs, paths to files, and text, and the corresponding module output appears in a `JTextArea` complemented by a `JScrollPane` for easy viewing of large outputs from the program.
Furthermore, the GUI highlights significant parts of the text using the `Highlighter` and `HighlightPainter` classes, enhancing usability by visually distinguishing potentially suspicious text in the analysis output. Other notable features include running intensive processing tasks in a separate thread to keep the interface responsive, and checking on startup for the necessary user-specific configuration, prompting users for initial setup if needed so that the tool is ready for proper operation.
Code
Repository: https://agile.bu.edu/gitlab/ec504/ec504_projects/group7/-/tree/master
Work breakdown
Signed by: Tejas S., Leon Long, Moi Bensadón, Manuel Segimón
Moi Bensadon
Built initial Corrector Module using n-grams on the Brown corpus
Integrated separate checker/corrector and crawler modules into one package
Implemented Serialization and Deserialization of TrieNode for crawler
Implemented streaming for GUI so that output appears as it crawls each site
Added dropdown for module selection and dialog box to GUI to build off of existing corpora
Added perplexity and probability calculation methods for TrieNode
Leon Long
Worked on Crawler Module: implemented data compression and storage.
Worked on GUI Feature: implemented UI to wrap everything together, allow for user inputs (URL processing, file system inputs, module selection), and real-time feedback to users (part of another feature).
Manuel Segimón
Built initial checker module using statistical methods
Implemented TrieNode structure for crawler to use ngrams and track conditional probabilities
Limited compression to 1KB per site by tracking incremental compressed size in crawler
Implemented logic for making correction and ranking based on differences
Implemented GUI text highlighter for checker
Implemented GUI for corrector
Found and generated text corpora for German, Italian, and Portuguese (From: https://wortschatz.uni-leipzig.de/en/download/German)
Added executables to the project so that reviewers can easily run the program and wrote execution instructions to INSTALL.txt
Tejas Singh
Worked on base functionality of crawler: implemented Jsoup, basic data structures (such as the URL queue), and CLI (for use with files).
Worked on real-time feedback: added system for timing methods and computing processing rate, added/cleaned up status outputs.
Worked on social media integration: added social media option to CLI, created new method for parsing Reddit user pages.