Commit 8719bc33 authored by Michael D Harkess
Removed JSoup from Crawler Parser, Updated README

parent 04aedc65
+0 −41
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.*;
import java.util.ArrayList;
import java.util.List;
@@ -10,37 +5,6 @@ import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexParser {

    /*public static void main(String[] args) {
        // Read URLs from a text file
        List<String> urls = readUrlsFromFile("urls.txt");
        List<String> links = new ArrayList<String>();
        List<String> sentences = new ArrayList<String>();

        // Parse each webpage and gather information
        for (String url : urls) {
            try {
                // Fetch the webpage using JSoup
                Document doc = Jsoup.connect(url).get();

                // Extract all the text content from the page
                String allText = getAllText(doc);

                // Find all sentences in the text
                sentences = findSentences(allText);

                // Extract links from the page
               //links = extractLinks(doc);

                // Write sentences and links to a file
                writeToFile(sentences, links, "output_" + getFilenameFromUrl(url) + ".txt");

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }*/

    // Function to read URLs from a text file
    private static List<String> readUrlsFromFile(String filename) {
        List<String> urls = new ArrayList<>();
@@ -55,11 +19,6 @@ public class RegexParser {
        return urls;
    }

    // Function to extract all text content from the page
    private static String getAllText(Document doc) {
        return doc.text();
    }

    // Function to find all sentences in the given text
    private static List<String> findSentences(String text) {
        List<String> sentences = new ArrayList<>();
+14 −9
# Group 6
Our Java web crawler and language checker project.
## Summary
Our group aims to develop a Language Correction software that efficiently analyzes text for correct language usage by crawling through pages of the target language(s) on the Internet. The evaluator will compile common usage patterns of words and phrases from crawled texts and use them as a reference to identify words or phrases that are consistent or inconsistent with the gathered data. This project will assist users in improving their language skills and ensuring the accuracy of their written content.

### Credits
Webcrawler by Alex Melnick 

Data Parser by Michael
@@ -14,24 +16,27 @@ https://docs.oracle.com/javase/tutorial/networking/urls/index.html
The only library used for the checker and corrector is https://github.com/xerial/sqlite-jdbc and its dependency https://www.slf4j.org. These are both provided in the repo, and there is no need to download them.

## Usage
To run the webcrawler, simply create a ScratchCrawler object and run the .crawl command with the seed URL as the arguement.
### Web Crawler
To run the web crawler, create a ScratchCrawler object and call its .crawl method with the seed URL as the argument.

`ScratchCrawler crawler = new ScratchCrawler(); // Create a new ScratchCrawler object`

`crawler.crawl("https://archive.org/details/bostonpubliclibrary"); // Start off the crawl with the seed page`

To build the project for CheckerCorrector we are using a make file. running "make dev_corrector" and "make dev_checker" will compile and build the checker.jar and corrector.jar with the user interface requested.
### Checker/Corrector
To build the project for CheckerCorrector we use a Makefile. Running `make dev_corrector` and `make dev_checker` will compile and build `checker.jar` and `corrector.jar` with the requested user interface.

## checker
## How it Works
### Checker
Our checker uses two different methods to assign confidence points. The first is a state machine and the second is an n-gram-inspired implementation. The final score is a weighted sum of the scores from these two methods.
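
As a rough illustration of the weighted sum, the sketch below combines two scores with a single weight. The class name `ConfidenceCombiner`, the method `combine`, and the convention that the weight lies in [0, 1] are assumptions made for this example; the actual weights and score ranges used by the checker are not stated here.

```java
// Hypothetical helper, not part of the repository: combines the two checker scores.
final class ConfidenceCombiner {

    /**
     * Returns weight * stateMachineScore + (1 - weight) * nGramScore.
     * The single-weight form is an assumption; the real checker may weight differently.
     */
    static double combine(double stateMachineScore, double nGramScore, double stateMachineWeight) {
        if (stateMachineWeight < 0.0 || stateMachineWeight > 1.0) {
            throw new IllegalArgumentException("weight must be in [0, 1]");
        }
        return stateMachineWeight * stateMachineScore
                + (1.0 - stateMachineWeight) * nGramScore;
    }
}
```

With equal weights (`0.5`), the result is simply the average of the two scores.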

### State Machine Checker
In order to have a good working State machine we need to first update the grammar and roles of each word manually. This step will be automatized to some extent in the next milestones. Tokens are provided in CheckerCorrector/SQLite/mydatabase.db, and also the basic graph provided by CheckerCorrector/DirectedGraph/BasicGraph.java. A sentence will first go through a typo checker and get updated if needed (it will also affect the confidence score). Then the sentence will be tokenized, and using the provided graph it will check whether the sentence is following the correct format or not, for each miss on any edge of the graph a penalty will be added to the confidence score.
#### State Machine Checker
For the state machine to work well, we first need to update the grammar and the role of each word manually. This step will be automated to some extent in the next milestones. Tokens are provided in `CheckerCorrector/SQLite/mydatabase.db`, and the basic graph is provided by `CheckerCorrector/DirectedGraph/BasicGraph.java`. A sentence first goes through a typo checker and is updated if needed (this also affects the confidence score). The sentence is then tokenized, and the provided graph is used to check whether it follows the correct format. Each miss on an edge of the graph adds a penalty to the confidence score.
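
The edge-penalty idea can be sketched as follows. This is not the code in `BasicGraph.java`; the class `RoleGraphChecker`, the role names, the starting score, and the fixed penalty per miss are all assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a directed graph over word roles, with a penalty per missing edge.
final class RoleGraphChecker {
    // For each role, the set of roles allowed to follow it.
    private final Map<String, Set<String>> edges;

    RoleGraphChecker(Map<String, Set<String>> edges) {
        this.edges = edges;
    }

    /** Counts consecutive role pairs that have no edge in the graph. */
    int countMissingEdges(List<String> roles) {
        int misses = 0;
        for (int i = 0; i + 1 < roles.size(); i++) {
            Set<String> allowedNext = edges.getOrDefault(roles.get(i), Set.of());
            if (!allowedNext.contains(roles.get(i + 1))) {
                misses++;
            }
        }
        return misses;
    }

    /** Subtracts a fixed penalty per missing edge, never going below zero. */
    double score(List<String> roles, double startScore, double penaltyPerMiss) {
        return Math.max(0.0, startScore - penaltyPerMiss * countMissingEdges(roles));
    }
}
```

For example, with edges `DET -> NOUN`, `NOUN -> VERB`, and `VERB -> DET`, the role sequence `DET NOUN VERB` has no misses, while `DET VERB NOUN` misses both of its transitions.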

### n-Grams checker
#### n-Grams checker
This checker uses the crawled data and scores a sentence by summing the n-gram probabilities of the phrases in it.
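
A minimal sketch of this scoring, restricted to bigrams, is shown below. It assumes that a phrase's probability is its relative frequency among all bigrams seen in the crawled text; the class `BigramScorer` and its methods are hypothetical, and the project's own n-gram model may be defined differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: bigram relative frequencies summed over a sentence.
final class BigramScorer {
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalBigrams = 0;

    /** Counts every adjacent token pair in a tokenized piece of crawled text. */
    void train(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            totalBigrams++;
        }
    }

    /** Relative frequency of the bigram; 0 if it was never seen. */
    double probability(String first, String second) {
        if (totalBigrams == 0) {
            return 0.0;
        }
        return counts.getOrDefault(first + " " + second, 0) / (double) totalBigrams;
    }

    /** Sums the probabilities of all adjacent token pairs in the sentence. */
    double score(List<String> sentenceTokens) {
        double sum = 0.0;
        for (int i = 0; i + 1 < sentenceTokens.size(); i++) {
            sum += probability(sentenceTokens.get(i), sentenceTokens.get(i + 1));
        }
        return sum;
    }
}
```

Because the score is a plain sum, longer sentences can accumulate higher scores. Normalizing by sentence length is one possible refinement, but it is not described in this README.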


## Corrector
### Corrector
The current corrector uses the typo corrector from the checker, together with the state machine, to suggest possible corrections. Corrections are based on the most similar path through the state machine. To find a similar path, we first run a depth-first search (DFS) on the graph, starting from the first token, and store all the possible paths. Then, based on the most similar path, we decide whether to change, delete, or add a token, and suggest a token with a similar role.
At this point the correction is limited by the complexity of the state machine, but it will be improved for the next milestone.
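
The path search can be sketched as below. The README does not say how "most similar" is measured, so this example assumes edit distance over role sequences, where a substitution, deletion, or insertion corresponds to changing, deleting, or adding a token. The class `PathCorrectorSketch`, the `maxLength` bound (needed because the graph may contain cycles), and the choice to record every prefix as a candidate path are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerate graph paths by DFS, then pick the closest one.
final class PathCorrectorSketch {

    /** Collects every path of up to maxLength roles that starts at the given role. */
    static List<List<String>> allPaths(Map<String, List<String>> graph, String start, int maxLength) {
        List<List<String>> paths = new ArrayList<>();
        List<String> current = new ArrayList<>();
        current.add(start);
        dfs(graph, current, maxLength, paths);
        return paths;
    }

    private static void dfs(Map<String, List<String>> graph, List<String> current,
                            int maxLength, List<List<String>> out) {
        out.add(new ArrayList<>(current)); // record the path built so far
        if (current.size() >= maxLength) {
            return; // length bound keeps cyclic graphs from recursing forever
        }
        String last = current.get(current.size() - 1);
        for (String next : graph.getOrDefault(last, List.of())) {
            current.add(next);
            dfs(graph, current, maxLength, out);
            current.remove(current.size() - 1); // backtrack
        }
    }

    /** Levenshtein distance between two role sequences. */
    static int editDistance(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.size()][b.size()];
    }

    /** Returns the enumerated path with the smallest edit distance to the input roles. */
    static List<String> closestPath(Map<String, List<String>> graph, List<String> roles, int maxLength) {
        if (roles.isEmpty()) {
            throw new IllegalArgumentException("roles must not be empty");
        }
        List<String> best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (List<String> path : allPaths(graph, roles.get(0), maxLength)) {
            int distance = editDistance(roles, path);
            if (distance < bestDistance) { // ties keep the first path found
                best = path;
                bestDistance = distance;
            }
        }
        return best;
    }
}
```

Once the closest path is known, comparing it position by position with the input roles shows where a token would need to be changed, deleted, or added.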