Removed JSoup from Crawler Parser, Updated README (8719bc33) · Commits · EC504 Spring 2024 Group Projects / Group6

Crawler/src/main/java/RegexParser.java

+0 −41

		import org.jsoup.Jsoup;
		import org.jsoup.nodes.Document;
		import org.jsoup.nodes.Element;
		import org.jsoup.select.Elements;

		import java.io.*;
		import java.util.ArrayList;
		import java.util.List;
		@@ -10,37 +5,6 @@ import java.util.regex.Matcher;
		import java.util.regex.Pattern;

		public class RegexParser {

		    /*public static void main(String[] args) {
		        // Read URLs from a text file
		        List<String> urls = readUrlsFromFile("urls.txt");
		        List<String> links = new ArrayList<String>();
		        List<String> sentences = new ArrayList<String>();

		        // Parse each webpage and gather information
		        for (String url : urls) {
		            try {
		                // Fetch the webpage using JSoup
		                Document doc = Jsoup.connect(url).get();

		                // Extract all the text content from the page
		                String allText = getAllText(doc);

		                // Find all sentences in the text
		                sentences = findSentences(allText);

		                // Extract links from the page
		                //links = extractLinks(doc);

		                // Write sentences and links to a file
		                writeToFile(sentences, links, "output_" + getFilenameFromUrl(url) + ".txt");

		            } catch (IOException e) {
		                e.printStackTrace();
		            }
		        }
		    }*/

		    // Function to read URLs from a text file
		    private static List<String> readUrlsFromFile(String filename) {
		        List<String> urls = new ArrayList<>();
		@@ -55,11 +19,6 @@ public class RegexParser {
		        return urls;
		    }

		    // Function to extract all text content from the page
		    private static String getAllText(Document doc) {
		        return doc.text();
		    }

		    // Function to find all sentences in the given text
		    private static List<String> findSentences(String text) {
		        List<String> sentences = new ArrayList<>();

README.md

+14 −9

		# Group 6
		Our Java web crawler and language checker project.
		## Summary
		Our group aims to develop a Language Correction software that efficiently analyzes text for correct language usage by crawling through pages of the target language(s) on the Internet. The evaluator will compile common usage patterns of words and phrases from crawled texts and use them as a reference to identify words or phrases that are consistent or inconsistent with the gathered data. This project will assist users in improving their language skills and ensuring the accuracy of their written content.

		### Credits
		Webcrawler by Alex Melnick

		Data Parser by Michael
		@@ -14,24 +16,27 @@ https://docs.oracle.com/javase/tutorial/networking/urls/index.html
		The only library used for the checker and corrector is https://github.com/xerial/sqlite-jdbc and its dependency https://www.slf4j.org. These are both provided in the repo, and there is no need to download them.

		## Usage
		To run the webcrawler, simply create a ScratchCrawler object and run the .crawl command with the seed URL as the arguement.
		### Web Crawler
		To run the web crawler, create a `ScratchCrawler` object and call its `crawl` method with the seed URL as the argument.

		`ScratchCrawler crawler = new ScratchCrawler(); // Create a new ScratchCrawler object`

		`crawler.crawl("https://archive.org/details/bostonpubliclibrary"); // Start off the crawl with the seed page`

		To build the project for CheckerCorrector we are using a make file. running "make dev_corrector" and "make dev_checker" will compile and build the checker.jar and corrector.jar with the user interface requested.
		### Checker/Corrector
		To build CheckerCorrector we use a Makefile. Running `make dev_corrector` and `make dev_checker` compiles and builds `checker.jar` and `corrector.jar` with the requested user interface.

		## checker
		## How it Works
		### Checker
		Our checker uses two methods to assign confidence points: a state machine and an n-grams-inspired implementation. The final score is a weighted sum of the scores from these two methods.
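
As a rough illustration of the weighted sum described above, the sketch below combines two scores with a single weight. The class name `WeightedConfidence`, the method `combine`, and the choice of one weight in `[0, 1]` are assumptions made for this example; the README does not state the actual weights or how the project implements this step.

```java
/** Hypothetical sketch: combine two checker scores into one confidence value. */
final class WeightedConfidence {
    private WeightedConfidence() {}

    /**
     * Returns a weighted sum of the two scores.
     *
     * @param stateMachineScore  score from the state machine checker
     * @param nGramScore         score from the n-grams checker
     * @param stateMachineWeight weight in [0, 1] given to the state machine score
     *                           (assumed; the n-grams score gets the remainder)
     */
    static double combine(double stateMachineScore, double nGramScore, double stateMachineWeight) {
        if (stateMachineWeight < 0.0 || stateMachineWeight > 1.0) {
            throw new IllegalArgumentException("weight must be in [0, 1]");
        }
        return stateMachineWeight * stateMachineScore
                + (1.0 - stateMachineWeight) * nGramScore;
    }
}
```

Using one weight and its complement keeps the combined value on the same scale as the inputs, which is one possible design; independent weights per method would work equally well.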

		### State Machine Checker
		In order to have a good working State machine we need to first update the grammar and roles of each word manually. This step will be automatized to some extent in the next milestones. Tokens are provided in CheckerCorrector/SQLite/mydatabase.db, and also the basic graph provided by CheckerCorrector/DirectedGraph/BasicGraph.java. A sentence will first go through a typo checker and get updated if needed (it will also affect the confidence score). Then the sentence will be tokenized, and using the provided graph it will check whether the sentence is following the correct format or not, for each miss on any edge of the graph a penalty will be added to the confidence score.
		#### State Machine Checker
		For the state machine to work well, we first need to update the grammar and the role of each word manually. This step will be automated to some extent in the next milestones. Tokens are provided in `CheckerCorrector/SQLite/mydatabase.db`, and the basic graph is provided by `CheckerCorrector/DirectedGraph/BasicGraph.java`. A sentence first goes through a typo checker and is updated if needed (this also affects the confidence score). The sentence is then tokenized, and the provided graph is used to check whether it follows the correct format. For each miss on any edge of the graph, a penalty is added to the confidence score.
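
The edge check described above could look roughly like the following. This is a minimal sketch, not the contents of `BasicGraph.java`: the class `RoleGraphSketch`, the role labels, the starting score, and the per-miss penalty are all hypothetical and chosen only to show the idea of penalizing transitions that are missing from a directed graph.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a directed graph over word roles. */
final class RoleGraphSketch {
    private final Map<String, Set<String>> edges = new HashMap<>();

    /** Registers that role {@code to} may directly follow role {@code from}. */
    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    boolean hasEdge(String from, String to) {
        return edges.getOrDefault(from, Collections.emptySet()).contains(to);
    }

    /** Counts consecutive role pairs that have no matching edge in the graph. */
    int countMisses(List<String> roles) {
        int misses = 0;
        for (int i = 0; i + 1 < roles.size(); i++) {
            if (!hasEdge(roles.get(i), roles.get(i + 1))) {
                misses++;
            }
        }
        return misses;
    }

    /** Subtracts an assumed fixed penalty from the starting score for each miss. */
    double score(List<String> roles, double startScore, double penaltyPerMiss) {
        return startScore - penaltyPerMiss * countMisses(roles);
    }
}
```

With edges `DET -> NOUN` and `NOUN -> VERB`, the role sequence `DET, NOUN, VERB` has no misses, while `NOUN, DET, VERB` has two.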

		### n-Grams checker
		#### n-Grams checker
		This checker uses the crawled data and scores a sentence by summing the n-gram probabilities of the phrases in it.
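
One way to picture this scoring is with bigrams (n = 2). The sketch below is an assumption-laden illustration, not the project's code: the class `BigramScorerSketch`, whitespace tokenization, and the estimate "bigram count divided by total bigrams seen" are all choices made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: score a sentence by summing bigram probabilities. */
final class BigramScorerSketch {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    /** Adds the bigrams of one crawled sentence to the counts. */
    void train(String sentence) {
        String[] words = tokenize(sentence);
        for (int i = 0; i + 1 < words.length; i++) {
            counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            total++;
        }
    }

    /** Relative frequency of the bigram among all bigrams seen so far. */
    double probability(String first, String second) {
        if (total == 0) {
            return 0.0;
        }
        return counts.getOrDefault(first + " " + second, 0) / (double) total;
    }

    /** Sums the probabilities of every bigram in the sentence. */
    double score(String sentence) {
        String[] words = tokenize(sentence);
        double sum = 0.0;
        for (int i = 0; i + 1 < words.length; i++) {
            sum += probability(words[i], words[i + 1]);
        }
        return sum;
    }

    // Lowercases and splits on whitespace; real tokenization would be richer.
    private static String[] tokenize(String sentence) {
        String trimmed = sentence.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

After training on "the cat sat" and "the cat ran", the bigram "the cat" accounts for two of the four bigrams seen, so a sentence containing it scores higher than one made of unseen pairs.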


		## Corrector
		### Corrector
		The current corrector uses the typo corrector from the checker together with the state machine to suggest possible corrections. Corrections are based on the most similar path through the state machine. To find a similar path, we first run a depth-first search (DFS) on the graph starting from the first token and store all the possible paths. Then, based on the most similar path, we decide whether to change, delete, or add a token, and we suggest a token with a similar role.
		Correction is currently limited by the complexity of the state machine, but it will be improved for the next milestone.
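
The path enumeration step mentioned above (a DFS from the first token that stores every path) could be sketched as follows. The class `PathEnumeratorSketch`, the role labels, and the `maxLength` bound are hypothetical. The bound is an assumption added here so the search terminates on graphs with cycles; the README does not say how the project handles that case, and the similarity comparison itself is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: collect all paths from a start node with a DFS. */
final class PathEnumeratorSketch {
    private final Map<String, List<String>> adjacency = new LinkedHashMap<>();

    void addEdge(String from, String to) {
        adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Returns every path that begins at {@code start} and has at most {@code maxLength} nodes. */
    List<List<String>> pathsFrom(String start, int maxLength) {
        if (maxLength < 1) {
            throw new IllegalArgumentException("maxLength must be at least 1");
        }
        List<List<String>> result = new ArrayList<>();
        dfs(start, maxLength, new ArrayList<>(), result);
        return result;
    }

    private void dfs(String node, int maxLength, List<String> current, List<List<String>> result) {
        current.add(node);
        result.add(new ArrayList<>(current)); // store a copy of the path so far
        if (current.size() < maxLength) {
            for (String next : adjacency.getOrDefault(node, Collections.emptyList())) {
                dfs(next, maxLength, current, result);
            }
        }
        current.remove(current.size() - 1); // backtrack
    }
}
```

A corrector built on this could then compare the tokenized sentence's roles against each stored path and pick the closest one, for instance by edit distance, which is again only one possible choice.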