Commit 0af524fc authored by Manuel Segimon

Merge branch '21-fix-corpus-read-error' into 'master'

Resolve "Fix corpus read error"

Closes #21

See merge request ec504/ec504_projects/group7!13
parents 3fc60724 5f442f58
+3 −1
@@ -169,7 +169,7 @@ Built initial Corrector Module using ngrams on brown corpus

Integrated separate checker/corrector and crawler modules into one package

-Implemented Serialization and Deserialization of TrieNode for crawler (limited compression to 1KB per site by tracking incremental compressed size)
+Implemented Serialization and Deserialization of TrieNode for crawler

Implemented streaming for GUI so that output appears as it crawls each site

@@ -187,6 +187,8 @@ Built initial checker module using statistical methods

Implemented TrieNode structure for crawler to use ngrams and track conditional probabilities

+Limited compression to 1KB per site by tracking incremental compressed size in crawler

Implemented logic for making correction and ranking based on differences

Implemented GUI text highlighter for checker
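
The README entry about limiting compression to 1KB per site does not show how the size is tracked. The following is only a minimal sketch of one way to do it, assuming the data can be added word by word and recompressed with `java.util.zip.Deflater`. The class name `CappedCompressionDemo`, the helper `compressWithinLimit`, and the word-level granularity are all hypothetical and are not taken from the crawler.

```java
import java.nio.charset.StandardCharsets;
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical illustration; not the crawler's actual implementation.
final class CappedCompressionDemo {
    static final int LIMIT_BYTES = 1024; // the 1KB per-site cap named in the README

    // Compress a byte array with Deflater and return the compressed bytes.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Add words one at a time and keep the longest prefix whose
    // compressed form still fits within LIMIT_BYTES.
    static byte[] compressWithinLimit(String[] words) {
        String accepted = "";
        byte[] best = compress(new byte[0]);
        for (String word : words) {
            String candidate = accepted.isEmpty() ? word : accepted + " " + word;
            byte[] compressed = compress(candidate.getBytes(StandardCharsets.UTF_8));
            if (compressed.length > LIMIT_BYTES) {
                break; // the next word would push the site over the cap
            }
            accepted = candidate;
            best = compressed;
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = new String[5000];
        for (int i = 0; i < words.length; i++) {
            words[i] = "w" + (i * 7919L); // varied tokens so the data does not compress trivially
        }
        System.out.println("compressed bytes kept: " + compressWithinLimit(words).length);
    }
}
```

This version recompresses the whole prefix on every step, which is simple to reason about but quadratic in the input size. A streaming approach that checks the running output size would be cheaper if the cap is applied to large pages.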
+5 −1
@@ -22,8 +22,12 @@ import random
with open('/Users/manuelsegimonplana/Documents/Current Courses/Not Completed Homework/DS - Project/group7/src/main/java/resources/english.txt', 'r') as file:
    lines = file.readlines()

+# Remove lines with less than 4 words
+lines = [line for line in lines if len(line.split()) >= 4]

# Shuffle the lines
random.shuffle(lines)

# Optionally, you can write the cleaned lines back to a file
with open('src/main/java/resources/brown.txt', 'a') as file:
-    file.writelines(lines[:450000]) # Last working number: 500000
+    file.writelines(lines[:10000])
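
For readers following the Java side of the project, the cleaning step in this script (keep lines with at least four whitespace-separated words, shuffle, take the first N) can be sketched as a pure function. The class `CorpusSampleDemo`, the method `sample`, and the `seed` parameter are hypothetical names introduced here for illustration. The script itself uses Python's unseeded `random.shuffle` and opens the output in append mode, so repeated runs add to `brown.txt` rather than replace it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical Java rendering of the script's filter-shuffle-truncate step.
final class CorpusSampleDemo {
    // Keep lines with at least minWords words, shuffle them, return at most maxLines.
    static List<String> sample(List<String> lines, int minWords, int maxLines, long seed) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Mirrors Python's str.split(): split on runs of whitespace, ignore edges.
            int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            if (words >= minWords) {
                kept.add(line);
            }
        }
        // A fixed seed makes the sample reproducible (an assumption; the script does not seed).
        Collections.shuffle(kept, new Random(seed));
        return new ArrayList<>(kept.subList(0, Math.min(maxLines, kept.size())));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "too short\n",
            "this line has five words\n",
            "\n",
            "another line with enough words here\n");
        System.out.println(sample(lines, 4, 10000, 42L));
    }
}
```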
+1 −1
@@ -278,7 +278,7 @@ public class crawler {
                }
            }
        } else {
-            extractWordUsage(web_data.text(), wordUsage);
+            extractWordUsage(web_data.text().replaceAll("\\p{Punct}", ""), wordUsage);
            // System.out.println("Ngrams built successfully. for size:"+MAXNGRAM);
            uncompressedData = wordUsage.serialize();
            compressedData = compress(uncompressedData);
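
To make the effect of the new `replaceAll("\\p{Punct}", "")` call concrete, here is a small standalone sketch. `PunctStripDemo`, `stripPunct`, and `tokens` are hypothetical names; the real `extractWordUsage` is not shown in this diff, so the whitespace tokenizer below is only an assumption about what happens downstream. In Java, `\p{Punct}` matches ASCII punctuation by default, so apostrophes and hyphens inside words are removed too, while typographic (non-ASCII) quotes are left in place.

```java
// Hypothetical illustration of the punctuation stripping added in this commit.
final class PunctStripDemo {
    // Same call shape as the changed line: delete every character matched by \p{Punct}.
    static String stripPunct(String text) {
        return text.replaceAll("\\p{Punct}", "");
    }

    // Assumed tokenizer for illustration only: split the cleaned text on whitespace.
    static String[] tokens(String text) {
        String cleaned = stripPunct(text).trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // prints "Hello world Its a wellknown fact"
        System.out.println(stripPunct("Hello, world! It's a well-known fact."));
        // The right single quotation mark (U+2019) is not ASCII, so it is kept.
        System.out.println(stripPunct("don\u2019t"));
    }
}
```

One consequence worth keeping in mind is that contractions and hyphenated words collapse into single tokens ("It's" becomes "Its", "well-known" becomes "wellknown"), which may or may not be what the n-gram counts should see.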
+0 −0

File changed.

Preview suppressed by a .gitattributes entry or the file's encoding is unsupported.

+0 −0

File changed.

Preview suppressed by a .gitattributes entry or the file's encoding is unsupported.
