Update README.md (8614f736) · Commits · EC504 Spring 2024 Group Projects / Group4

README.md

+22 −0

		@@ -88,6 +88,7 @@ Since isomorphic molecules are equivalent to each other, we must test for isomor

		# Implemented features


		Can hold 10,000 molecules

		Percentage: Minimum Requirement
		@@ -98,6 +99,8 @@ As detailed in the Project Implementation section, our molecule database is buil
		public HashMap<Integer, ArrayList<Molecule>> db
		In this structure, each integer key corresponds to an ArrayList containing molecules with the same number of atoms. This structure allows our database to efficiently manage up to 10,000 molecules if needed.
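A minimal sketch of how such a bucketed map might be populated and queried is shown below. The `Molecule` stand-in, the `MoleculeBuckets` wrapper, and the `add` and `candidatesFor` helpers are hypothetical names used for illustration only; only the `db` field declaration comes from the README.

```java
import java.util.ArrayList;
import java.util.HashMap;

class MoleculeBuckets {
    // Hypothetical stand-in for the project's Molecule class.
    static class Molecule {
        final String name;
        final int atomCount;

        Molecule(String name, int atomCount) {
            this.name = name;
            this.atomCount = atomCount;
        }
    }

    // Same shape as the README's field: atom count -> molecules with that many atoms.
    public HashMap<Integer, ArrayList<Molecule>> db = new HashMap<>();

    // Hypothetical helper: place a molecule in the bucket for its atom count.
    void add(Molecule m) {
        db.computeIfAbsent(m.atomCount, k -> new ArrayList<>()).add(m);
    }

    // Hypothetical helper: only molecules with the same atom count can match.
    ArrayList<Molecule> candidatesFor(Molecule target) {
        return db.getOrDefault(target.atomCount, new ArrayList<>());
    }

    public static void main(String[] args) {
        MoleculeBuckets store = new MoleculeBuckets();
        store.add(new Molecule("water", 3));
        store.add(new Molecule("carbon dioxide", 3));
        store.add(new Molecule("methane", 5));
        // Only the two 3-atom molecules are candidates for a 3-atom query.
        System.out.println(store.candidatesFor(new Molecule("query", 3)).size());
    }
}
```

Grouping by atom count means a lookup touches only one bucket rather than the whole database.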



		Efficiently searches for a molecule up to graph isomorphism

		Percentage: Minimum Requirement
		@@ -109,6 +112,8 @@ Our database utilizes an array-based organization where each index corresponds t
		Next, we compare the target molecule against each stored molecule with the same atom count, one at a time, using the function areMoleculesEqual(). Each molecule has an array, numElements, of size 118, in which each index corresponds to the atomic number of an element and holds the number of atoms of that element in the molecule. We first compare the numElements arrays of both molecules; any difference means the molecules cannot match. We then verify that both molecules have the same total number of edges. After these checks, we examine each atom in both molecules: for every atom in the first molecule, we ensure that a corresponding atom exists in the second molecule. We compare atoms by atomic number, degree, and bonds, and require that each bond of an atom is exactly the same as the bonds of its counterpart in the second molecule. If an inconsistency arises at any point, we conclude that the molecules are not identical. The function reports that a molecule has been found in the database only if it passes all of these tests.
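The two inexpensive checks that precede the atom-by-atom comparison could look roughly like the sketch below. This is not the project's areMoleculesEqual() itself: the `QuickReject` class, the `passesCheapChecks` method, the `numEdges` field, and the mapping of atomic number Z to index Z - 1 are all assumptions made for illustration.

```java
import java.util.Arrays;

class QuickReject {
    static final int NUM_ELEMENTS = 118;

    // Hypothetical minimal molecule: an element histogram plus an edge count.
    static class Molecule {
        final int[] numElements = new int[NUM_ELEMENTS];
        int numEdges;

        // Assumption: atomic number Z is stored at index Z - 1.
        void addAtom(int atomicNumber) {
            numElements[atomicNumber - 1]++;
        }
    }

    // Returns false as soon as a cheap invariant differs, so the expensive
    // per-atom and per-bond comparison can be skipped for that candidate.
    static boolean passesCheapChecks(Molecule a, Molecule b) {
        if (!Arrays.equals(a.numElements, b.numElements)) {
            return false; // different element counts
        }
        return a.numEdges == b.numEdges; // same number of bonds required
    }

    public static void main(String[] args) {
        Molecule water = new Molecule();
        water.addAtom(1);
        water.addAtom(1);
        water.addAtom(8);
        water.numEdges = 2;

        Molecule hydrogenSulfide = new Molecule();
        hydrogenSulfide.addAtom(1);
        hydrogenSulfide.addAtom(1);
        hydrogenSulfide.addAtom(16);
        hydrogenSulfide.numEdges = 2;

        System.out.println(passesCheapChecks(water, water));
        System.out.println(passesCheapChecks(water, hydrogenSulfide));
    }
}
```

Passing these checks does not prove isomorphism; it only decides whether the deeper comparison is worth running.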




		Command-line User Interface

		Percentage: Minimum Requirement
		@@ -148,6 +153,8 @@ To execute the program via the command-line interface, navigate to the directory
		The Main.java class, which facilitates the command-line interface, also includes a client-server connection feature. When the program is executed, it first attempts to determine whether it can function as a client or server and establishes connections accordingly.




		Stand-alone GUI

		Percentage: 15%
		@@ -172,6 +179,9 @@ Database Statistics: Clicking this option prints database statistics, including

		When the GUI is closed, the program automatically saves the working database as molecule.db in the same project folder. Upon reopening, the database loads automatically, and a message confirming successful loading is displayed.




		Downloads known compounds from an existing database (1,000)

		Percentage: 15%
		@@ -184,6 +194,9 @@ The script parses the JSON response for the compound’s CID, Title, and SMILES

		Two scripts are provided for generating input files: downloadPubChem.py, which is meant to be called by the Main.java class, and molecule_input.py, which is used interactively through a CLI. Through the command --downloadPubChem start,end, the Main.java class accepts a range of CIDs whose compounds are automatically inserted into the database. The interactive script also provides several other functions, such as generating input files from a user-provided SMILES string and generating isomorphic test files for testing the --findMolecule command. The user is prompted to choose which functions to run, and the generated molecule input files are created in their respective folders.
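As an illustration of the `start,end` argument format, the sketch below parses and validates such a range. The `CidRange` class and its `parse` method are hypothetical and are not taken from Main.java; treating both bounds as inclusive positive integers is also an assumption.

```java
class CidRange {
    final int start;
    final int end;

    CidRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Parses an argument of the form "start,end" (assumed inclusive bounds).
    static CidRange parse(String arg) {
        String[] parts = arg.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start,end but got: " + arg);
        }
        int start = Integer.parseInt(parts[0].trim());
        int end = Integer.parseInt(parts[1].trim());
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("invalid CID range: " + arg);
        }
        return new CidRange(start, end);
    }

    public static void main(String[] args) {
        CidRange range = CidRange.parse("100,200");
        System.out.println(range.start + " to " + range.end);
    }
}
```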




		Implementation handles core operations on over 10 million molecules at a rate of 10 ops/sec

		Percentage: 30%
		@@ -200,6 +213,9 @@ The feature is not complete at the current time because the database and molecul

		We plan to address this by dividing the database into multiple partitions, accompanied by a file of Bloom filters, one per partition. Before a partition is loaded, its Bloom filter will be tested to rule out partitions that cannot contain the molecule. A canonicalization algorithm may be necessary to enforce a unique representation of isomorphic molecules. Keeping the partitions as separate databases also lends itself to parallelization.
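Since this design is still planned, the following is only a sketch of how a per-partition Bloom filter check might work, not the project's implementation. The `PartitionBloomFilter` class, the double-hashing scheme, and the use of a string as the canonical key of a molecule are all assumptions.

```java
import java.util.BitSet;

class PartitionBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    PartitionBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    // Assumption: key is a canonical string shared by all isomorphic molecules.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means the partition definitely lacks the key;
    // true means it may contain it (false positives are possible).
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PartitionBloomFilter filter = new PartitionBloomFilter(1024, 3);
        filter.add("CCO");
        // A partition is loaded from disk only when its filter answers true.
        System.out.println(filter.mightContain("CCO"));
    }
}
```

The filter only helps if isomorphic molecules hash to the same key, which is why a canonical representation matters here.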




		Handles core operations on 10,000 complex molecules, each with over 10,000 atoms (rate of 10 ops/sec)

		Percentage: 30%
		@@ -214,6 +230,9 @@ The feature is not complete at the current time, because the data structures of

		We plan to address this by dividing the database into multiple partitions, accompanied by a file of Bloom filters, one per partition. Before a partition is loaded, its Bloom filter will be tested to rule out partitions that cannot contain the molecule. A canonicalization algorithm may be necessary to enforce a unique representation of isomorphic molecules. Keeping the partitions as separate databases also lends itself to parallelization.




		Searches for the most similar molecule to a given molecule if no exact match exists.

		Percentage: 30%
		@@ -230,6 +249,9 @@ Each edge originating from each atom in the first molecule is compared to each e

		Given that our molecule database is organized as an array where each index i holds the molecules that have i atoms, we compute the similarity score only for molecules whose atom count is within 100 of that of the molecule we are searching for. This decision reflects our expectation that molecules with a significantly larger or smaller number of atoms will differ substantially from the target. Limiting the set of molecules considered for similarity scoring improves efficiency when atom counts vary widely across the database.
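The atom-count window described above can be sketched as follows. The `SimilarityWindow` class and `candidates` method are hypothetical, molecules are represented by plain strings for brevity, a `Map` stands in for the bucketed database, and treating the bound of 100 as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimilarityWindow {
    static final int ATOM_WINDOW = 100;

    // Collects molecules whose atom count is within ATOM_WINDOW of the target.
    static List<String> candidates(Map<Integer, List<String>> db, int targetAtoms) {
        List<String> result = new ArrayList<>();
        int low = Math.max(0, targetAtoms - ATOM_WINDOW);
        int high = targetAtoms + ATOM_WINDOW;
        for (int n = low; n <= high; n++) {
            List<String> bucket = db.get(n);
            if (bucket != null) {
                result.addAll(bucket);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> db = new HashMap<>();
        db.put(5, List.of("methane"));
        db.put(104, List.of("mid-sized molecule"));
        db.put(106, List.of("larger molecule"));
        // For a 5-atom target the window is [0, 105], so the 106-atom bucket is skipped.
        System.out.println(candidates(db, 5));
    }
}
```

Only the molecules returned here would then be passed to the similarity scoring step.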




		Subgraph search (finds all molecules that contain the provided subgraph)

		Percentage: 30%