@@ -144,7 +144,7 @@ To execute the program via the command-line interface, navigate to the directory
./md --makeFewComplex: This command creates 10,000 molecule files, each with over 10,000 atoms. Generated molecules are saved in a folder called `complex` that is located in the parent directory of the project folder. Users do not need to specify a file or folder path for this feature.
./md --addProteins [FILE PATH]: This command adds proteins created by the `--makeManySimple` and `--makeFewComplex` commands. Please make sure to select a file path, which is either the Complex or Simple directory.
./md --addProteins [FILE PATH]: This command adds proteins created by the `--makeManySimple` and `--makeFewComplex` commands. Please make sure to select a file path, which is either the `complex` or `simple` directory.
./md --marco: This command pings the server to check if it is still alive.
@@ -201,7 +201,7 @@ The PubChem database is supported by a Power User Gateway (PUG) REST-style API,
The script parses the JSON response for the compound’s CID, Title, and SMILES and generates valid molecule inputs for the database. PySMILES provides a helper function, `read_smiles`, which generates a NetworkX graph object of the molecule. NetworkX is then used to write an edge list for the graph, and metadata such as the Title of the compound, the number of atoms, and the atom names are placed in front of the edge list. This constitutes a valid molecule input format for our database, which is then saved as a `.txt` file with the name [“molecule” + CID + “.txt”]. When interpreting SMILES strings, any SMILES containing non-integer bond values (e.g., 1.5) is ignored. Additionally, any PubChem CID without a Title index is also ignored.
Two scripts are provided for generating input files, one that is meant to be called by the Main.java class (`downloadPubChem.py`) and one that users run interactively through a CLI (molecule_input.py). The Main.java class provides a range of CIDs, whose molecules are automatically inserted into the database, with the command --downloadPubChem start,end. The interactive command-line script provides several other functions, such as generating input files from a user-provided SMILES string and generating isomorphic test files to test the --findMolecule command. The user is prompted to choose which functions to use, and the generated molecule input files are created in their respective folders.
Two scripts are provided for generating input files, one that is meant to be called by the Main.java class (`downloadPubChem.py`) and one that users run interactively through a CLI (molecule_input.py). The Main.java class provides a range of CIDs, whose molecules are automatically inserted into the database, with the command --downloadPubChem start,end. The interactive command-line script provides several other functions, such as generating input files from a user-provided SMILES string and generating isomorphic test files to test the `--findMolecule` command. The user is prompted to choose which functions to use, and the generated molecule input files are created in their respective folders.
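The input-file layout described in this section (metadata first, then the edge list) can be illustrated with a minimal Python sketch. The exact line layout is an assumption: the text only says that the title, the atom count, and the atom names precede the edge list. The function names `format_molecule` and `file_name` are hypothetical, and plain lists stand in for the NetworkX graph that `read_smiles` would produce in the real scripts.

```python
from typing import List, Optional, Tuple

Bond = Tuple[int, int, float]  # (atom index, atom index, bond order)


def format_molecule(title: str, atoms: List[str], bonds: List[Bond]) -> Optional[str]:
    """Return the text of a molecule input file, or None if the molecule is skipped.

    Assumed layout (one item per line): title, atom count, atom names, edges.
    """
    # Compounds without a title are ignored, as described in the text.
    if not title:
        return None
    # Molecules with non-integer bond orders (e.g. 1.5) are ignored as well.
    if any(order != int(order) for _, _, order in bonds):
        return None
    lines = [title, str(len(atoms))]
    lines.extend(atoms)
    lines.extend(f"{a} {b}" for a, b, _ in bonds)
    return "\n".join(lines) + "\n"


def file_name(cid: int) -> str:
    """Build the output name from the CID: "molecule" + CID + ".txt"."""
    return f"molecule{cid}.txt"
```

For example, `format_molecule("water", ["O", "H", "H"], [(0, 1, 1), (0, 2, 1)])` yields a seven-line file, while a molecule containing a bond of order 1.5 yields `None` and would not be written.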
@@ -212,9 +212,9 @@ Two scripts are provided for generating input files, one that is meant to be cal
**How it was implemented:**
10 million molecules are unavailable in the PubChem database or any readily accessible database. Therefore, a ProteinFactory class is created to procedurally generate unique proteins from a set of amino acids. These proteins are saved to a designated location in the file system in the same format as the user input files, which then can be added to a database with the `--addProteins` command.
10 million molecules are unavailable in the PubChem database or any readily accessible database. Therefore, a `ProteinFactory` class is created to procedurally generate unique proteins from a set of amino acids. These proteins are saved to a designated location in the file system in the same format as the user input files, which then can be added to a database with the `--addProteins` command.
Because saving 10 million files in a single directory is too demanding, 100 child directories are created, each containing 100,000 protein files. The command for generating 10 million protein files is --makeManySimple, and the default location is ../simple. The reason for traveling to the parent directory is to hide the files from the IDE in use, which may throw an error in an attempt to index the files.
Because saving 10 million files in a single directory is too demanding, 100 child directories are created, each containing 100,000 protein files. The command for generating 10 million protein files is `--makeManySimple`, and the default location is ../simple. The reason for traveling to the parent directory is to hide the files from the IDE in use, which may throw an error in an attempt to index the files.
Each protein contains at least 52 atoms to ensure the uniqueness of each protein, and at most 136 atoms because no more are necessary to generate 10 million unique proteins.
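The directory layout described above (100 child directories with 100,000 files each) amounts to integer division on a running file index. The Python sketch below is only an illustration of that arithmetic, not the `ProteinFactory` code: the child-directory names (`0` to `99`) and the file name pattern `protein<index>.txt` are assumptions.

```python
from pathlib import Path

FILES_PER_DIR = 100_000   # from the text: 100,000 protein files per child directory
TOTAL_FILES = 10_000_000  # from the text: 10 million protein files in total


def shard_path(index: int, root: Path = Path("../simple")) -> Path:
    """Map a protein index to its file path inside one of the child directories."""
    if not 0 <= index < TOTAL_FILES:
        raise ValueError(f"index out of range: {index}")
    child = index // FILES_PER_DIR  # 0..99, hypothetical directory naming
    return root / str(child) / f"protein{index}.txt"
```

The same idea would apply to the `complex` folder with 1,000 files per child directory and 10 child directories.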
@@ -231,9 +231,9 @@ Dividing the database into multiple partitions is a solution in plan. They will
**How it was implemented:**
Molecules comprised of over 10,000 atoms are not available in the PubChem database, or in any readily accessible database. Therefore, a ProteinFactory class is created to procedurally generate unique proteins from a set of amino acids. These proteins are saved to a designated location in the file system, in the same format as the user input files, which then can be added to a database with the --addProteins command.
Molecules comprised of over 10,000 atoms are not available in the PubChem database, or in any readily accessible database. Therefore, a `ProteinFactory` class is created to procedurally generate unique proteins from a set of amino acids. These proteins are saved to a designated location in the file system, in the same format as the user input files, which then can be added to a database with the `--addProteins` command.
Because saving 10,000 protein files with over 10,000 atoms each in a single directory is too demanding, 10 child directories are created, each containing 1,000 protein files. The command for generating the 10,000 protein files is --makeFewComplex, and the default location is ../complex. The reason for traveling to the parent directory is to hide the files from the IDE in use, which may throw an error in an attempt to index the files.
Because saving 10,000 protein files with over 10,000 atoms each in a single directory is too demanding, 10 child directories are created, each containing 1,000 protein files. The command for generating the 10,000 protein files is `--makeFewComplex`, and the default location is ../complex. The reason for traveling to the parent directory is to hide the files from the IDE in use, which may throw an error in an attempt to index the files.
The feature is not yet complete because the data structures of the database and the molecules are too memory-intensive. The Java program crashed with an `OutOfMemoryError` before all 10,000 molecules with 10,000 atoms each could be added to the database.
@@ -268,7 +268,7 @@ Given that our molecule database is organized as a HashMap where the value for e
**How it was implemented:**
Subgraph search was implemented using the same heuristics as findMolecule(), where certain qualities of the molecule were used to eliminate impossible matches. If the subgraph contained more atoms, or more atoms of any element, than the target molecule, the target molecule was no longer considered. The number of edges was also a factor. Once these preliminary tests were passed, each atom in the subgraph was tested to confirm that it had at least one candidate in the target molecule. For example, if a carbon atom was connected to two hydrogen atoms with single bonds in the subgraph, then at least one carbon atom in the target molecule must be connected to at least two hydrogen atoms with single bonds. The candidates are stored in a HashMap, with the subgraph’s atom acting as a key and an array list of candidate atoms as the value. After all candidates are found, a linked list traverses through the subgraph, choosing one of the candidates to pair with and adding neighbors similar to Breadth First Search. If no available candidates are left to choose from during traversal, the linked list will traverse backward through its parent, choosing a different candidate as an option. If the linked list attempts to backtrack past the head, no match for the subgraph exists in the molecule. If the traversal covers the entire subgraph, however, the subgraph does exist in the target molecule. This process is repeated for every possible molecule in the database.
Subgraph search was implemented using the same heuristics as `findMolecule()`, where certain qualities of the molecule were used to eliminate impossible matches. If the subgraph contained more atoms, or more atoms of any element, than the target molecule, the target molecule was no longer considered. The number of edges was also a factor. Once these preliminary tests were passed, each atom in the subgraph was tested to confirm that it had at least one candidate in the target molecule. For example, if a carbon atom was connected to two hydrogen atoms with single bonds in the subgraph, then at least one carbon atom in the target molecule must be connected to at least two hydrogen atoms with single bonds. The candidates are stored in a HashMap, with the subgraph’s atom acting as a key and an array list of candidate atoms as the value. After all candidates are found, a linked list traverses through the subgraph, choosing one of the candidates to pair with and adding neighbors similar to Breadth First Search. If no available candidates are left to choose from during traversal, the linked list will traverse backward through its parent, choosing a different candidate as an option. If the linked list attempts to backtrack past the head, no match for the subgraph exists in the molecule. If the traversal covers the entire subgraph, however, the subgraph does exist in the target molecule. This process is repeated for every possible molecule in the database.
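The three stages described above (cheap pre-filters, per-atom candidate lists, then a backtracking search) can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's Java code: molecules are represented as a list of element symbols plus `(i, j, order)` bond tuples, all function names are hypothetical, and plain recursion over the atoms in index order replaces the linked-list, BFS-style traversal.

```python
from collections import Counter


def adjacency(n_atoms, bonds):
    """Build {atom: {neighbor: bond order}} from (i, j, order) tuples."""
    adj = {i: {} for i in range(n_atoms)}
    for a, b, order in bonds:
        adj[a][b] = order
        adj[b][a] = order
    return adj


def passes_prefilter(sub_atoms, sub_bonds, tgt_atoms, tgt_bonds):
    """Reject targets with too few atoms, bonds, or atoms of some element."""
    if len(sub_atoms) > len(tgt_atoms) or len(sub_bonds) > len(tgt_bonds):
        return False
    have = Counter(tgt_atoms)
    return all(have[e] >= n for e, n in Counter(sub_atoms).items())


def profile(i, atoms, adj):
    """Count the (neighbor element, bond order) pairs around atom i."""
    return Counter((atoms[j], order) for j, order in adj[i].items())


def is_subgraph(sub_atoms, sub_bonds, tgt_atoms, tgt_bonds):
    if not passes_prefilter(sub_atoms, sub_bonds, tgt_atoms, tgt_bonds):
        return False
    sub_adj = adjacency(len(sub_atoms), sub_bonds)
    tgt_adj = adjacency(len(tgt_atoms), tgt_bonds)

    # Candidate map: subgraph atom -> target atoms with the same element and
    # at least the same neighbors (Counter subtraction drops satisfied needs).
    cands = {}
    for s in range(len(sub_atoms)):
        want = profile(s, sub_atoms, sub_adj)
        cands[s] = [t for t in range(len(tgt_atoms))
                    if tgt_atoms[t] == sub_atoms[s]
                    and not (want - profile(t, tgt_atoms, tgt_adj))]
        if not cands[s]:
            return False

    mapping = {}  # subgraph atom -> chosen target atom

    def extend(s):
        if s == len(sub_atoms):
            return True  # every subgraph atom has been paired
        for t in cands[s]:
            if t in mapping.values():
                continue  # target atom already used
            # Every bond to an already-paired neighbor must exist in the target.
            if all(tgt_adj[t].get(mapping[n]) == order
                   for n, order in sub_adj[s].items() if n in mapping):
                mapping[s] = t
                if extend(s + 1):
                    return True
                del mapping[s]  # backtrack and try the next candidate
        return False

    return extend(0)
```

Running out of candidates at the first atom corresponds to backtracking past the head of the list in the description, and pairing the last atom corresponds to traversing the whole subgraph.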
# Changes from Initial Project Defense
@@ -290,7 +290,7 @@ Link to a folder containing all testing code utilized to observe the correctness
# Work Breakdown
Hyunsoo Kim implemented Main.java, MoleculeDatabase.java, and `ProteinFactory.java`. In addition, Hyunsoo worked to implement a partitioned database scheme with manual memory management, helped discover useful PubChem APIs, put together the testing and benchmarking suite, and contributed to the README.md file.
Caelan Wong implemented the `mostSimilar()` method in Molecule.java and MoleculeDatabase.java to run whenever `findMolecule()` returns null. Also, Caelan helped with the early implementation of the `addMolecule()` method and created the PeriodicTable.java enum. In addition, Caelan implemented the `deleteMolecule()` function in the GUI and command line interface. Lastly, Caelan helped with the README.md.
Caelan Wong implemented the `mostSimilar()` method in Molecule.java and MoleculeDatabase.java to run whenever `findMolecule()` returns null. Also, Caelan helped with the early implementation of the `addMolecule()` method and created the `PeriodicTable.java` enum. In addition, Caelan implemented the `deleteMolecule()` function in the GUI and command line interface. Lastly, Caelan helped with the README.md.
Phuong Khanh Tran helped implement `MoleculeDatabase.java`, which initializes the database, designed `GUI.java`, which constructs the graphical user interface, and coded `MDB.java`, which creates the database that can work with the GUI. Additionally, Phuong contributed to writing the README.md and INSTALL.txt files.