Commit 3b1e0edc authored by Nafis A Abeer

updated readme

parent 63d31bb1
    - We integrated the system with two datasets, the tum_rgbd and the imperial_london subdirectories within the codebase. With the current system, the only requirement for adding a new dataset is supplying the VSLAM algorithm with an RGBD (RGB + depth) video. Given that video, VSLAM builds all the data we need, and we can then perform object detection and tracking.
- [30%] Object Tracking: object tracking occurs as we iterate over each KeyFrame, placing objects into the ObjectSet. At each iteration, we find new candidate objects from the new frame and iteratively check whether each is an instance of a previously found PointSet or a previously undiscovered real-world object.
- [15%] Comprehensive Benchmark: 
    - See benchmark section below
- [10%] Server and Database: we implemented the server with Spring-Boot, and wrote a GUI in JavaScript/CSS/HTML that is served by a Java backend. This makes it easy to view the pointcloud view of the room, and choose objects to be highlighted on the display.
- [10%] The original Tello drone integration failed due to constraints of the drone itself. We were able to demonstrate performing VSLAM on footage collected from a phone instead.

# Benchmarking
### Benchmarking and Time Complexity Analysis


Processing time for single frame:

22,000 points: avg 2ms, max 16ms, min under 1ms


It should be noted that before an implementation optimization on the PointSet class (changing from a list to a hash set), the system could take significantly longer (on the order of minutes). Now, the full system takes less than 2 seconds to process an 87-KeyFrame video.
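The gain comes from membership tests: a list scans linearly on every `contains` call, while a hash set answers in expected O(1) time. A minimal sketch of the two lookup styles (names are illustrative, not taken from the codebase):

```java
import java.util.List;
import java.util.Set;

public class LookupComparison {
    // Count how many probe ids are present in the stored point ids,
    // scanning a list: each contains() is an O(n) linear search.
    public static int countHitsList(List<Long> stored, List<Long> probes) {
        int hits = 0;
        for (Long p : probes) {
            if (stored.contains(p)) hits++;   // O(n) scan per probe
        }
        return hits;
    }

    // Same count against a hash set: each contains() is expected O(1).
    public static int countHitsSet(Set<Long> stored, List<Long> probes) {
        int hits = 0;
        for (Long p : probes) {
            if (stored.contains(p)) hits++;   // O(1) expected per probe
        }
        return hits;
    }
}
```

With thousands of points checked per candidate object, replacing the linear scan with hashing turns an O(n) inner step into O(1), which matches the minutes-to-seconds speedup observed above.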

YOLO detector average execution time per frame: 915 ms.

### Time Complexity

The object detection can be split into several parts:
1. **Downsampling**: from a video, a massive pointcloud is created. For example, an 87-keyframe video produces around 4,000,000 points. We perform downsampling to reduce the number of points used in our analysis, which saves time.
    - this is performed by voxel bucketing: the process creates buckets in a 3D grid and places each point of the pointcloud into its corresponding bucket (according to its x, y, z coordinates). Then we average all points within each bucket and return every bucket (now called a voxel) containing more than one point.
    - this is linear in the number of points, since finding a point's bucket takes O(1) time with the bucket grid implemented as a 3D array.
    - FINAL RUNTIME: **O(P)** for P initial points
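    A minimal sketch of this step (hypothetical names; a hash map keyed by cell id stands in for the 3D array, giving the same O(1) bucketing without preallocating the grid):

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class VoxelDownsample {
        // Bucket each point by floor(coordinate / voxelSize), then average each
        // bucket and keep only buckets (voxels) holding more than one point.
        public static List<double[]> downsample(List<double[]> points, double voxelSize) {
            Map<String, double[]> buckets = new HashMap<>();  // cell -> {sumX, sumY, sumZ, count}
            for (double[] p : points) {
                String key = (int) Math.floor(p[0] / voxelSize) + ","
                           + (int) Math.floor(p[1] / voxelSize) + ","
                           + (int) Math.floor(p[2] / voxelSize);
                double[] s = buckets.computeIfAbsent(key, k -> new double[4]);
                s[0] += p[0]; s[1] += p[1]; s[2] += p[2]; s[3] += 1;
            }
            List<double[]> voxels = new ArrayList<>();
            for (double[] s : buckets.values()) {
                if (s[3] > 1) {   // keep only voxels with more than one point
                    voxels.add(new double[]{s[0] / s[3], s[1] / s[3], s[2] / s[3]});
                }
            }
            return voxels;
        }
    }
    ```

    One hash insert per point, then one pass over the buckets, which is how the O(P) bound arises.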
2. **Point Projection**: we iterate through each frame and project all points in the now-downsampled pointcloud onto that frame. This uses a 3D projection algorithm based on linear algebra and the camera pose (its location and orientation) to find the 2D frame coordinates of each point.
    - this is linear in the number of points per frame: each projection involves only constant-time operations, and all matrix multiplications use matrices of fixed, known size (3 or 4 rows/columns), so they are constant time as well.
    - FINAL RUNTIME: **O(PF)** for P points and F frames
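    A simplified sketch of a single pinhole projection, assuming the pose is given as a 3×3 world-to-camera rotation `R` and translation `t`; the intrinsics (`FX`, `FY`, `CX`, `CY`) are assumed illustrative values, not taken from the codebase:

    ```java
    public class PointProjector {
        // Assumed pinhole intrinsics (focal lengths and principal point).
        static final double FX = 525.0, FY = 525.0, CX = 319.5, CY = 239.5;

        // Project a world-frame 3D point into pixel coordinates {u, v}.
        // Returns null when the point lies behind the image plane.
        public static double[] project(double[] p, double[][] R, double[] t) {
            // World -> camera frame: X_c = R * p + t (constant-size 3x3 multiply).
            double xc = R[0][0] * p[0] + R[0][1] * p[1] + R[0][2] * p[2] + t[0];
            double yc = R[1][0] * p[0] + R[1][1] * p[1] + R[1][2] * p[2] + t[1];
            double zc = R[2][0] * p[0] + R[2][1] * p[1] + R[2][2] * p[2] + t[2];
            if (zc <= 0) return null;                        // behind the camera
            // Perspective divide, then scale and shift into pixel coordinates.
            return new double[]{FX * xc / zc + CX, FY * yc / zc + CY};
        }
    }
    ```

    Every operation here is constant time, so projecting P points into F frames is O(PF), matching the bound above.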
3. **Bounding Box Point Collection**: the next step is finding whether each point (now projected onto the 2D frame) falls into a specific bounding box (which represents a potential object). This is done, similarly to downsampling, by first creating a 2D grid that represents the frame, then iterating through each bounding box and setting the value at each of its pixel locations to that bounding box's index in the list of bounding boxes. Then we iterate through each point, check whether a bounding box index has been set at that point's pixel location, and if so, place the point into a new PointSet, which represents an object.
    - the final product is a list of candidate PointSets (objects), each containing the points that were projected into its bounding box
    - although the number of bounding boxes is variable, the frame size caps it: there can be at most as many boxes as pixels, and the frame dimensions are known beforehand to be 480 × 640 (for the currently supported datasets). Therefore, the runtime does not depend on the number of bounding boxes, but only on the number of points and frames.
    - FINAL RUNTIME: **O(PF)** for P points and F frames
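    A hypothetical sketch of the grid-stamping approach (all names illustrative; boxes are `{xMin, yMin, xMax, yMax}` in pixels, points are projected `{u, v}` coordinates):

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class BoxCollector {
        // Stamp each box's index into a frame-sized grid, then assign each
        // projected point to a candidate set with a single O(1) grid lookup.
        public static List<List<double[]>> collect(int width, int height,
                                                   int[][] boxes, double[][] points) {
            int[][] grid = new int[height][width];
            for (int[] row : grid) Arrays.fill(row, -1);     // -1 = no box here
            for (int i = 0; i < boxes.length; i++) {          // stamp box indices
                for (int y = boxes[i][1]; y <= boxes[i][3]; y++)
                    for (int x = boxes[i][0]; x <= boxes[i][2]; x++)
                        grid[y][x] = i;
            }
            List<List<double[]>> candidates = new ArrayList<>();
            for (int i = 0; i < boxes.length; i++) candidates.add(new ArrayList<>());
            for (double[] p : points) {                       // O(1) lookup per point
                int x = (int) p[0], y = (int) p[1];
                if (x >= 0 && x < width && y >= 0 && y < height && grid[y][x] >= 0)
                    candidates.get(grid[y][x]).add(p);
            }
            return candidates;
        }
    }
    ```

    Stamping is bounded by the fixed pixel count of the frame, so per frame the cost is O(P) plus a constant, giving O(PF) overall.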
4. **Object Reconciliation**: the final step in the process is checking each candidate object against previous objects stored within the database. For each candidate object recovered within a frame, we iterate through each previously known object in the ObjectSet list, and iterate through each point in the candidate PointSet, keeping a count of how many points are found to be equal. If the number of points passes some heuristically-set percentage of the previously known object's size, we intersect the two PointSets. If not, we add the candidate PointSet to the list of objects.
    - for a single frame, we might already have N objects in the ObjectSet. For the current candidate PointSet, we iterate through each of those N objects and check each point in the candidate. Each check takes O(1), since PointSets are implemented as hash sets.
    - in the worst case, a constant number of new objects appears per frame (bounded by the frame size in pixels), so the ObjectSet grows linearly in F and the total number of object comparisons across all frames is proportional to F^2 for F frames
    - FINAL RUNTIME: **O(PF^2)**, for P points and F frames
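    A minimal sketch of the reconciliation loop, with points reduced to integer ids and a hypothetical match threshold (the actual heuristic percentage is not specified here):

    ```java
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ObjectReconciler {
        static final double MATCH_FRACTION = 0.3;  // heuristic threshold (assumed value)

        // Check a candidate PointSet against every known object; on sufficient
        // overlap, intersect the two sets; otherwise record a new object.
        public static void reconcile(List<Set<Integer>> objectSet, Set<Integer> candidate) {
            for (Set<Integer> known : objectSet) {
                int matches = 0;
                for (Integer id : candidate) {
                    if (known.contains(id)) matches++;    // O(1) per point via hashing
                }
                if (matches >= MATCH_FRACTION * known.size()) {
                    known.retainAll(candidate);           // same object: keep the intersection
                    return;
                }
            }
            objectSet.add(new HashSet<>(candidate));      // previously unseen object
        }
    }
    ```

    The outer loop over known objects is what introduces the extra factor of F: the ObjectSet grows linearly with frames, so each candidate's reconciliation costs O(F) set comparisons.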

In total, we are bounded by O(PF^2) for P points and F frames. It should be noted that P will almost always be greater than F^2 in a real-life situation, due to the sheer number of points collected for each frame.

# Code

The following links: