
Documentation

Nafis Abeer: nafis@bu.edu

Rohan Kumar: roku@bu.edu

Zane Mroue: zanem@bu.edu

Samuel Gulinello: samgul@bu.edu

Sanford Edelist: edelist@bu.edu

Description

Visual Simultaneous Localization and Mapping (VSLAM) is the process of estimating a camera's position while building a map of the surrounding environment, using only visual input. This project builds on that process by also tracking objects within each frame. This introduces two problems: object detection, and the subsequent mapping and tracking of those objects in 3D space.

Implementation

For clarity, the general system framework is shown below:

graph LR;
    Z[Camera/Video] -->|Input| A
    A[VSLAM]-->|KeyFrames| B[YOLOv4];
    A-->|Features| C[Object Tracking];
    A-->|Features| E
    B-->|Object Detection|C;
    C-->|Objects| D[Database];
    D-->|Objects| E[GUI];
  • Camera/Video: currently we use prerecorded video for our examples and tests, but the system can easily be extended to real-time camera systems or drone footage
    • the output is the sequence of frames that makes up the video
  • VSLAM: using the MATLAB VSLAM algorithm, this stage takes the raw frames and does two things: it finds "features", distinctive points used for tracking, and it selects "keyframes", a subset of the full frame set that captures the most information about the video (camera movement between keyframes is what gives feature points their 3D positions)
    • the outputs are the keyframes and the features
  • YOLOv4: using a Java library, we run the YOLOv4 model, a convolutional neural network that takes a keyframe and finds a bounding box around each object it can discern (see the detection sketch after this list)
    • the output is a set of bounding boxes, one per detected object, for each keyframe
  • Object Tracking: the main algorithmic contribution of this project. This stage takes each object's bounding box (in 2D on a single frame) and the features (in 3D on the same frame), determines which features fall inside each bounding box, and then tries to reconcile the objects in the current frame with objects found in past frames. We solve this with a data structure called an ObjectSet: each new object is compared against every object already found, and if the two share more than a set percentage of features, they are merged and the database is updated accordingly (a sketch of this matching step appears after this list)
    • further explanation and runtime analysis are given in Appendix A
    • the output is an iteratively more accurate set of objects that corresponds to reality
  • Database: for ease of retrieving, updating, and storing objects and their corresponding features, we use a MongoDB database (see the storage sketch after this list)
    • the output is persistent storage for the objects produced by object tracking
  • GUI: for an outward-facing display of our work, we implemented a JavaScript UI that runs a server, so the system's output can be viewed in any browser
    • the output is a clean point-cloud view of the objects and features the camera has seen
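
To make the hand-off between YOLOv4 and object tracking concrete, below is a minimal sketch of the detection output and the feature-in-box test. The names (BoundingBox, Feature, FeatureAssigner) are illustrative assumptions for this sketch, not the detection library's actual API, and it assumes each feature carries both its pixel location in the keyframe and its triangulated 3D position.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative data contract between detection and tracking; these names are
// assumptions made for this sketch, not the detection library's actual API.
record BoundingBox(String label, double xMin, double yMin, double xMax, double yMax) {
    boolean contains(double u, double v) {
        return u >= xMin && u <= xMax && v >= yMin && v <= yMax;
    }
}

// A VSLAM feature: its pixel location (u, v) in the keyframe plus its triangulated 3D position.
record Feature(int id, double u, double v, double x, double y, double z) {}

class FeatureAssigner {
    /** Returns the ids of all features whose keyframe pixel location falls inside the box. */
    static List<Integer> featuresInBox(BoundingBox box, List<Feature> frameFeatures) {
        List<Integer> ids = new ArrayList<>();
        for (Feature f : frameFeatures) {
            if (box.contains(f.u(), f.v())) {
                ids.add(f.id());
            }
        }
        return ids;
    }
}
```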
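The matching-and-merging step of the ObjectSet can then be summarized as follows. This is a simplified sketch under assumed names (TrackedObject, SIMILARITY_THRESHOLD); the real implementation also writes the merge back to the database and may use a different similarity measure or threshold.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the ObjectSet merge logic; class names and the
// similarity threshold are assumptions made for illustration.
class TrackedObject {
    final Set<Integer> featureIds = new HashSet<>(); // features seen inside this object's bounding boxes

    TrackedObject(Set<Integer> featureIds) {
        this.featureIds.addAll(featureIds);
    }
}

class ObjectSet {
    private static final double SIMILARITY_THRESHOLD = 0.3; // assumed fraction of shared features needed to merge
    private final List<TrackedObject> objects = new ArrayList<>();

    /** Adds a newly detected object, merging it into an existing object if enough features overlap. */
    void addOrMerge(TrackedObject candidate) {
        for (TrackedObject existing : objects) {
            if (similarity(existing, candidate) >= SIMILARITY_THRESHOLD) {
                existing.featureIds.addAll(candidate.featureIds); // same real-world object seen again
                return;                                           // (the database entry would be updated here)
            }
        }
        objects.add(candidate); // no match: a new real-world object
    }

    /** Fraction of the candidate's features that already appear in the existing object. */
    private static double similarity(TrackedObject existing, TrackedObject candidate) {
        if (candidate.featureIds.isEmpty()) return 0.0;
        long shared = candidate.featureIds.stream().filter(existing.featureIds::contains).count();
        return (double) shared / candidate.featureIds.size();
    }

    int size() {
        return objects.size();
    }
}
```

Note that this naive version compares each new object against every stored object, so insertion cost grows with the number of tracked objects; the runtime and space analysis of the actual structure belongs in Appendix A.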
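Storing a tracked object in MongoDB can look roughly like the snippet below, using the official MongoDB Java driver. The connection string, database and collection names, and the document fields are placeholders for illustration, not the project's actual schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

import java.util.List;

// Minimal sketch of storing one tracked object with the MongoDB Java driver.
// Connection string, database/collection names, and document fields are placeholders.
public class ObjectStore {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> objects =
                    client.getDatabase("vslam").getCollection("objects");

            Document chair = new Document("_id", 42)              // stable id assigned by the ObjectSet
                    .append("label", "chair")                     // class label from YOLOv4
                    .append("featureIds", List.of(101, 102, 103)) // features associated with the object
                    .append("centroid", List.of(1.2, 0.4, 3.7));  // approximate 3D position

            // Upsert so the same object is updated in place as tracking refines it.
            objects.replaceOne(Filters.eq("_id", 42), chair, new ReplaceOptions().upsert(true));
        }
    }
}
```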

Features

need to fill in this area

Code

The following links:

Work Breakdown

Nafis Abeer:

Rohan Kumar:

Zane Mroue:

Samuel Gulinello:

Sanford Edelist:

Appendix

A: Runtime and Space Analysis of ObjectSet

TODO

B: References and Material Used

need to fill in references and also ALL LIBRARIES USED (MATLAB, YOLO, Javascript stuff, etc)
