Kshama Nitin Shah
I am a second-year EECS master's student at the University of Michigan. I'm also currently working as a Research Assistant advised by Justin Johnson and Andrew Owens. My research focuses on Computer Vision, Machine Learning, and Deep Learning.
I completed my undergraduate degree in Electronics and Communication Engineering from BITS, Dubai.
Feel free to email me if you're interested in collaborating on research related to Computer Vision or Machine Learning.
Email / CV / Github / Transcript
Research
My research interests broadly lie in image and video understanding and multimodal learning. In particular, I'm interested in developing self-supervised computer vision models that learn from multimodal signals, especially natural language paired with image data.
Self-Supervised Object Detection with Multimodal Image Captioning
Course Project, EECS 545: Machine Learning, Winter 2022. Advisor: Honglak Lee
Report / Poster / Code
We developed a novel self-supervised pipeline that uses natural language supervision as a pre-training task to localize objects in an image.
We also developed specialized prompts and heatmap visualizations to generate pseudo-ground-truth object classes and bounding-box coordinates.
After fine-tuning with a small number of supervised labels, the model performed on par with other semi-supervised object detection models.
The best-performing model, trained with an FCOS object detector, achieved an mAP of 21.57%.
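The pseudo-labeling step can be illustrated with a minimal sketch (not the project code; the heatmap source, threshold, and function names here are assumptions): given a class-activation heatmap obtained by prompting the pre-trained model with a class name, threshold it and take the bounding box of the strongest activated region as the pseudo ground-truth box.

```python
# Minimal illustrative sketch: turning a class-activation heatmap into a
# pseudo ground-truth bounding box for one object class.
import numpy as np
from scipy import ndimage

def heatmap_to_pseudo_box(heatmap: np.ndarray, rel_threshold: float = 0.5):
    """Return (x_min, y_min, x_max, y_max) covering the strongest activated region."""
    # Normalize to [0, 1] and keep only pixels above a fraction of the peak.
    hm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    mask = hm >= rel_threshold

    # Label connected components and keep the one with the largest total activation.
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sums = ndimage.sum(hm, labels, index=range(1, n + 1))
    best = 1 + int(np.argmax(sums))
    ys, xs = np.where(labels == best)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

if __name__ == "__main__":
    # Synthetic heatmap peaked around (x=150, y=80) stands in for a real one.
    yy, xx = np.mgrid[0:224, 0:224]
    fake_heatmap = np.exp(-(((xx - 150) ** 2 + (yy - 80) ** 2) / (2 * 20.0 ** 2)))
    print(heatmap_to_pseudo_box(fake_heatmap))  # box roughly centered on (150, 80)
```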
Language Supervised Vision Pre-Training for Fine-grained Food Classification
Course Project, EECS 598-008: Deep Learning for Computer Vision, Winter 2022. Advisor: Justin Johnson
Report
Worked on fine-grained food classification by leveraging a vision-and-language pre-trained model trained on sub-reddits from the RedCaps dataset.
Achieved 20% top-5 classification accuracy under zero-shot transfer.
MC-VQA Using Customized Prompts
Course Project, EECS 595: Natural Language Processing, Fall 2022. Advisor: Joyce Chai
Report
Developed a novel pipeline for zero-shot Visual Question Answering by combining large pre-trained models such as CLIP and the T5 transformer.
We did this by converting questions into declarative statements and scoring them, together with the multiple-choice answers, against the image with CLIP to perform zero-shot VQA.
Achieved a total accuracy of 49.5%, comparable to the current SOTA in zero-shot VQA and better than the previous SOTA.
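A minimal sketch of the CLIP scoring step (illustrative only, not the project code; it assumes the question has already been rewritten into one declarative statement per answer choice, which the project did with a T5 model): CLIP ranks the statements by image-text similarity and the top-scoring choice is taken as the answer.

```python
# Illustrative sketch: pick the multiple-choice answer whose declarative
# statement CLIP scores as most compatible with the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def answer_mcq(image: Image.Image, statements: list[str]) -> int:
    """Return the index of the statement CLIP finds most compatible with the image."""
    inputs = processor(text=statements, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_statements): image-text similarity scores.
    return int(outputs.logits_per_image.argmax(dim=-1).item())

# Example: "What animal is on the couch?" with choices {cat, dog, rabbit},
# rewritten as declarative statements.
image = Image.open("example.jpg")  # hypothetical image path
statements = [
    "A cat is on the couch.",
    "A dog is on the couch.",
    "A rabbit is on the couch.",
]
print("Predicted choice:", answer_mcq(image, statements))
```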
A Monocular Local Mapper for Urban Scenes
Course Project, EECS 442: Computer Vision, Fall 2021. Advisor: Andrew Owens
Report / Code
Developed a model that performs object detection, semantic segmentation, and depth estimation simultaneously, built on a U-Net, YOLO-v1, and a MobileNet-v3 feature extractor.
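A rough sketch of the multi-task layout (not the project code; the head designs, channel sizes, and class counts here are simplified assumptions): a shared MobileNet-v3 feature extractor feeds a YOLO-v1-style detection head plus dense segmentation and depth heads.

```python
# Illustrative multi-task sketch: one shared backbone, three task heads.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class MonocularLocalMapper(nn.Module):
    def __init__(self, num_classes: int = 19, grid: int = 7, boxes: int = 2):
        super().__init__()
        self.backbone = mobilenet_v3_small(weights=None).features  # shared features
        feat_ch = 576  # channels of mobilenet_v3_small's final feature map

        # YOLO-v1-style detection head: per-cell box parameters + class scores.
        self.detect_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(grid),
            nn.Conv2d(feat_ch, boxes * 5 + num_classes, kernel_size=1),
        )
        # Dense heads: per-pixel class logits and a single depth channel,
        # upsampled back to the input resolution in forward().
        self.seg_head = nn.Conv2d(feat_ch, num_classes, kernel_size=1)
        self.depth_head = nn.Conv2d(feat_ch, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.backbone(x)

        def up(t):
            return nn.functional.interpolate(t, size=(h, w), mode="bilinear", align_corners=False)

        return {
            "detection": self.detect_head(feats),      # (B, 5*boxes + classes, grid, grid)
            "segmentation": up(self.seg_head(feats)),  # (B, classes, H, W)
            "depth": up(self.depth_head(feats)),       # (B, 1, H, W)
        }

if __name__ == "__main__":
    # Forward pass on a dummy image batch to show the output shapes.
    model = MonocularLocalMapper()
    out = model(torch.randn(1, 3, 224, 224))
    print({k: tuple(v.shape) for k, v in out.items()})
```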