About
Hi, I’m Yuto Imai 👋
I’m a Vision and Language Researcher and Machine Learning Engineer based in Japan, currently pursuing my Master’s degree at Keio University. I’m passionate about developing intelligent systems that understand both vision and language and applying them to real-world problem solving.
🚀 What I Do
I specialize in the intersection of computer vision and natural language processing:
- Evaluation of MLLMs: Creating fair evaluation codebases for multimodal models on omnidirectional and perspective images
- Multimodal AI: Developing vision-language models that can understand and reason about both images and text
- Cross-Modal Retrieval: Building systems that can search for objects using natural language descriptions (see the sketch after this list)
- Referring Expression Comprehension: Enabling models to identify specific objects in images based on textual descriptions
- Robotics Applications: Applying multimodal AI to indoor navigation and object manipulation
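
To give a flavor of the cross-modal retrieval setting above, here is a minimal, hypothetical sketch of text-to-image object search with CLIP via Hugging Face `transformers`. The checkpoint, query, and image paths are placeholders for illustration only, not code from any of my projects.

```python
# Minimal sketch: rank candidate images against a natural-language query with CLIP.
# Illustrative only -- the model name, query, and image paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["shelf.jpg", "desk.jpg", "kitchen.jpg"]]  # candidate scenes
query = "a red mug next to a laptop"                                        # language description

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds the query's similarity to each candidate image.
best = outputs.logits_per_text.softmax(dim=-1).argmax().item()
print(f"Best match: image #{best}")
```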
🎯 Current Focus
- 🔬 Research: Building a panoramic-image VQA benchmark for MLLMs
- 💼 Professional: Research intern at SB Intuitions Corp.
- 🌱 Learning: Exploring advanced applications of diffusion models and large language models in robotics
🏆 Highlights
- 📝 Publications: 6+ research papers in top AI/Robotics conferences (JSAI, RSJ, ICRA)
- 🏅 Awards: Winner of the DialFRED Challenge at CVPR 2023; multiple research excellence awards
- 🎯 Recognition: Top 1% research award from the Robotics Society of Japan (800+ submissions)
- 🎓 Education: Master’s at Keio University with significant scholarship support
🎓 Academic Background
Current Research Areas:
- Dense text-based multimodal object retrieval
- Cross-lingual visual prompting for daily object search
- Diffusion models for referring expression segmentation
- Large language model integration in robotic systems
Teaching Experience:
- Instructor for an Advanced Machine Learning course (Diffusion Models)
- Guest lecturer at Yokohama Science Frontier High School
📬 Let’s Connect!
I’m always excited to discuss research collaborations, innovative projects, or just chat about the latest developments in multimodal AI!
- 📧 Email: ytim8812@keio.jp
- 🐙 GitHub: yutojubako
- 📝 Technical Writing: Zenn.dev
💭 Fun fact: I’m probably one of the few researchers who can debug a neural network in the afternoon and run sound for a theater production in the evening! The skills actually complement each other more than you’d think.
Last updated: October 2025