About
Hi, I’m Yuto Imai 👋
I’m a Vision and Language Researcher and Machine Learning Engineer based in Japan, currently pursuing my Master’s degree at Keio University. I’m passionate about developing intelligent systems that understand both vision and language and applying them to real-world problem solving.
🚀 What I Do
I specialize in the intersection of computer vision and natural language processing:
- Evaluation of MLLMs: Creating fair evaluation codebases for multimodal models on omnidirectional and perspective images
- Multimodal AI: Developing vision-language models that can understand and reason about both images and text
- Cross-Modal Retrieval: Building systems that can search for objects using natural language descriptions (see the sketch after this list)
- Referring Expression Comprehension: Enabling models to identify specific objects in images based on textual descriptions
- Robotics Applications: Applying multimodal AI to indoor navigation and object manipulation
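
To give a flavor of the cross-modal retrieval setting above, here is a minimal, hypothetical sketch of text-to-image object search with CLIP via Hugging Face `transformers`. The checkpoint, query, and image paths are placeholders for illustration only, not code from any of my projects.

```python
# Minimal sketch: rank candidate images against a natural-language query with CLIP.
# Illustrative only -- the model name, query, and image paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["shelf.jpg", "desk.jpg", "kitchen.jpg"]]  # candidate scenes
query = "a red mug next to a laptop"                                        # language description

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds the query's similarity to each candidate image.
best = outputs.logits_per_text.softmax(dim=-1).argmax().item()
print(f"Best match: image #{best}")
```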
🎯 Current Focus
- 🔬 Research: Building a panoramic-image VQA benchmark for MLLMs
- 💼 Professional: Research intern at SB Intuitions Corp.
- 🌱 Learning: Exploring advanced applications of diffusion models and large language models in robotics
🏆 Highlights
- 📝 Publications: 6+ research papers in top AI/Robotics conferences (JSAI, RSJ, ICRA)
- 🏅 Awards: Winner of the DialFRED Challenge at CVPR 2023; multiple research excellence awards
- 🎯 Recognition: Top 1% research award from the Robotics Society of Japan (800+ submissions)
- 🎓 Education: Master’s at Keio University with significant scholarship support
🎓 Academic Background
Current Research Areas:
- Dense text-based multimodal object retrieval
- Cross-lingual visual prompting for daily object search
- Diffusion models for referring expression segmentation
- Large language model integration in robotic systems
Teaching Experience:
- Instructor for an Advanced Machine Learning course (Diffusion Models)
- Guest lecturer at Yokohama Science Frontier High School
📬 Let’s Connect!
I’m always excited to discuss research collaborations, innovative projects, or just chat about the latest developments in multimodal AI!
- 📧 Email: ytim8812@keio.jp
- 🐙 GitHub: yutojubako
- 📝 Technical Writing: Zenn.dev
💭 Fun fact: I’m probably one of the few researchers who can debug a neural network in the afternoon and run sound for a theater production in the evening! The skills actually complement each other more than you’d think.
Last updated: October 2025