Master's Thesis
2025-2026

End-to-End Multimodal AI Agent for Autonomous Robotic Control
About
This Master's Thesis explores the design and implementation of an end-to-end multimodal AI agent for autonomous robotic control, focusing on the synergy between Large Language Models (LLMs) and Vision-Language-Action (VLA) models. The system lets operators control an SO-101 robotic arm through natural voice commands.

The interaction pipeline begins with voice activity detection (Silero VAD) and speech-to-text transcription with openai/whisper-large-v3-turbo. A LangGraph-based agent orchestrator interprets the intent, plans multi-step actions, and delegates execution to specialized tools. For physical manipulation, several VLA architectures (SmolVLA, Pi0.5, and GR00T) were trained with imitation learning on manually collected datasets via the LeRobot framework and benchmarked on a purpose-built test bench to select the best performer.

The agent operates in a closed loop, using visual feedback to validate task success and to autonomously trigger error correction or recovery when a manipulation fails. The agent's textual responses are synthesized back to the user with the local hexgrad/Kokoro-TTS model for fluid conversational feedback. Weights & Biases is used for experiment tracking and comparison of training runs.
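The closed-loop execute/validate/recover cycle described above can be sketched as a small control skeleton. This is a minimal illustration under stated assumptions, not the thesis implementation: the callables `execute_step` and `check_success` are hypothetical stand-ins for the VLA policy rollout and the visual-feedback validation tool, and the retry budget is an arbitrary example parameter.

```python
from typing import Callable, List

def run_task(
    steps: List[str],
    execute_step: Callable[[str], None],   # stand-in for the VLA policy rollout
    check_success: Callable[[str], bool],  # stand-in for visual-feedback validation
    max_retries: int = 2,
) -> bool:
    """Execute each planned step, validating with visual feedback and
    re-attempting a failed step up to max_retries times before giving up."""
    for step in steps:
        attempts = 0
        while True:
            execute_step(step)        # delegate manipulation to the policy
            if check_success(step):   # closed-loop check on the camera feed
                break                 # step validated, move to the next one
            attempts += 1
            if attempts > max_retries:
                return False          # recovery exhausted, escalate failure
    return True                       # all steps validated
```

In the actual agent, the retry branch would correspond to the orchestrator triggering an error-correction or recovery tool rather than a blind re-execution.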