RobotVoiceControl

2025

LLM-driven Agent for Industrial Robot Control. Controls a KUKA robot via natural language (text & speech).

Screenshot of RobotVoiceControl

About

RobotVoiceControl is an LLM-powered agentic system that lets operators control a KUKA industrial robot through natural language, both voice and text. The pipeline has five stages: Voice Activity Detection captures and filters speech, Gemini 2.5 Flash transcribes the audio to text, a LangGraph-based agent orchestrates intent understanding and multi-step task planning, robot control tools execute the actions, and Azure TTS synthesizes the agent's reply back to speech for fluid bidirectional communication.
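The five stages above can be sketched as a simple chain of handlers. Everything here is an illustrative stub, not the project's API: in the real system the stand-ins are VAD, Gemini 2.5 Flash speech-to-text, the LangGraph agent with its robot tools, and Azure TTS.

```python
from typing import Optional

def detect_voice(audio: str) -> Optional[str]:
    """Stage 1: Voice Activity Detection -- pass only frames containing speech."""
    return audio if "speech" in audio else None

def transcribe(audio: str) -> str:
    """Stage 2: speech-to-text (Gemini 2.5 Flash in the real system)."""
    return audio.replace("speech:", "").strip()

def plan_and_act(text: str) -> str:
    """Stages 3-4: the agent reasons over the request and calls robot tools."""
    return f"Executed command: {text}"

def synthesize(reply: str) -> str:
    """Stage 5: text-to-speech (Azure TTS in the real system)."""
    return f"<audio>{reply}</audio>"

def run_pipeline(audio: str) -> Optional[str]:
    speech = detect_voice(audio)
    if speech is None:
        return None  # silence or noise: nothing reaches the agent
    return synthesize(plan_and_act(transcribe(speech)))

print(run_pipeline("speech: move 50 mm up"))
# -> <audio>Executed command: move 50 mm up</audio>
```

The VAD gate at the front matters in practice: it keeps transcription and agent calls from being triggered by factory-floor noise.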

The LangGraph graph has three nodes: a chatbot node (Gemini 2.5 Flash via OpenRouter) that reasons over conversation history and decides which tools to call, a tools node that executes those calls and feeds results back to the chatbot, and a normalization node (Gemini 2.5 Flash Lite) that cleans the final reply before it is passed to TTS — expanding abbreviations, stripping markdown, and standardizing units. A MemorySaver checkpointer keeps the full conversation thread in memory for context-aware multi-turn interactions.
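The normalization node's cleanup step can be sketched as a small text filter. The rules and the abbreviation table below are assumptions for illustration; the project delegates this to Gemini 2.5 Flash Lite rather than regexes.

```python
import re

# Assumed abbreviation and unit tables -- placeholders, not the project's prompt.
ABBREVIATIONS = {"TCP": "tool center point", "PTP": "point-to-point", "LIN": "linear"}
UNITS = {"mm": "millimeters", "deg": "degrees"}

def normalize_for_tts(text: str) -> str:
    # Strip common markdown markers (bold, italics, inline code, headings).
    text = re.sub(r"[*_`#]+", "", text)
    # Expand robot-domain abbreviations so TTS pronounces them naturally.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b", full, text)
    # Spell out unit suffixes after numbers, e.g. "50 mm" -> "50 millimeters".
    for unit, full in UNITS.items():
        text = re.sub(rf"(\d)\s*{unit}\b", rf"\1 {full}", text)
    return text.strip()

print(normalize_for_tts("**Moved** TCP by 50 mm"))
# -> Moved tool center point by 50 millimeters
```

Using an LLM for this step instead of fixed rules lets the system handle phrasing it has never seen, at the cost of one extra (cheap) model call per reply.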

The agent is equipped with eight tools:

- get_tech_doc retrieves the robot's technical specification sheet for capability questions.
- send_movement_command moves the TCP by relative Cartesian offsets in X/Y/Z and orientation A/B/C (mm and degrees), supporting both PTP and LIN motion types.
- send_joint_movement_command rotates one or more of the six joints by relative angles in degrees, with automatic ±180° normalization.
- send_robot_to_initial_home_position drives the arm to a predefined safe home pose.
- get_current_position and get_current_joint_positions read live state from the controller via py-openshowvar over Telnet.
- check_detected_objects triggers the Cognex vision system via HTTP/FTP, fetches pattern-match results, and returns detected object names with their XY coordinates and angles.
- send_pick_and_place_command orchestrates the full pick-and-place sequence: it triggers the camera, resolves pick coordinates from the detected pattern, computes the correct Z height per object type, drives the arm to the pick point, activates the gripper, then moves to the target bin or conveyor.

All physical commands are sent as KRL (KUKA Robot Language) instructions over a Telnet connection, with position verification after each move and automatic error reporting if the robot fails to reach its target.
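The ±180° normalization mentioned for joint commands can be sketched as follows. The exact wrapping convention the project uses is an assumption; this version maps any relative angle into (-180, 180], so the joint always takes the shorter rotation.

```python
def normalize_angle(deg: float) -> float:
    """Wrap a relative joint angle in degrees into the range (-180, 180]."""
    wrapped = deg % 360.0      # Python's % yields a result in [0, 360)
    if wrapped > 180.0:
        wrapped -= 360.0       # prefer the equivalent shorter rotation
    return wrapped

print(normalize_angle(270.0))   # -> -90.0  (rotate -90° instead of +270°)
print(normalize_angle(-190.0))  # -> 170.0
```

Without this step, a request like "rotate joint 1 by 270 degrees" would sweep the long way around and could exceed a joint's travel limits.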

Tech Stack

KUKA
LangGraph
YOLO (Ultralytics)

Links