ChatAcadien

2024

RAG-based conversational agent for querying historical Acadian genealogical records (1700–1900).

About

ChatAcadien is an LLM-based conversational agent that provides interactive access to Acadian genealogical records from 1700–1900. It was built in collaboration with the Centre d'études acadiennes Anselme-Chiasson (CEAAC) at the University of Moncton (UdeM) and funded jointly by Mitacs and the UdeM Experiential Learning Service.

The core technical challenge was designing a retrieval strategy for genealogical corpora: multi-generational family records spanning hundreds of pages, where standard fixed-size chunking severs the contextual links between family members across generations. There was also very little existing literature or established practice on applying RAG to this type of structured genealogical document, so the design choices were largely exploratory. The solution we adopted is a Parent-Child Retriever: large parent chunks (7,000 characters) preserve intra-family narrative coherence, while smaller child chunks (1,024 characters with a 20-character overlap) are embedded and indexed for dense retrieval. At query time, the best-matching child chunks identify the relevant passages, and the system passes their full parent chunks to the LLM, keeping family relationships intact. A VoyageAI rerank-2 pass re-scores the retrieved candidates before generation.
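The parent-child idea can be illustrated with a minimal, dependency-free sketch. It is not the project's code. It assumes plain fixed-character windows for both chunk levels, and it replaces dense embeddings and the rerank pass with a toy word-overlap score. The names split_chars, build_index, score and retrieve_parents are hypothetical; only the chunk sizes come from the description above.

```python
from dataclasses import dataclass

PARENT_SIZE = 7000    # characters per parent chunk (from the description)
CHILD_SIZE = 1024     # characters per child chunk (from the description)
CHILD_OVERLAP = 20    # characters shared by consecutive child chunks


def split_chars(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters; each window repeats the
    last `overlap` characters of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


@dataclass
class Child:
    text: str
    parent_id: int  # position of the owning parent chunk


def build_index(document: str) -> tuple[list[str], list[Child]]:
    """Split into parents, then split every parent into linked children."""
    parents = split_chars(document, PARENT_SIZE)
    children = [
        Child(text=piece, parent_id=pid)
        for pid, parent in enumerate(parents)
        for piece in split_chars(parent, CHILD_SIZE, CHILD_OVERLAP)
    ]
    return parents, children


def score(query: str, passage: str) -> int:
    """Toy relevance (assumption): number of shared lowercase words.
    A real system would compare embedding vectors instead."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve_parents(query: str, parents: list[str], children: list[Child],
                     top_k: int = 4) -> list[str]:
    """Rank children, then return their parents once each, best first."""
    ranked = sorted(children, key=lambda c: score(query, c.text), reverse=True)
    parent_ids: list[int] = []
    for child in ranked[:top_k]:
        if child.parent_id not in parent_ids:
            parent_ids.append(child.parent_id)
    return [parents[pid] for pid in parent_ids]
```

Matching happens on the small chunks, where a name or date dominates the passage, while the LLM receives the surrounding 7,000-character window, so relatives mentioned a few paragraphs away from the match remain in context.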

Three dedicated Pinecone indices partition the knowledge base by domain: Acadian family records, institutional CEAAC information, and frequently asked questions. LangChain orchestrates tool-calling over these indices, with BraveSearch as a web-search fallback. GPT-4.1 (via OpenRouter) is the primary generator, with Claude 3.7 Sonnet as the fallback. The Streamlit frontend supports fully bilingual (French/English) streaming responses with multi-turn conversation memory. Conversation logs and user feedback are persisted in MongoDB. Deployment is containerized with Docker and automated via GitHub Actions CI/CD. The work was published at IEEE CASCON 2025.
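The two fallback relationships described above (web search behind the indices, a second model behind the primary generator) can be sketched as ordered fallback chains. This is a simplified illustration under stated assumptions: in the described system the LLM selects an index through tool-calling, whereas this sketch simply tries the indices in a fixed order. All function names and stub data are hypothetical, and no Pinecone, BraveSearch, or OpenRouter API is called.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> retrieved passages
Generator = Callable[[str], str]        # prompt -> generated answer


def retrieve_with_web_fallback(query: str,
                               indices: dict[str, Retriever],
                               web_search: Retriever) -> tuple[str, list[str]]:
    """Try each index in insertion order; use web search if all come back empty.
    (Assumption: fixed order stands in for LLM-driven tool selection.)"""
    for name, search in indices.items():
        hits = search(query)
        if hits:
            return name, hits
    return "web", web_search(query)


def generate_with_fallback(prompt: str, primary: Generator,
                           fallback: Generator) -> str:
    """Call the primary generator; switch to the fallback if it raises."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)


# --- Hypothetical stubs standing in for the real services ---
def family_index(query: str) -> list[str]:
    return ["Leblanc family record"] if "leblanc" in query.lower() else []


def ceaac_index(query: str) -> list[str]:
    return ["CEAAC opening hours"] if "ceaac" in query.lower() else []


def faq_index(query: str) -> list[str]:
    return []


def web_stub(query: str) -> list[str]:
    return [f"web result for: {query}"]


def failing_primary(prompt: str) -> str:
    raise RuntimeError("primary model unavailable")


def echo_fallback(prompt: str) -> str:
    return f"fallback: {prompt}"


INDICES = {"families": family_index, "ceaac": ceaac_index, "faq": faq_index}

if __name__ == "__main__":
    print(retrieve_with_web_fallback("Who was Marie Leblanc?", INDICES, web_stub))
    print(generate_with_fallback("Bonjour", failing_primary, echo_fallback))
```

Keeping retrieval and generation fallbacks as separate functions means either one can fail without affecting the other.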

Presentations

Tech Stack

Docker
GitHub Actions
LangChain
Pinecone
Streamlit
VoyageAI

Links