
Overview
Trace is a financial assistant built on a production-grade Retrieval-Augmented Generation (RAG) pipeline. It tackles the unstructured-data problem in personal finance by converting raw receipt images into type-safe JSON. The platform supports natural-language queries over that financial data through a Hybrid Engine that fuses dense vector retrieval with sparse keyword matching; responses are streamed to a modern React 19 frontend.
Architecture
The ingestion pipeline extracts text via PaddleOCR, formats it into type-safe JSON using Instructor, and applies item-level chunking. For retrieval, async FastAPI endpoints query both ChromaDB and BM25 indices, merging the results before an ONNX-optimized Cross-Encoder reranks candidates for the final streamed response.
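The retrieval path maps naturally onto a single async FastAPI endpoint. Below is a minimal sketch of that retrieve → rerank → stream flow; the helper functions (`hybrid_retrieve`, `rerank`, `generate_stream`) are illustrative stubs, not the project's actual code.

```python
# Minimal sketch of the retrieve -> rerank -> stream flow. All helpers are
# illustrative stubs standing in for the real pipeline stages.
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def hybrid_retrieve(query: str) -> list[str]:
    # Stub: in the real pipeline, query ChromaDB (dense) and a BM25 index
    # (sparse), then merge the two candidate lists.
    return ["chunk about groceries", "chunk about utilities"]

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Stub: score (query, candidate) pairs with the ONNX Cross-Encoder.
    return candidates[:top_k]

async def generate_stream(query: str, context: list[str]) -> AsyncIterator[str]:
    # Stub: yield tokens from the SLM as they are generated.
    for token in ("You ", "spent ", "$42.10 ", "on ", "groceries."):
        yield token

@app.get("/query")
async def query_endpoint(q: str) -> StreamingResponse:
    candidates = await hybrid_retrieve(q)
    context = rerank(q, candidates)
    return StreamingResponse(generate_stream(q, context), media_type="text/plain")
```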
Key Features
Hybrid Retrieval & Reranking
Achieved 91%+ Context Precision (benchmarked via RAGAS) by combining dense (semantic) and sparse (BM25) retrieval with a Cross-Encoder reranking layer and Parent Document Retrieval built on item-level chunking, optimizing context-window relevance.
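This README doesn't specify how the dense and sparse result lists are merged; one common choice is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than the project's confirmed strategy.

```python
# Reciprocal Rank Fusion (RRF): one common way to merge dense and sparse
# rankings. This is an assumed strategy, not necessarily the one Trace uses.

def reciprocal_rank_fusion(
    dense_ids: list[str], sparse_ids: list[str], k: int = 60
) -> list[str]:
    """Fuse two ranked ID lists; larger k flattens the rank contribution."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# dense_ids would come from a ChromaDB query, sparse_ids from a BM25 index.
print(reciprocal_rank_fusion(["r12", "r07", "r33"], ["r07", "r41", "r12"]))
# -> ['r07', 'r12', 'r41', 'r33']
```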
OCR & Structuring
Utilizes PaddleOCR combined with Instructor (Pydantic) to extract and validate type-safe JSON from receipts, featuring real-time OCR coordinate visualization on the frontend.
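As a minimal sketch of the Instructor pattern, assuming an OpenAI-compatible server in front of the local model; the `Receipt` fields, endpoint, and model name are illustrative, not the project's actual schema.

```python
# Sketch of the structuring step: Instructor patches the client so the LLM's
# output is parsed and validated against a Pydantic schema (with retries).
# The Receipt fields, endpoint, and model name are illustrative assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class LineItem(BaseModel):
    name: str
    quantity: int
    price: float

class Receipt(BaseModel):
    merchant: str
    date: str
    total: float
    items: list[LineItem]

# Assumes an OpenAI-compatible server fronting the local quantized Phi-3.5.
client = instructor.from_openai(OpenAI(base_url="http://localhost:8080/v1", api_key="none"))

ocr_text = "TRADER JOES\n2024-11-02\nBANANAS 3 1.50\nTOTAL 1.50"
receipt = client.chat.completions.create(
    model="phi-3.5",
    response_model=Receipt,  # Instructor validates/retries until this parses
    messages=[{"role": "user", "content": f"Extract this receipt:\n{ocr_text}"}],
)
print(receipt.model_dump_json(indent=2))
```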
Quantized SLM Inference
Runs quantized Phi-3.5 (4-bit) and Cross-Encoder models via ONNX Runtime to eliminate heavy PyTorch dependencies, ensuring fast inference.
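A sketch of what Cross-Encoder scoring can look like on the ONNX path; the model path and tokenizer checkpoint are assumptions, since the project's export details aren't shown here.

```python
# Sketch of Cross-Encoder scoring via onnxruntime, avoiding a PyTorch runtime
# dependency. The model path and tokenizer checkpoint are assumptions.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
session = ort.InferenceSession("models/cross_encoder.onnx")  # pre-exported graph

def score(query: str, passages: list[str]) -> np.ndarray:
    """Return one relevance logit per (query, passage) pair."""
    enc = tokenizer([query] * len(passages), passages,
                    padding=True, truncation=True, return_tensors="np")
    input_names = {i.name for i in session.get_inputs()}
    feeds = {k: v for k, v in enc.items() if k in input_names}
    (logits,) = session.run(None, feeds)
    return logits.squeeze(-1)

# Higher logit = more relevant; used to pick the final top-k context.
```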
Containerized & Privacy-First
Deployed a fully containerized stack using Docker Compose with persistent volumes for vector stores and model weights, ensuring 100% data privacy and offline capability.
Tech Stack
Backend: Python, FastAPI (async endpoints)
Frontend: React 19
Database: ChromaDB (dense vectors), BM25 index (sparse keyword search)
AI/ML: PaddleOCR, Instructor (Pydantic), Phi-3.5 (4-bit quantized), ONNX Runtime, Cross-Encoder reranking, RAGAS (evaluation)
DevOps: Docker Compose with persistent volumes
Challenges & Solutions
Challenge: Semantic search alone failed on exact keyword matches (e.g., retrieving specific receipt totals or merchant names).
Solution: Architected a Hybrid RAG pipeline combining dense vector search (ChromaDB) with sparse keyword search (BM25).
Challenge: Token usage and context windows were bloated by passing entire receipt documents into the LLM.
Solution: Implemented semantic chunking by deconstructing receipts into 'Full Document' vectors for high-level summaries and 'Item Granularity' vectors for line-item accuracy (see the sketch below).
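A sketch of that two-granularity split, reusing the illustrative `Receipt`/`LineItem` models from the OCR & Structuring sketch above; the ID scheme and text templates are assumptions.

```python
# Sketch of two-granularity chunking: one parent chunk per receipt plus one
# child chunk per line item, linked for Parent Document Retrieval. Reuses the
# illustrative Receipt/LineItem models from the OCR & Structuring sketch.

def chunk_receipt(receipt: "Receipt") -> list[dict]:
    parent_id = f"{receipt.merchant}:{receipt.date}"
    chunks = [{
        "id": parent_id,
        "text": (f"Receipt from {receipt.merchant} on {receipt.date}, "
                 f"total ${receipt.total:.2f}"),
        "parent_id": None,  # top-level 'Full Document' summary chunk
    }]
    for i, item in enumerate(receipt.items):
        chunks.append({
            "id": f"{parent_id}#item-{i}",
            "text": (f"{item.quantity} x {item.name} at ${item.price:.2f} "
                     f"({receipt.merchant}, {receipt.date})"),
            "parent_id": parent_id,  # child resolves back to the full receipt
        })
    return chunks
```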
Challenge: Full-size LLMs and heavy framework dependencies were too expensive and slow for per-receipt processing at scale.
Solution: Engineered a cost-effective ingestion pipeline using a quantized SLM (Phi-3.5, 4-bit), reducing ingestion costs by 90%.