
Overview
Trace is a financial assistant built on a production-grade Retrieval-Augmented Generation (RAG) pipeline. It tackles the unstructured-data problem in personal finance by converting raw receipt images into type-safe JSON. The platform supports natural-language queries over that financial data through a Hybrid Engine that fuses dense vector retrieval with sparse keyword matching; responses are streamed to a modern React 19 frontend.
Architecture
The ingestion pipeline extracts text via PaddleOCR, formats it into type-safe JSON using Instructor, and applies item-level chunking. For retrieval, async FastAPI endpoints query both ChromaDB and BM25 indices, merging the results before an ONNX-optimized Cross-Encoder reranks candidates for the final streamed response.
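The retrieval path maps naturally onto a single async FastAPI endpoint. Below is a minimal sketch of that retrieve → rerank → stream flow; the helper functions (`hybrid_retrieve`, `rerank`, `generate_stream`) are illustrative stubs, not the project's actual code.

```python
# Minimal sketch of the retrieve -> rerank -> stream flow. All helpers are
# illustrative stubs standing in for the real pipeline stages.
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def hybrid_retrieve(query: str) -> list[str]:
    # Stub: in the real pipeline, query ChromaDB (dense) and a BM25 index
    # (sparse), then merge the two candidate lists.
    return ["chunk about groceries", "chunk about utilities"]

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Stub: score (query, candidate) pairs with the ONNX Cross-Encoder.
    return candidates[:top_k]

async def generate_stream(query: str, context: list[str]) -> AsyncIterator[str]:
    # Stub: yield tokens from the SLM as they are generated.
    for token in ("You ", "spent ", "$42.10 ", "on ", "groceries."):
        yield token

@app.get("/query")
async def query_endpoint(q: str) -> StreamingResponse:
    candidates = await hybrid_retrieve(q)
    context = rerank(q, candidates)
    return StreamingResponse(generate_stream(q, context), media_type="text/plain")
```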
Key Features
Hybrid Retrieval & Reranking
Achieved 91%+ Context Precision (benchmarked via RAGAS) by combining dense (semantic) and sparse (BM25) retrieval with a Cross-Encoder reranking layer and Parent Document Retrieval built on item-level chunking, optimizing context-window relevance.
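This README doesn't specify how the dense and sparse result lists are merged; one common choice is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than the project's confirmed strategy.

```python
# Reciprocal Rank Fusion (RRF): one common way to merge dense and sparse
# rankings. This is an assumed strategy, not necessarily the one Trace uses.

def reciprocal_rank_fusion(
    dense_ids: list[str], sparse_ids: list[str], k: int = 60
) -> list[str]:
    """Fuse two ranked ID lists; larger k flattens the rank contribution."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# dense_ids would come from a ChromaDB query, sparse_ids from a BM25 index.
print(reciprocal_rank_fusion(["r12", "r07", "r33"], ["r07", "r41", "r12"]))
# -> ['r07', 'r12', 'r41', 'r33']
```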
OCR & Structuring
Utilizes PaddleOCR combined with Instructor (Pydantic) to extract and validate type-safe JSON from receipts, featuring real-time OCR coordinate visualization on the frontend.
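As a minimal sketch of the Instructor pattern, assuming an OpenAI-compatible server in front of the local model; the `Receipt` fields, endpoint, and model name are illustrative, not the project's actual schema.

```python
# Sketch of the structuring step: Instructor patches the client so the LLM's
# output is parsed and validated against a Pydantic schema (with retries).
# The Receipt fields, endpoint, and model name are illustrative assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class LineItem(BaseModel):
    name: str
    quantity: int
    price: float

class Receipt(BaseModel):
    merchant: str
    date: str
    total: float
    items: list[LineItem]

# Assumes an OpenAI-compatible server fronting the local quantized Phi-3.5.
client = instructor.from_openai(OpenAI(base_url="http://localhost:8080/v1", api_key="none"))

ocr_text = "TRADER JOES\n2024-11-02\nBANANAS 3 1.50\nTOTAL 1.50"
receipt = client.chat.completions.create(
    model="phi-3.5",
    response_model=Receipt,  # Instructor validates/retries until this parses
    messages=[{"role": "user", "content": f"Extract this receipt:\n{ocr_text}"}],
)
print(receipt.model_dump_json(indent=2))
```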
Quantized SLM Inference
Runs quantized Phi-3.5 (4-bit) and Cross-Encoder models via ONNX Runtime to eliminate heavy PyTorch dependencies, ensuring fast inference.
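A sketch of what Cross-Encoder scoring can look like on the ONNX path; the model path and tokenizer checkpoint are assumptions, since the project's export details aren't shown here.

```python
# Sketch of Cross-Encoder scoring via onnxruntime, avoiding a PyTorch runtime
# dependency. The model path and tokenizer checkpoint are assumptions.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
session = ort.InferenceSession("models/cross_encoder.onnx")  # pre-exported graph

def score(query: str, passages: list[str]) -> np.ndarray:
    """Return one relevance logit per (query, passage) pair."""
    enc = tokenizer([query] * len(passages), passages,
                    padding=True, truncation=True, return_tensors="np")
    input_names = {i.name for i in session.get_inputs()}
    feeds = {k: v for k, v in enc.items() if k in input_names}
    (logits,) = session.run(None, feeds)
    return logits.squeeze(-1)

# Higher logit = more relevant; used to pick the final top-k context.
```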
Containerized & Privacy-First
Deployed a fully containerized stack using Docker Compose with persistent volumes for vector stores and model weights, ensuring 100% data privacy and offline capability.
Tech Stack
Backend: Python, FastAPI (async endpoints)
Frontend: React 19
Database: ChromaDB (dense vectors), BM25 index (sparse keyword search)
AI/ML: PaddleOCR, Instructor (Pydantic), Phi-3.5 (4-bit quantized), ONNX Runtime, Cross-Encoder reranking, RAGAS (evaluation)
DevOps: Docker Compose with persistent volumes
Challenges & Solutions
Challenge: Semantic search alone failed on exact keyword matches (e.g., retrieving specific receipt totals or merchant names).
Solution: Architected a Hybrid RAG pipeline combining dense vector search (ChromaDB) with sparse keyword search (BM25).
Challenge: Token usage and context windows were bloated by passing entire receipt documents into the LLM.
Solution: Implemented semantic chunking by deconstructing receipts into 'Full Document' vectors for high-level summaries and 'Item Granularity' vectors for line-item accuracy (see the sketch below).
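A sketch of that two-granularity split, reusing the illustrative `Receipt`/`LineItem` models from the OCR & Structuring sketch above; the ID scheme and text templates are assumptions.

```python
# Sketch of two-granularity chunking: one parent chunk per receipt plus one
# child chunk per line item, linked for Parent Document Retrieval. Reuses the
# illustrative Receipt/LineItem models from the OCR & Structuring sketch.

def chunk_receipt(receipt: "Receipt") -> list[dict]:
    parent_id = f"{receipt.merchant}:{receipt.date}"
    chunks = [{
        "id": parent_id,
        "text": (f"Receipt from {receipt.merchant} on {receipt.date}, "
                 f"total ${receipt.total:.2f}"),
        "parent_id": None,  # top-level 'Full Document' summary chunk
    }]
    for i, item in enumerate(receipt.items):
        chunks.append({
            "id": f"{parent_id}#item-{i}",
            "text": (f"{item.quantity} x {item.name} at ${item.price:.2f} "
                     f"({receipt.merchant}, {receipt.date})"),
            "parent_id": parent_id,  # child resolves back to the full receipt
        })
    return chunks
```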
Challenge: Full-size LLMs and heavy framework dependencies were too expensive and slow for per-receipt processing at scale.
Solution: Engineered a cost-effective ingestion pipeline using a quantized SLM (Phi-3.5, 4-bit), reducing ingestion costs by 90%.