I Built an AI Document Intelligence API to Demo to My Internship Students - Here's the Full Story

Navas Herbert

Monday morning. My internship students walk in - Amina, Brian, Njeri, Otieno, Wanjiku and the rest...

Monday morning. My internship students walk in - Amina, Brian, Njeri, Otieno, Wanjiku and the rest of the cohort. I pull up a browser tab, upload a PDF, type a question in plain English, and the system finds the relevant sections and answers it in two seconds.

Then I tell them: "You're building your own version of this. Five days. Skeleton repo is ready. Go."

That's the plan. But before Monday I had to actually build the thing - and it didn't go smoothly. This is the honest story of DataPulse AI: what it does, every tool I tried, every error that broke me, and what finally got it deployed and working end to end.

If you're a junior developer in Nairobi wondering whether you could build something like this - read to the end. The answer is yes. But it'll take some detours.

What Is RAG? (The Open-Book Exam Analogy)

Before the stack, a concept.

RAG stands for Retrieval Augmented Generation. It sounds intimidating. It isn't.

Think of a closed-book exam versus an open-book exam.

A closed-book LLM (like ChatGPT with no plugins) answers from memory - whatever it learned during training. It's fast, but it has no idea what's in your document. Ask it about the contents of your specific company report, your custom dataset, your uploaded PDF - and it'll hallucinate or say it doesn't know.

RAG gives the model an open book. Before answering, it searches your documents, retrieves the most relevant sections, and hands those to the LLM as context. The model doesn't guess - it reads the relevant pages and answers from them.

That's DataPulse AI. Upload any document. Ask a question. The system finds what matters and generates a grounded answer from the actual content.

No hallucinations from thin air. Answers tied to your source material.

The Stack - And Why I Chose Each Piece

Here's what DataPulse AI is built with, and more importantly, why:

FastAPI - the API framework. Fast, async, automatic docs at /docs, Python-native. For a RAG system that needs to handle file uploads, vector searches, and LLM calls, async matters. FastAPI was the obvious call.

PostgreSQL on NeonDB - stores document metadata. Which files have been uploaded, when, what their status is. NeonDB gives you a serverless Postgres instance with a generous free tier. Perfect for a demo project that needs a real database without paying for one.

LanceDB - vector database for semantic search. This is where document chunks live after being embedded. More on why I chose LanceDB specifically in a moment - there's a story there.

fastembed - converts text chunks into vector embeddings. Small, fast, pure Python. Also a story here.

Groq API with llama3 - the LLM that generates the final answer. Groq runs llama3 on custom inference hardware and it is fast. We're talking two seconds for a full response.

Render - backend deployment. Free tier. Enough said.

Vercel - frontend deployment. Clean dark-themed HTML/JS interface. Also free.

The architecture in one sentence: user uploads a document → it gets chunked and embedded into LanceDB → user asks a question → question gets embedded → LanceDB finds the nearest chunks → Groq generates an answer from those chunks → user gets a grounded response.

Simple concept. Getting there was not simple.

The Errors - Every Wall I Hit

This is the part I want every junior developer to read carefully. The finished product looks clean. The journey was not.

Wall 1: Ollama Was Too Heavy to Deploy

I started with Ollama for local LLM inference. It worked beautifully on my machine. ollama run llama3, ask it questions, get answers. Clean.

Then I tried to deploy it.

Ollama's model files are gigabytes. Render's free tier doesn't have the disk space or RAM for that. Not even close. The deployment would time out before the model even loaded.

Local development: great. Cloud deployment on a free tier: not happening.

Switched to: Groq API. Same llama3 model, running on Groq's cloud infrastructure. Free tier available. Response time dropped from 14 seconds (local Ollama on my laptop) to under 2 seconds. That's not a small improvement - that's a completely different user experience.

Wall 2: ChromaDB Killed Windows Students

My first vector database choice was ChromaDB. Popular, well-documented, easy Python API.

The problem: ChromaDB has C++ dependencies under the hood. On Mac and Linux, the pip install works fine. On Windows - which is what most of my students are running - the build errors were brutal.

error: Microsoft Visual C++ 14.0 or greater is required

I was not going to start a student internship with "first, install Visual Studio build tools and configure your PATH." That kills momentum before day one.

Switched to: LanceDB. Pure Python. No C++ anywhere in the dependency tree. pip install lancedb and it works - on Windows, Mac, Linux, no exceptions. For a student project where everyone has different machines, that matters enormously.

Wall 3: sentence-transformers Ate All the RAM

For generating embeddings I started with sentence-transformers - the standard library, widely used, excellent quality.

The RAM usage on startup: 1.5GB.

Render's free tier gives you 512MB.

The deployment would crash every single time the embedding model tried to load. I tried lazy loading, tried quantising, tried everything. The math just didn't work on 512MB.

Switched to: fastembed. Same idea - generates sentence embeddings - but built for efficiency. RAM usage on startup: around 50MB. The quality is slightly lower than sentence-transformers for some tasks, but for document Q&A on a free-tier server? Completely fine. It runs. That matters more.

Wall 4: The First Deployment That Actually Stuck

After switching all three components - Groq for LLM, LanceDB for vectors, fastembed for embeddings - I pushed to Render.

It deployed. The health check passed. I uploaded a test document. I asked a question.

The API returned an answer in 1.8 seconds.

I sat there for a moment and just looked at it.

The Moment It Worked

The test document was a short PDF about Nairobi's public transport system - matatus, routes, fares, the usual chaos.

I asked: "What are the main challenges with matatu operations in Nairobi?"

The system:

Embedded the question into a vector
Searched LanceDB for the three most semantically similar chunks from the uploaded document
Sent those chunks as context to Groq's llama3
Got back a structured, grounded answer in 1.8 seconds

The answer cited specific sections of the document. It didn't make anything up. It said "based on the document" and then actually referred to the document.

That's when RAG stopped being a concept and became something real.

What the API Actually Does

A quick look at the core endpoints:

Upload a document:

POST /documents/upload
Content-Type: multipart/form-data
file: [your .txt or .pdf file]

The system chunks the document into overlapping segments, embeds each chunk with fastembed, and stores the vectors in LanceDB alongside the chunk text. PostgreSQL records the document metadata.

Ask a question:

POST /documents/{doc_id}/query
{
  "question": "What does this document say about deployment costs?"
}

Response:

{
  "answer": "According to the document, deployment costs depend primarily on...",
  "sources": [
    { "chunk": "...relevant section text...", "score": 0.91 },
    { "chunk": "...second relevant section...", "score": 0.87 }
  ],
  "response_time_ms": 1823
}

The sources array is important. Students can see which chunks the model used to generate the answer. That's the grounding — not just an answer, but evidence.

List documents:

GET /documents/

Returns all uploaded documents with metadata. The frontend uses this to populate the document selector.

What Students Will Build: The 5-Day Internship Project

Monday morning I demo the finished product. Then I hand them the skeleton repo.

Here's the five-day breakdown:

Day 1 - Setup & Document Ingestion
Get the project running locally. Implement the document upload endpoint. Connect to NeonDB. Chunk text documents into overlapping segments. No vectors yet - just get files in and stored.

Day 2 - Embeddings & Vector Storage
Integrate fastembed. Embed each chunk on upload. Store vectors in LanceDB. By end of day, uploaded documents should be fully embedded and searchable.

Day 3 - Query Pipeline
Build the question-answering endpoint. Embed the incoming question. Retrieve the top-k most similar chunks from LanceDB. Return the raw chunks with similarity scores. No LLM yet - just search.

Day 4 - LLM Integration & Answer Generation
Connect Groq. Pass the retrieved chunks as context. Prompt the model to answer the question grounded in the provided sections only. Full pipeline: question in, grounded answer out.

Day 5 - Frontend, Testing & Demo Prep
Build a simple HTML/JS frontend that talks to the API. Upload a document, ask questions, display answers and sources. Deploy on Vercel. Prep a 5-minute demo for the cohort.

Five days. One working RAG system. Built from scratch.

Lessons for Builders

Four honest takeaways from this build:

1. Deploy early, deploy often - not at the end.
I built almost the entire system locally before trying to deploy. That's why the RAM and C++ issues blindsided me. If I'd tried to deploy after Day 1, I'd have hit those walls immediately and switched tools before writing any application code. Now I teach this: get a hello world deployed first. Then build.

2. Free-tier constraints are features, not bugs.
Render's 512MB RAM limit forced me to find fastembed, which is genuinely a better choice for this use case than sentence-transformers. The constraint made the system leaner and faster. Sometimes the limits push you toward better solutions.

3. Pure Python dependencies win on student projects.
ChromaDB was technically fine. But one C++ build error on one student's Windows machine would have cost half a day of troubleshooting before anyone wrote a line of application code. LanceDB being pure Python is not just a technical choice - it's a teaching choice. Reduce friction at setup, maximise time on actual learning.

4. Groq is genuinely fast and the free tier is real.
I expected the free Groq tier to be throttled or limited enough to feel broken. It isn't. Two-second responses on llama3, reasonable rate limits, clean API. For a demo project or a student internship, it's exactly right. Use it.

Try It / See the Code

The full system is live:

API docs: https://datapulse-ai-zsv7.onrender.com/docs
GitHub: https://github.com/Navashub/datapulse-ai

Upload a .txt or .pdf file and ask it a question. The API docs are interactive - you can test every endpoint directly from the browser without writing any code.

If you want to build your own version: start with the API docs, understand the three endpoints, then implement them one at a time in the order above. The skeleton repo will be public after Monday's demo.

What Monday Looks Like

8:30am. Students file in. I pull up the live UI.

I upload a PDF - something relevant to them, probably a Kenya tech industry report or a data engineering reference doc. I ask a question. The answer comes back in under two seconds, with sources highlighted.

Then I say: "Your skeleton repo is in the chat. You have five days."

And we'll see what they build.

I'll write about how that goes next week.

I'm a data trainer in Nairobi running a full data programme -
Python foundations → Data Science or Data Engineering specialisations.
I write weekly about what we covered, what worked, and what surprised me.
Follow along or drop your questions in the comments.