From Zero to RAG: Implementing Retrieval-Augmented Generation in a Laravel Application

Marcc Atayde


Imagine you've deployed a sleek AI chatbot for a client — it answers questions confidently, users love it, and then someone asks about a company policy updated last Tuesday. The bot hallucinates an answer so wrong it nearly causes a compliance incident. This is the core problem that Retrieval-Augmented Generation (RAG) solves, and it's something every developer building LLM-powered features needs to understand deeply.

In this guide, we'll build a working RAG pipeline inside a Laravel application — from chunking documents and generating embeddings, to storing them in a vector database and wiring everything together with a streaming chat interface.

What RAG Actually Does (And Why It Matters)

Large Language Models are trained on static datasets. They don't know what happened yesterday, they don't know your client's internal documentation, and they confidently make things up when they hit the edge of their knowledge. RAG fixes this by injecting relevant, retrieved context into the prompt before the model generates a response.

The pipeline looks like this:

  1. Ingest — Split your documents into chunks and convert them into vector embeddings
  2. Store — Save those embeddings in a vector database
  3. Retrieve — On each user query, find the most semantically similar chunks
  4. Generate — Pass the retrieved chunks as context to the LLM and stream the answer back

The model stops guessing and starts reasoning over your data.
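The rest of this guide implements those four steps in Laravel, but the shape of the pipeline is language-agnostic. Here it is in miniature as a Python sketch — the `embed` function is a toy bag-of-words stand-in, not a real embedding model:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy embedding: a bag-of-words vector. In production this would be
    # an embedding model call (e.g. text-embedding-3-small).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingest + 2. Store: chunk documents and keep their vectors
docs = ["refunds are processed within 14 days",
        "support is available monday to friday"]
store = [(chunk, embed(chunk)) for chunk in docs]

# 3. Retrieve: rank stored chunks by similarity to the query
query = "how long do refunds take"
ranked = sorted(store, key=lambda item: cosine(embed(query), item[1]), reverse=True)

# 4. Generate: the top chunk becomes the context in the LLM prompt
context = ranked[0][0]
print(context)  # refunds are processed within 14 days
```

Swap the toy `embed` for a real embedding API and the in-memory list for a vector store, and you have the full architecture.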

Setting Up the Laravel Project

We'll use Laravel with the OpenAI PHP client, and pgvector as our vector store (PostgreSQL extension — free, production-ready, no external service required).

composer require openai-php/laravel
php artisan vendor:publish --provider="OpenAI\Laravel\ServiceProvider"

Add your key to .env:

OPENAI_API_KEY=sk-...

Enable pgvector in your database:

CREATE EXTENSION IF NOT EXISTS vector;

Create the migration for document chunks:

// database/migrations/xxxx_create_document_chunks_table.php
public function up(): void
{
    Schema::create('document_chunks', function (Blueprint $table) {
        $table->id();
        $table->foreignId('document_id')->constrained()->cascadeOnDelete();
        $table->text('content');
        $table->string('source')->nullable();
        $table->vector('embedding', 1536); // text-embedding-3-small dimensions
        $table->timestamps();
    });

    DB::statement(
        'CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)'
    );
}

Note: Laravel's schema builder doesn't ship a vector column type; it comes from a package:

composer require tpetry/laravel-postgresql-enhanced

Step 1 — Document Ingestion and Embedding

Here's a reusable service that takes raw text, splits it into overlapping chunks, and stores embeddings:

// app/Services/DocumentIngestionService.php
namespace App\Services;

use App\Models\DocumentChunk;
use OpenAI\Laravel\Facades\OpenAI;

class DocumentIngestionService
{
    private int $chunkSize = 500;   // characters
    private int $overlap   = 100;

    public function ingest(int $documentId, string $text, string $source = ''): void
    {
        $chunks = $this->splitIntoChunks($text);

        // Batch embed — OpenAI allows up to 2048 inputs per request
        $response = OpenAI::embeddings()->create([
            'model' => 'text-embedding-3-small',
            'input' => $chunks,
        ]);

        foreach ($response->embeddings as $index => $embedding) {
            DocumentChunk::create([
                'document_id' => $documentId,
                'content'     => $chunks[$index],
                'source'      => $source,
                'embedding'   => json_encode($embedding->embedding),
            ]);
        }
    }

    private function splitIntoChunks(string $text): array
    {
        $chunks = [];
        $length = mb_strlen($text); // multibyte-safe: don't split UTF-8 characters
        $start  = 0;

        while ($start < $length) {
            $chunk    = mb_substr($text, $start, $this->chunkSize);
            $chunks[] = trim($chunk);
            $start   += ($this->chunkSize - $this->overlap);
        }

        // Re-index after filtering so the array JSON-encodes as a list and
        // indices stay aligned with the embeddings response in ingest().
        return array_values(array_filter($chunks));
    }
}

Why overlapping chunks? Splitting on hard boundaries breaks sentences mid-thought. A 100-character overlap ensures concepts that straddle chunk boundaries still get captured in at least one chunk.
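The same sliding-window logic, shown here as a Python sketch with the window scaled down so the overlap is visible (the PHP service above uses a 500-character window and 100-character overlap):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        # Each window starts `overlap` characters before the previous one ended.
        chunks.append(text[start:start + chunk_size].strip())
        start += step
    return [c for c in chunks if c]  # drop empty trailing chunks

# Scaled-down example: window of 10 chars, overlap of 4
chunks = split_into_chunks("abcdefghijklmnop", chunk_size=10, overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnop']
```

Note how `ghij` appears at the end of the first chunk and the start of the second: any sentence straddling that boundary survives intact in at least one of them.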

Step 2 — Semantic Retrieval

When a user sends a query, we embed it and find the closest chunks using cosine similarity:

// app/Services/RetrievalService.php
namespace App\Services;

use App\Models\DocumentChunk;
use Illuminate\Support\Collection;
use OpenAI\Laravel\Facades\OpenAI;

class RetrievalService
{
    public function retrieve(string $query, int $topK = 5): Collection
    {
        $response = OpenAI::embeddings()->create([
            'model' => 'text-embedding-3-small',
            'input' => $query,
        ]);

        $queryVector = json_encode($response->embeddings[0]->embedding);

        // pgvector cosine distance operator: <=>
        return DocumentChunk::selectRaw(
                'id, content, source, 1 - (embedding <=> ?) AS similarity',
                [$queryVector]
            )
            ->orderByDesc('similarity')
            ->limit($topK)
            ->get();
    }
}

The <=> operator in pgvector computes cosine distance natively in Postgres — no Python microservice, no external vector DB subscription needed for most production workloads.
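To be precise about the arithmetic: `<=>` returns cosine *distance*, so the query's `1 - (embedding <=> ?)` recovers cosine *similarity*. A quick Python check of that relationship:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard definition: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a, b = [1.0, 0.0], [1.0, 1.0]
similarity = cosine_similarity(a, b)  # cos(45°) ≈ 0.7071
distance = 1 - similarity             # what pgvector's <=> returns for these vectors
print(round(1 - distance, 4))  # 0.7071 — the value SELECT 1 - (a <=> b) would yield
```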

Step 3 — The RAG Chat Controller

Now we wire retrieval into the generation step:

// app/Http/Controllers/ChatController.php
public function ask(Request $request, RetrievalService $retrieval): StreamedResponse
{
    $query   = $request->validate(['message' => 'required|string|max:1000'])['message'];
    $chunks  = $retrieval->retrieve($query);

    $context = $chunks->pluck('content')->implode("\n\n---\n\n");

    $systemPrompt = <<<PROMPT
    You are a helpful assistant. Answer the user's question using ONLY the context below.
    If the answer is not in the context, say you don't have enough information.

    Context:
    {$context}
    PROMPT;

    return response()->stream(function () use ($systemPrompt, $query) {
        $stream = OpenAI::chat()->createStreamed([
            'model'    => 'gpt-4o-mini',
            'messages' => [
                ['role' => 'system',  'content' => $systemPrompt],
                ['role' => 'user',    'content' => $query],
            ],
        ]);

        foreach ($stream as $response) {
            $text = $response->choices[0]->delta->content ?? '';
            echo "data: " . json_encode(['text' => $text]) . "\n\n";
            if (ob_get_level() > 0) {
                ob_flush(); // only flush PHP's buffer when one is active
            }
            flush();
        }

        echo "data: [DONE]\n\n";
    }, 200, [
        'Content-Type'  => 'text/event-stream',
        'Cache-Control' => 'no-cache',
        'X-Accel-Buffering' => 'no',
    ]);
}

The streaming response uses Server-Sent Events (SSE), which pairs perfectly with Alpine.js on the frontend for a real-time typing effect without WebSockets.

Step 4 — Connecting the Frontend with Alpine.js

<div x-data="chatBot()">
    <div x-html="response" class="prose"></div>
    <input x-model="message" @keydown.enter="send" placeholder="Ask anything..." />
</div>

<script>
function chatBot() {
    return {
        message: '',
        response: '',
        async send() {
            this.response = '';
            const es = new EventSource(`/chat?message=${encodeURIComponent(this.message)}`);
            es.onmessage = (e) => {
                if (e.data === '[DONE]') { es.close(); return; }
                this.response += JSON.parse(e.data).text;
            };
            this.message = '';
        }
    }
}
</script>

Tuning Tips for Production

Chunk strategy matters more than the model. Badly chunked documents produce irrelevant retrievals regardless of how powerful your LLM is. For structured content like FAQs, chunk by question-answer pair rather than by character count.
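For instance, FAQ content with a predictable `Q:`/`A:` layout can be split into one chunk per pair instead of by character count. A small sketch, assuming that layout (the regex would change for your actual format):

```python
import re

def chunk_faq(text: str) -> list[str]:
    # Split before each "Q:" so every chunk holds one question plus its answer.
    parts = re.split(r"(?=^Q:)", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

faq = """Q: What is the refund window?
A: 14 days from delivery.
Q: Do you ship internationally?
A: Yes, to over 40 countries."""

chunks = chunk_faq(faq)
print(len(chunks))  # 2
```

Each chunk is now a semantically complete unit, so a query about refunds retrieves the whole answer rather than half a sentence from a character-boundary split.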

Add metadata filtering. If your system serves multiple clients or document categories, add a tenant_id or category column and filter before the vector search. This dramatically improves precision and prevents cross-contamination of context.
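The idea in miniature: restrict the candidate set with an exact-match filter first, then rank only the survivors by similarity. In SQL this is just an extra `WHERE tenant_id = ?` before the similarity ordering; here is the logic sketched with an in-memory list standing in for the `document_chunks` table (the similarity values are precomputed stand-ins):

```python
chunks = [
    {"tenant_id": 1, "content": "Acme refund policy",   "similarity": 0.91},
    {"tenant_id": 2, "content": "Globex refund policy", "similarity": 0.95},
    {"tenant_id": 1, "content": "Acme shipping times",  "similarity": 0.62},
]

def retrieve(tenant_id: int, top_k: int = 5) -> list[dict]:
    # Filter first: a chunk from another tenant must never reach the prompt,
    # no matter how similar it scores.
    candidates = [c for c in chunks if c["tenant_id"] == tenant_id]
    return sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:top_k]

results = retrieve(tenant_id=1)
print([c["content"] for c in results])  # ['Acme refund policy', 'Acme shipping times']
```

Note that the Globex chunk scored highest overall (0.95) and was still excluded — which is exactly the point.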

Rerank retrieved chunks. For high-stakes applications, pass the top 10 retrieved chunks through a cross-encoder reranker (Cohere Rerank or a local model) and only send the top 3 to the LLM. This cuts hallucinations further.
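The retrieve-then-rerank shape, sketched below with a stub scorer. `cross_encoder_score` here is a placeholder faked with word overlap so the sketch runs; in practice it would be a call to a real reranker such as Cohere Rerank or a local cross-encoder model:

```python
def cross_encoder_score(query: str, chunk: str) -> float:
    # Placeholder: a real cross-encoder scores the (query, chunk) pair jointly.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, retrieved: list[str], keep: int = 3) -> list[str]:
    # Stage 1 (vector search) produced `retrieved`; stage 2 re-scores and trims.
    scored = sorted(retrieved, key=lambda ch: cross_encoder_score(query, ch), reverse=True)
    return scored[:keep]

top10 = ["refund window is 14 days", "shipping takes 5 days", "we accept returns",
         "refunds issued to original payment", "support hours are 9-5"]
top3 = rerank("how do refunds work", top10)
print(len(top3))  # 3
```

The two-stage design keeps the expensive pairwise scoring off the full corpus: the cheap vector search narrows millions of chunks to ten, and the reranker only ever sees those ten.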

This exact architecture — pgvector, Laravel, and streaming SSE — is something we've deployed at HanzWeb.ae for client knowledge bases across industries from legal to hospitality, and pgvector consistently handles hundreds of thousands of vectors without needing to reach for a dedicated vector DB like Pinecone.

Conclusion

RAG isn't magic — it's an engineering pattern. The quality of your pipeline comes down to three things: how you split documents, how precisely you retrieve context, and how clearly you instruct the model to stay grounded in that context. Get those three right, and you've built an AI feature that actually earns user trust rather than destroying it.

Start with a small document set, instrument your retrieval similarity scores, and iterate on your chunking strategy before scaling up. The model is almost never the bottleneck — your data preparation is.