How I Built a Search Engine (And Finally Understood Google)

Last Tuesday, I was debugging why our app's search feature kept showing cat videos when users searched for "database optimization." Classic Tuesday. Instead of diving into Elasticsearch docs (again), I did something weird: I grabbed index cards, Post-its, and built a physical search engine on my kitchen table.

What I learned in those 3 hours changed how I think about every search query I write.

The Library Metaphor Everyone Gets Wrong

We always compare search engines to librarians. Wrong metaphor.

Search engines are more like that friend who always knows which restaurant to recommend. They don't just remember every restaurant (indexing). They remember:

  • Which ones your other friends loved
  • Which ones are busy on Friday nights
  • Which ones have parking
  • Which one you mentioned last month
  • That you're vegetarian

This is ranking. And it's where the magic happens.

My Kitchen Experiment: Building PageRank with Cards

Here's what I did. You can do this too (seriously, it helps).

Materials:

  • 20 index cards = web pages
  • Arrows drawn with marker = links between pages
  • Post-it notes = search queries
  • Coffee = mandatory

The Setup:

[Card A: "Python Tutorial"] ←─┐
         ↓                     │
[Card B: "Django Guide"]       │
         ↓                     │
[Card C: "Web Dev Blog"] ──────┘
         ↓
[Card D: "Random Page"]

When I searched for "Python learning," which card should win?

Naive me at 2 AM: "Card A says 'Python Tutorial' - that's literally the query!"

Slightly smarter me at 2:15 AM: "Wait... Card C has THREE cards linking to it. People trust it more."

I just rediscovered PageRank at my kitchen table.
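
Here's that 2:15 AM moment as code. A minimal sketch (four of my twenty cards, wired so three of them point at Card C): counting incoming links gives a crude popularity score, which is exactly the intuition PageRank formalizes.

from collections import Counter

# Four kitchen-table cards: each card lists the cards it points at
cards = {
    'A: Python Tutorial': ['C: Web Dev Blog'],
    'B: Django Guide':    ['C: Web Dev Blog'],
    'C: Web Dev Blog':    ['A: Python Tutorial'],
    'D: Random Page':     ['C: Web Dev Blog'],
}

# Crude popularity: how many cards point at each one?
in_links = Counter(target for links in cards.values() for target in links)
print(in_links.most_common())
# [('C: Web Dev Blog', 3), ('A: Python Tutorial', 1)]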

The Three-Stage Pipeline (How Google Actually Works)

Stage 1: Crawling (The Stalker Phase)

# Oversimplified crawler
import re
import time
import requests

def extract_links(html):
    # Naive href extraction; a real crawler would use an HTML parser
    return re.findall(r'href="(https?://[^"]+)"', html)

visited = set()
to_visit = ["https://example.com"]
index = {}

while to_visit:
    url = to_visit.pop()
    if url in visited:
        continue

    page = requests.get(url, timeout=10)
    visited.add(url)

    # Extract links once, queue the new ones
    links = extract_links(page.text)
    to_visit.extend(links)

    # Store in index
    index[url] = {
        'content': page.text,
        'links': links,
        'timestamp': time.time()
    }

    time.sleep(1)  # rate limiting: the lesson I learned the hard way (see below)

Real-world gotcha: Googlebot crawls billions of pages a day, and an unthrottled crawler can flatten a site. My first production site crashed because I didn't implement rate limiting. The CEO was... not happy.

Stage 2: Indexing (The Librarian Phase)

This is where search engines build their "inverted index" - a fancy term for "every word points to every page where it appears."

# Building an inverted index
# (tokenize() and get_context() are simple helpers: lowercase-and-split,
#  and "grab a few words around the match" for snippets)
from collections import defaultdict

inverted_index = defaultdict(list)

for doc_id, document in documents.items():
    words = tokenize(document.content)

    for position, word in enumerate(words):
        inverted_index[word].append({
            'doc_id': doc_id,
            'position': position,
            'context': get_context(words, position)
        })

# Now searching is instant:
# "python" → [doc1, doc5, doc89, doc234...]

Mind-blowing fact: Google's index is estimated at 100+ petabytes. That's 100,000,000 GB. Your entire life's photos? Maybe 500 GB.

Stage 3: Ranking (The Magic Phase)

This is where I spent most of my kitchen time. And it's not just PageRank anymore.

The Modern Ranking Formula (200+ Signals)

Google doesn't use a single algorithm. It's a committee of algorithms voting. Here are the big players:

1. TF-IDF (Term Frequency × Inverse Document Frequency)

The "does this page actually talk about what you're searching for" algorithm.

import math

def calculate_tfidf(term, document, all_documents):
    # Term Frequency: how often does the word appear in this document?
    tf = document.count(term) / len(document)

    # Inverse Document Frequency: is this word rare/special across the corpus?
    docs_with_term = sum(1 for doc in all_documents if term in doc)
    idf = math.log(len(all_documents) / (1 + docs_with_term))

    return tf * idf

# Example (every document is tokenized into a list of words):
doc = "python tutorial python programming python guide".split()
all_docs = [doc, "javascript tutorial".split(), "java guide".split()]

score = calculate_tfidf("python", doc, all_docs)
print(f"Score: {score:.4f}")  # High: "python" appears often here but rarely elsewhere

Translation: Common words like "the" score low. Specific words like "Django" score high. The word "python" in a Python tutorial? Very high.
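
A quick sanity check with the function above (toy corpus, tokenized with split()): a word that appears in every document can even go negative with this IDF smoothing, while a rare, on-topic word stays positive.

# "the" is in every document; "django" is in only one
docs = [
    "the django tutorial covers the basics".split(),
    "the javascript guide covers the basics".split(),
    "the java guide covers the basics".split(),
]

for term in ("the", "django"):
    print(term, round(calculate_tfidf(term, docs[0], docs), 4))
# the -0.0959    ← penalized: appears everywhere
# django 0.0676  ← rewarded: rare and on-topic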

2. BM25 (The Better TF-IDF)

TF-IDF has a problem: a page that says "python" 100 times doesn't deserve to rank 100x higher than one that says it 10 times.

def bm25_score(term, document, all_documents, k1=1.5, b=0.75):
    # Term frequency with diminishing returns
    tf = document.count(term)
    doc_length = len(document)
    avg_doc_length = sum(len(d) for d in all_documents) / len(all_documents)

    # IDF component
    docs_with_term = sum(1 for doc in all_documents if term in doc)
    idf = math.log((len(all_documents) - docs_with_term + 0.5) / (docs_with_term + 0.5))

    # The magic formula
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * (doc_length / avg_doc_length))

    return idf * (numerator / denominator)

This is what Elasticsearch uses by default. It's beautiful because:

  • First mention of "python" = big boost
  • 10th mention = tiny boost
  • 100th mention = almost nothing (the quick check below shows the saturation)
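
You can watch the saturation happen with the bm25_score function above (toy 100-word documents I made up):

# Score a document where "python" appears 1, 10, then 100 times
for tf_target in (1, 10, 100):
    doc = ["python"] * tf_target + ["tutorial"] * (100 - tf_target)
    docs = [doc, ["javascript"] * 100, ["java"] * 100]
    print(tf_target, round(bm25_score("python", doc, docs), 3))
# 1   0.511  ← the first mentions earn the most
# 10  1.110
# 100 1.258  ← 10x more mentions, barely any gain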

3. PageRank (The Popularity Contest)

def calculate_pagerank(graph, damping=0.85, iterations=100):
    """
    Graph = {page_id: [list_of_pages_it_links_to]}
    """
    num_pages = len(graph)
    pagerank = {page: 1.0 / num_pages for page in graph}

    for _ in range(iterations):
        new_ranks = {}

        for page in graph:
            # Base probability: random surfer teleports
            rank = (1 - damping) / num_pages

            # Add rank from incoming links
            for other_page, links in graph.items():
                if page in links:
                    rank += damping * (pagerank[other_page] / len(links))

            new_ranks[page] = rank

        pagerank = new_ranks

    return pagerank

# Example graph
web_graph = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['C']
}

ranks = calculate_pagerank(web_graph)
for page, score in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(f"{page}: {score:.4f}")

Output:

C: 0.3941  ← Most linked-to page wins
A: 0.3725
B: 0.1958
D: 0.0375  ← No incoming links: just the teleport baseline, (1 - 0.85) / 4

4. The Other 197 Signals

Google considers:

  • Freshness: News articles decay fast; documentation stays relevant (a decay sketch follows this list)
  • Click-through rate: If everyone skips your result, you drop
  • Dwell time: Users staying on page = good signal
  • Mobile-friendliness: 60% of searches are mobile
  • HTTPS: Security matters
  • Page speed: Slow sites rank lower
  • Domain authority: Stack Overflow > random-blog-2009.com
  • User location: "pizza" shows nearby results
  • Search history: Your past clicks influence future results
  • Entity recognition: Understanding "Python" (language) vs "python" (snake)
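
Freshness is the easiest of these to sketch. One common approach (my numbers, definitely not Google's) is an exponential decay multiplier with a half-life tuned per content type:

import time

def freshness_boost(published_ts, half_life_days=30.0):
    # Multiplier that halves every half_life_days
    age_days = (time.time() - published_ts) / 86400
    return 0.5 ** (age_days / half_life_days)

# The same 7-day-old page under two policies:
day = 86400
week_ago = time.time() - 7 * day
print(round(freshness_boost(week_ago, half_life_days=2), 3))    # news policy: ~0.088
print(round(freshness_boost(week_ago, half_life_days=365), 3))  # docs policy: ~0.987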

The Real-World Implementation

Here's a simplified but functional search engine:

from dataclasses import dataclass
from typing import List, Dict
import math

@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str
    score: float

class SimpleSearchEngine:
    def __init__(self):
        self.index = {}
        self.documents = {}
        self.pagerank = {}

    def add_document(self, doc_id: str, title: str, content: str, links: List[str]):
        """Index a document"""
        self.documents[doc_id] = {
            'title': title,
            'content': content,
            'links': links
        }

        # Build inverted index
        words = self._tokenize(content + " " + title)
        for word in words:
            if word not in self.index:
                self.index[word] = []
            self.index[word].append(doc_id)

    def calculate_pagerank(self):
        """Calculate PageRank for all documents"""
        graph = {doc_id: doc['links'] for doc_id, doc in self.documents.items()}
        # Use the pagerank function from above
        self.pagerank = calculate_pagerank(graph)

    def search(self, query: str, top_k: int = 10) -> List[SearchResult]:
        """Search and rank results"""
        words = self._tokenize(query)

        # Find candidate documents
        candidates = set()
        for word in words:
            if word in self.index:
                candidates.update(self.index[word])

        # Score each candidate
        scored_results = []
        for doc_id in candidates:
            doc = self.documents[doc_id]

            # Combine multiple signals
            text_score = self._calculate_bm25(words, doc)
            popularity_score = self.pagerank.get(doc_id, 0)

            # Weighted combination
            final_score = (
                0.7 * text_score +          # Content relevance
                0.3 * popularity_score       # Popularity
            )

            scored_results.append(SearchResult(
                url=doc_id,
                title=doc['title'],
                snippet=self._create_snippet(doc['content'], words),
                score=final_score
            ))

        # Sort by score and return top K
        scored_results.sort(key=lambda x: x.score, reverse=True)
        return scored_results[:top_k]

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization"""
        return text.lower().split()

    def _calculate_bm25(self, query_terms: List[str], document: Dict) -> float:
        """Simplified BM25 scoring"""
        content = document['content'] + " " + document['title']
        score = 0

        for term in query_terms:
            tf = content.lower().count(term)
            if tf > 0:
                # Simplified BM25
                score += math.log(1 + tf)

        return score

    def _create_snippet(self, content: str, query_terms: List[str]) -> str:
        """Create search result snippet"""
        words = content.split()
        for i, word in enumerate(words):
            if word.lower() in query_terms:
                start = max(0, i - 10)
                end = min(len(words), i + 10)
                return "..." + " ".join(words[start:end]) + "..."
        return content[:100] + "..."

# Usage
engine = SimpleSearchEngine()

# Add some documents
engine.add_document(
    "python-tutorial",
    "Python Tutorial for Beginners",
    "Learn Python programming from scratch. Python is easy to learn.",
    links=["django-guide"]
)

engine.add_document(
    "django-guide", 
    "Django Web Framework Guide",
    "Django is a Python web framework. Build web apps with Django.",
    links=["python-tutorial"]
)

engine.add_document(
    "javascript-intro",
    "JavaScript Introduction", 
    "JavaScript is a programming language for web browsers.",
    links=[]
)

# Calculate PageRank
engine.calculate_pagerank()

# Search
results = engine.search("python programming")
for i, result in enumerate(results, 1):
    print(f"{i}. {result.title}")
    print(f"   URL: {result.url}")
    print(f"   Score: {result.score:.4f}")
    print(f"   {result.snippet}")
    print()

Output:

1. Python Tutorial for Beginners
   URL: python-tutorial
   Score: 1.5556
   ...Learn Python programming from scratch. Python is easy to learn....

2. Django Web Framework Guide
   URL: django-guide
   Score: 0.5852
   ...Django is a Python web framework. Build web apps with Django....

3. JavaScript Introduction
   URL: javascript-intro
   Score: 0.5002
   ...JavaScript is a programming language for web browsers....

The Lessons I Learned (At 4 AM)

1. Relevance ≠ Popularity

Just because everyone links to a page doesn't mean it answers your question. Good search engines balance both:

Final Score = α × Content_Match + β × Popularity + γ × Freshness + ...

Tweak those Greek letters (α, β, γ) and you change everything.
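
A tiny demonstration with made-up scores: shift the weight from content match (α) to popularity (β), and the same two documents swap places.

# Two docs: one matches the query text well, one is merely popular
docs = {
    "niche-blog": {"content": 0.9, "popularity": 0.1},
    "big-site":   {"content": 0.4, "popularity": 0.9},
}

for alpha, beta in [(0.8, 0.2), (0.3, 0.7)]:
    winner = max(docs, key=lambda d: alpha * docs[d]["content"] + beta * docs[d]["popularity"])
    print(f"alpha={alpha}, beta={beta} -> winner: {winner}")
# alpha=0.8, beta=0.2 -> winner: niche-blog
# alpha=0.3, beta=0.7 -> winner: big-site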

2. Context Is King

The word "bank" means different things in:

  • "bank account"
  • "river bank"
  • "memory bank"

Modern search engines use semantic search (embeddings, BERT) to understand context. That's why searching "best restaurants nearby" works even though no page says "best restaurants nearby [your exact location]."
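
Under the hood, semantic search boils down to comparing embedding vectors with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real models like BERT output hundreds of dimensions):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Toy embeddings: phrases with similar meanings get similar vectors
vec = {
    "bank account":     [0.9, 0.1, 0.0],
    "checking account": [0.8, 0.2, 0.1],
    "river bank":       [0.1, 0.9, 0.0],
}

print(round(cosine(vec["bank account"], vec["checking account"]), 3))  # 0.984: same sense
print(round(cosine(vec["bank account"], vec["river bank"]), 3))        # 0.220: different sense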

3. Ranking Is Never "Done"

Google updates its algorithm 500-600 times per year. Why?

  • New spam techniques emerge
  • User behavior changes
  • Better ML models appear
  • Competitors innovate

4. The Cold Start Problem

New pages have no PageRank. How do they rank?

Google uses:

  • Domain authority (inheritance)
  • Content freshness bonus
  • Social signals
  • Manual editorial review (for important queries)

Practical Takeaways for Developers

If You're Building Search:

  1. Start with Elasticsearch or Meilisearch
   # Don't reinvent the wheel
   docker run -p 9200:9200 -e "discovery.type=single-node" elasticsearch:8.11.0

  2. Use BM25 as your baseline
    It's the industry standard for good reason.

  3. Add domain-specific signals (a hypothetical blend follows this list)
    E-commerce? Factor in price, reviews, stock.
    News site? Prioritize freshness.

  4. A/B test everything
    Users click your #1 result roughly 28% of the time. If that drops to 20%, investigate.
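
For the e-commerce case, here's a hypothetical blend (the function, signal names, and weights are all mine; those weights are exactly the knobs you'd A/B test):

import math

def product_score(text_score, rating, num_reviews, in_stock):
    # Reviews count logarithmically: 1,000 reviews isn't 100x better than 10
    review_signal = (rating / 5.0) * math.log(1 + num_reviews)
    stock_signal = 1.0 if in_stock else 0.0
    return 0.6 * text_score + 0.3 * review_signal + 0.1 * stock_signal

# A well-reviewed, in-stock product gets a boost over a pure text match
print(round(product_score(text_score=2.1, rating=4.5, num_reviews=320, in_stock=True), 3))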

If You're Optimizing for Search:

  1. Write for humans, optimize for robots

    • Clear headers (H1, H2)
    • Descriptive titles
    • Natural keyword usage
    • Internal linking
  2. Technical SEO matters

   <!-- Good -->
   <title>Python Tutorial for Beginners | Learn Programming</title>
   <meta name="description" content="Step-by-step Python tutorial...">

   <!-- Bad -->
   <title>Page 1</title>
  3. Get linked from authority sites
    One link from Stack Overflow > 100 links from spam sites.

The Visualization That Changed My Mind

Here's how I visualize ranking now:

Query: "python tutorial"

┌─────────────────────────────────────┐
│  Candidate Documents (1000s)        │
└─────────────────┬───────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
   ┌────▼─────┐      ┌─────▼────┐
   │ Content  │      │ Signals  │
   │ Matching │      │ (200+)   │
   └────┬─────┘      └─────┬────┘
        │                   │
        │  ┌───────────────┐│
        └──▶ ML Ranker     ◀┘
           │ (RankBrain)   │
           └───────┬───────┘
                   │
           ┌───────▼───────┐
           │ Top 10 Results│
           └───────────────┘

Try This Yourself

Weekend project: Build a tiny search engine for your blog or docs site.

Starter code (10-minute version):

# pip install whoosh
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
import os

# 1. Define schema
schema = Schema(title=TEXT(stored=True), content=TEXT)

# 2. Create index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)

# 3. Add documents
writer = ix.writer()
writer.add_document(
    title="Python Tutorial",
    content="Learn Python programming basics and advanced concepts"
)
writer.add_document(
    title="JavaScript Guide", 
    content="Modern JavaScript development techniques"
)
writer.commit()

# 4. Search
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("python")
    results = searcher.search(query)
    for result in results:
        print(result['title'])

Resources to Go Deeper