How I Built a Search Engine (And Finally Understood Google)

Last Tuesday, I was debugging why our app's search feature kept showing cat videos when users searched for "database optimization." Classic Tuesday. Instead of diving into Elasticsearch docs (again), I did something weird: I grabbed index cards, Post-its, and built a physical search engine on my kitchen table.

What I learned in those 3 hours changed how I think about every search query I write.

The Library Metaphor Everyone Gets Wrong

We always compare search engines to librarians. Wrong metaphor.

Search engines are more like that friend who always knows which restaurant to recommend. They don't just remember every restaurant (indexing). They remember:

  • Which ones your other friends loved
  • Which ones are busy on Friday nights
  • Which ones have parking
  • Which one you mentioned last month
  • That you're vegetarian

This is ranking. And it's where the magic happens.

My Kitchen Experiment: Building PageRank with Cards

Here's what I did. You can do this too (seriously, it helps).

Materials:

  • 20 index cards = web pages
  • Arrows drawn with marker = links between pages
  • Post-it notes = search queries
  • Coffee = mandatory

The Setup:

[Card A: "Python Tutorial"] ←─┐
         ↓                     │
[Card B: "Django Guide"]       │
         ↓                     │
[Card C: "Web Dev Blog"] ──────┘
         ↓
[Card D: "Random Page"]

When I searched for "Python learning," which card should win?

Naive me at 2 AM: "Card A says 'Python Tutorial' - that's literally the query!"

Slightly smarter me at 2:15 AM: "Wait... Card C has THREE cards linking to it. People trust it more."

I just rediscovered PageRank at my kitchen table.
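
Here's that 2:15 AM moment as code. A minimal sketch (four of my twenty cards, wired so three of them point at Card C): counting incoming links gives a crude popularity score, which is exactly the intuition PageRank formalizes.

from collections import Counter

# Four kitchen-table cards: each card lists the cards it points at
cards = {
    'A: Python Tutorial': ['C: Web Dev Blog'],
    'B: Django Guide':    ['C: Web Dev Blog'],
    'C: Web Dev Blog':    ['A: Python Tutorial'],
    'D: Random Page':     ['C: Web Dev Blog'],
}

# Crude popularity: how many cards point at each one?
in_links = Counter(target for links in cards.values() for target in links)
print(in_links.most_common())
# [('C: Web Dev Blog', 3), ('A: Python Tutorial', 1)]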

The Three-Stage Pipeline (How Google Actually Works)

Stage 1: Crawling (The Stalker Phase)

# Oversimplified crawler
import re
import time
import requests

def extract_links(html):
    # Naive href extraction; a real crawler would use an HTML parser
    return re.findall(r'href="(https?://[^"]+)"', html)

visited = set()
to_visit = ["https://example.com"]
index = {}

while to_visit:
    url = to_visit.pop()
    if url in visited:
        continue

    page = requests.get(url, timeout=10)
    visited.add(url)

    # Extract links once, queue the new ones
    links = extract_links(page.text)
    to_visit.extend(links)

    # Store in index
    index[url] = {
        'content': page.text,
        'links': links,
        'timestamp': time.time()
    }

    time.sleep(1)  # rate limiting: the lesson I learned the hard way (see below)

Real-world gotcha: Googlebot crawls billions of pages a day, and an unthrottled crawler can flatten a site. My first production site crashed because I didn't implement rate limiting. The CEO was... not happy.

Stage 2: Indexing (The Librarian Phase)

This is where search engines build their "inverted index" - a fancy term for "every word points to every page where it appears."

# Building an inverted index
# (tokenize() and get_context() are simple helpers: lowercase-and-split,
#  and "grab a few words around the match" for snippets)
from collections import defaultdict

inverted_index = defaultdict(list)

for doc_id, document in documents.items():
    words = tokenize(document.content)

    for position, word in enumerate(words):
        inverted_index[word].append({
            'doc_id': doc_id,
            'position': position,
            'context': get_context(words, position)
        })

# Now searching is instant:
# "python" → [doc1, doc5, doc89, doc234...]

Mind-blowing fact: Google's index is estimated at 100+ petabytes. That's 100,000,000 GB. Your entire life's photos? Maybe 500 GB.

Stage 3: Ranking (The Magic Phase)

This is where I spent most of my kitchen time. And it's not just PageRank anymore.

The Modern Ranking Formula (200+ Signals)

Google doesn't use a single algorithm. It's a committee of algorithms voting. Here are the big players:

1. TF-IDF (Term Frequency × Inverse Document Frequency)

The "does this page actually talk about what you're searching for" algorithm.

import math

def calculate_tfidf(term, document, all_documents):
    # Term Frequency: how often does the word appear in this document?
    tf = document.count(term) / len(document)

    # Inverse Document Frequency: is this word rare/special across the corpus?
    docs_with_term = sum(1 for doc in all_documents if term in doc)
    idf = math.log(len(all_documents) / (1 + docs_with_term))

    return tf * idf

# Example (every document is tokenized into a list of words):
doc = "python tutorial python programming python guide".split()
all_docs = [doc, "javascript tutorial".split(), "java guide".split()]

score = calculate_tfidf("python", doc, all_docs)
print(f"Score: {score:.4f}")  # High: "python" appears often here but rarely elsewhere

Translation: Common words like "the" score low. Specific words like "Django" score high. The word "python" in a Python tutorial? Very high.
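
A quick sanity check with the function above (toy corpus, tokenized with split()): a word that appears in every document can even go negative with this IDF smoothing, while a rare, on-topic word stays positive.

# "the" is in every document; "django" is in only one
docs = [
    "the django tutorial covers the basics".split(),
    "the javascript guide covers the basics".split(),
    "the java guide covers the basics".split(),
]

for term in ("the", "django"):
    print(term, round(calculate_tfidf(term, docs[0], docs), 4))
# the -0.0959    ← penalized: appears everywhere
# django 0.0676  ← rewarded: rare and on-topic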

2. BM25 (The Better TF-IDF)

TF-IDF has a problem: a page that says "python" 100 times doesn't deserve to rank 100x higher than one that says it 10 times.

def bm25_score(term, document, all_documents, k1=1.5, b=0.75):
    # Term frequency with diminishing returns
    tf = document.count(term)
    doc_length = len(document)
    avg_doc_length = sum(len(d) for d in all_documents) / len(all_documents)

    # IDF component
    docs_with_term = sum(1 for doc in all_documents if term in doc)
    idf = math.log((len(all_documents) - docs_with_term + 0.5) / (docs_with_term + 0.5))

    # The magic formula
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * (doc_length / avg_doc_length))

    return idf * (numerator / denominator)

This is what Elasticsearch uses by default. It's beautiful because:

  • First mention of "python" = big boost
  • 10th mention = tiny boost
  • 100th mention = almost nothing (the quick check below shows the saturation)
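
You can watch the saturation happen with the bm25_score function above (toy 100-word documents I made up):

# Score a document where "python" appears 1, 10, then 100 times
for tf_target in (1, 10, 100):
    doc = ["python"] * tf_target + ["tutorial"] * (100 - tf_target)
    docs = [doc, ["javascript"] * 100, ["java"] * 100]
    print(tf_target, round(bm25_score("python", doc, docs), 3))
# 1   0.511  ← the first mentions earn the most
# 10  1.110
# 100 1.258  ← 10x more mentions, barely any gain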

3. PageRank (The Popularity Contest)

def calculate_pagerank(graph, damping=0.85, iterations=100):
    """
    Graph = {page_id: [list_of_pages_it_links_to]}
    """
    num_pages = len(graph)
    pagerank = {page: 1.0 / num_pages for page in graph}

    for _ in range(iterations):
        new_ranks = {}

        for page in graph:
            # Base probability: random surfer teleports
            rank = (1 - damping) / num_pages

            # Add rank from incoming links
            for other_page, links in graph.items():
                if page in links:
                    rank += damping * (pagerank[other_page] / len(links))

            new_ranks[page] = rank

        pagerank = new_ranks

    return pagerank

# Example graph
web_graph = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['C']
}

ranks = calculate_pagerank(web_graph)
for page, score in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(f"{page}: {score:.4f}")

Output:

C: 0.3941  ← Most linked-to page wins
A: 0.3725
B: 0.1958
D: 0.0375  ← No incoming links: just the teleport baseline, (1 - 0.85) / 4

4. The Other 197 Signals

Google considers:

  • Freshness: News articles decay fast; documentation stays relevant (a decay sketch follows this list)
  • Click-through rate: If everyone skips your result, you drop
  • Dwell time: Users staying on page = good signal
  • Mobile-friendliness: 60% of searches are mobile
  • HTTPS: Security matters
  • Page speed: Slow sites rank lower
  • Domain authority: Stack Overflow > random-blog-2009.com
  • User location: "pizza" shows nearby results
  • Search history: Your past clicks influence future results
  • Entity recognition: Understanding "Python" (language) vs "python" (snake)
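
Freshness is the easiest of these to sketch. One common approach (my numbers, definitely not Google's) is an exponential decay multiplier with a half-life tuned per content type:

import time

def freshness_boost(published_ts, half_life_days=30.0):
    # Multiplier that halves every half_life_days
    age_days = (time.time() - published_ts) / 86400
    return 0.5 ** (age_days / half_life_days)

# The same 7-day-old page under two policies:
day = 86400
week_ago = time.time() - 7 * day
print(round(freshness_boost(week_ago, half_life_days=2), 3))    # news policy: ~0.088
print(round(freshness_boost(week_ago, half_life_days=365), 3))  # docs policy: ~0.987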

The Real-World Implementation

Here's a simplified but functional search engine:

from dataclasses import dataclass
from typing import List, Dict
import math

@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str
    score: float

class SimpleSearchEngine:
    def __init__(self):
        self.index = {}
        self.documents = {}
        self.pagerank = {}

    def add_document(self, doc_id: str, title: str, content: str, links: List[str]):
        """Index a document"""
        self.documents[doc_id] = {
            'title': title,
            'content': content,
            'links': links
        }

        # Build inverted index
        words = self._tokenize(content + " " + title)
        for word in words:
            if word not in self.index:
                self.index[word] = []
            self.index[word].append(doc_id)

    def calculate_pagerank(self):
        """Calculate PageRank for all documents"""
        graph = {doc_id: doc['links'] for doc_id, doc in self.documents.items()}
        # Use the pagerank function from above
        self.pagerank = calculate_pagerank(graph)

    def search(self, query: str, top_k: int = 10) -> List[SearchResult]:
        """Search and rank results"""
        words = self._tokenize(query)

        # Find candidate documents
        candidates = set()
        for word in words:
            if word in self.index:
                candidates.update(self.index[word])

        # Score each candidate
        scored_results = []
        for doc_id in candidates:
            doc = self.documents[doc_id]

            # Combine multiple signals
            text_score = self._calculate_bm25(words, doc)
            popularity_score = self.pagerank.get(doc_id, 0)

            # Weighted combination
            final_score = (
                0.7 * text_score +          # Content relevance
                0.3 * popularity_score       # Popularity
            )

            scored_results.append(SearchResult(
                url=doc_id,
                title=doc['title'],
                snippet=self._create_snippet(doc['content'], words),
                score=final_score
            ))

        # Sort by score and return top K
        scored_results.sort(key=lambda x: x.score, reverse=True)
        return scored_results[:top_k]

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization"""
        return text.lower().split()

    def _calculate_bm25(self, query_terms: List[str], document: Dict) -> float:
        """Simplified BM25 scoring"""
        content = document['content'] + " " + document['title']
        score = 0

        for term in query_terms:
            tf = content.lower().count(term)
            if tf > 0:
                # Simplified BM25
                score += math.log(1 + tf)

        return score

    def _create_snippet(self, content: str, query_terms: List[str]) -> str:
        """Create search result snippet"""
        words = content.split()
        for i, word in enumerate(words):
            if word.lower() in query_terms:
                start = max(0, i - 10)
                end = min(len(words), i + 10)
                return "..." + " ".join(words[start:end]) + "..."
        return content[:100] + "..."

# Usage
engine = SimpleSearchEngine()

# Add some documents
engine.add_document(
    "python-tutorial",
    "Python Tutorial for Beginners",
    "Learn Python programming from scratch. Python is easy to learn.",
    links=["django-guide"]
)

engine.add_document(
    "django-guide", 
    "Django Web Framework Guide",
    "Django is a Python web framework. Build web apps with Django.",
    links=["python-tutorial"]
)

engine.add_document(
    "javascript-intro",
    "JavaScript Introduction", 
    "JavaScript is a programming language for web browsers.",
    links=[]
)

# Calculate PageRank
engine.calculate_pagerank()

# Search
results = engine.search("python programming")
for i, result in enumerate(results, 1):
    print(f"{i}. {result.title}")
    print(f"   URL: {result.url}")
    print(f"   Score: {result.score:.4f}")
    print(f"   {result.snippet}")
    print()

Output:

1. Python Tutorial for Beginners
   URL: python-tutorial
   Score: 1.5556
   ...Learn Python programming from scratch. Python is easy to learn....

2. Django Web Framework Guide
   URL: django-guide
   Score: 0.5852
   ...Django is a Python web framework. Build web apps with Django....

3. JavaScript Introduction
   URL: javascript-intro
   Score: 0.5002
   ...JavaScript is a programming language for web browsers....

The Lessons I Learned (At 4 AM)

1. Relevance ≠ Popularity

Just because everyone links to a page doesn't mean it answers your question. Good search engines balance both:

Final Score = α × Content_Match + β × Popularity + γ × Freshness + ...

Tweak those Greek letters (α, β, γ) and you change everything.
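
A tiny demonstration with made-up scores: shift the weight from content match (α) to popularity (β), and the same two documents swap places.

# Two docs: one matches the query text well, one is merely popular
docs = {
    "niche-blog": {"content": 0.9, "popularity": 0.1},
    "big-site":   {"content": 0.4, "popularity": 0.9},
}

for alpha, beta in [(0.8, 0.2), (0.3, 0.7)]:
    winner = max(docs, key=lambda d: alpha * docs[d]["content"] + beta * docs[d]["popularity"])
    print(f"alpha={alpha}, beta={beta} -> winner: {winner}")
# alpha=0.8, beta=0.2 -> winner: niche-blog
# alpha=0.3, beta=0.7 -> winner: big-site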

2. Context Is King

The word "bank" means different things in:

  • "bank account"
  • "river bank"
  • "memory bank"

Modern search engines use semantic search (embeddings, BERT) to understand context. That's why searching "best restaurants nearby" works even though no page says "best restaurants nearby [your exact location]."
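
Under the hood, semantic search boils down to comparing embedding vectors with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real models like BERT output hundreds of dimensions):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Toy embeddings: phrases with similar meanings get similar vectors
vec = {
    "bank account":     [0.9, 0.1, 0.0],
    "checking account": [0.8, 0.2, 0.1],
    "river bank":       [0.1, 0.9, 0.0],
}

print(round(cosine(vec["bank account"], vec["checking account"]), 3))  # 0.984: same sense
print(round(cosine(vec["bank account"], vec["river bank"]), 3))        # 0.220: different sense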

3. Ranking Is Never "Done"

Google updates its algorithm 500-600 times per year. Why?

  • New spam techniques emerge
  • User behavior changes
  • Better ML models appear
  • Competitors innovate

4. The Cold Start Problem

New pages have no PageRank. How do they rank?

Google uses:

  • Domain authority (inheritance)
  • Content freshness bonus
  • Social signals
  • Manual editorial review (for important queries)

Practical Takeaways for Developers

If You're Building Search:

  1. Start with Elasticsearch or Meilisearch
   # Don't reinvent the wheel
   docker run -p 9200:9200 -e "discovery.type=single-node" elasticsearch:8.11.0

  2. Use BM25 as your baseline
    It's the industry standard for good reason.

  3. Add domain-specific signals (a hypothetical blend follows this list)
    E-commerce? Factor in price, reviews, stock.
    News site? Prioritize freshness.

  4. A/B test everything
    Users click your #1 result roughly 28% of the time. If that drops to 20%, investigate.
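
For the e-commerce case, here's a hypothetical blend (the function, signal names, and weights are all mine; those weights are exactly the knobs you'd A/B test):

import math

def product_score(text_score, rating, num_reviews, in_stock):
    # Reviews count logarithmically: 1,000 reviews isn't 100x better than 10
    review_signal = (rating / 5.0) * math.log(1 + num_reviews)
    stock_signal = 1.0 if in_stock else 0.0
    return 0.6 * text_score + 0.3 * review_signal + 0.1 * stock_signal

# A well-reviewed, in-stock product gets a boost over a pure text match
print(round(product_score(text_score=2.1, rating=4.5, num_reviews=320, in_stock=True), 3))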

If You're Optimizing for Search:

  1. Write for humans, optimize for robots

    • Clear headers (H1, H2)
    • Descriptive titles
    • Natural keyword usage
    • Internal linking
  2. Technical SEO matters

   <!-- Good -->
   <title>Python Tutorial for Beginners | Learn Programming</title>
   <meta name="description" content="Step-by-step Python tutorial...">

   <!-- Bad -->
   <title>Page 1</title>
  3. Get linked from authority sites
    One link from Stack Overflow > 100 links from spam sites.

The Visualization That Changed My Mind

Here's how I visualize ranking now:

Query: "python tutorial"

┌─────────────────────────────────────┐
│  Candidate Documents (1000s)        │
└─────────────────┬───────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
   ┌────▼─────┐      ┌─────▼────┐
   │ Content  │      │ Signals  │
   │ Matching │      │ (200+)   │
   └────┬─────┘      └─────┬────┘
        │                   │
        │  ┌───────────────┐│
        └──▶ ML Ranker     ◀┘
           │ (RankBrain)   │
           └───────┬───────┘
                   │
           ┌───────▼───────┐
           │ Top 10 Results│
           └───────────────┘

Try This Yourself

Weekend project: Build a tiny search engine for your blog or docs site.

Starter code (10-minute version):

# pip install whoosh
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
import os

# 1. Define schema
schema = Schema(title=TEXT(stored=True), content=TEXT)

# 2. Create index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)

# 3. Add documents
writer = ix.writer()
writer.add_document(
    title="Python Tutorial",
    content="Learn Python programming basics and advanced concepts"
)
writer.add_document(
    title="JavaScript Guide", 
    content="Modern JavaScript development techniques"
)
writer.commit()

# 4. Search
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("python")
    results = searcher.search(query)
    for result in results:
        print(result['title'])

Resources to Go Deeper