
Igor Nosatov
Last Tuesday, I was debugging why our app's search feature kept showing cat videos when users searched for "database optimization." Classic Tuesday. Instead of diving into Elasticsearch docs (again), I did something weird: I grabbed index cards, Post-its, and built a physical search engine on my kitchen table.
What I learned in those 3 hours changed how I think about every search query I write.
We always compare search engines to librarians. Wrong metaphor.
Search engines are more like that friend who always knows which restaurant to recommend. They don't just remember every restaurant (indexing). They also weigh which places people keep recommending, which ones fit what you're actually asking for, and which are still good today.
This is ranking. And it's where the magic happens.
Here's what I did. You can do this too (seriously, it helps).
[Card A: "Python Tutorial"] ◀────┐
   │  │                          │
   │  ▼                          │
   │ [Card B: "Django Guide"]    │
   │    │                        │
   ▼    ▼                        │
[Card C: "Web Dev Blog"] ────────┘
   ▲
   │
[Card D: "Random Page"]
When I searched for "Python learning," which card should win?
Naive me at 2 AM: "Card A says 'Python Tutorial' - that's literally the query!"
Slightly smarter me at 2:15 AM: "Wait... Card C has THREE cards linking to it. People trust it more."
I just rediscovered PageRank at my kitchen table.
# Oversimplified crawler
# (fetch, extract_links, and now are stand-ins for real HTTP/parsing helpers)
visited = set()
to_visit = ["https://example.com"]
index = {}

while to_visit:
    url = to_visit.pop()
    if url in visited:
        continue

    page = fetch(url)
    visited.add(url)

    # Extract links and queue them for crawling
    links = extract_links(page)
    for link in links:
        to_visit.append(link)

    # Store in index
    index[url] = {
        'content': page.text,
        'links': links,
        'timestamp': now()
    }
Real-world gotcha: crawlers are relentless. Googlebot alone crawls billions of pages a day, and it's far from the only bot out there. My first production site crashed because I didn't implement rate limiting. The CEO was... not happy.
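If you're the one being crawled, a simple sliding-window throttle goes a long way. Here's a minimal sketch; the window size and request cap are arbitrary numbers I picked for illustration:
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # arbitrary sliding window
MAX_REQUESTS = 20     # arbitrary cap per client per window

request_log = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    now = time.monotonic()
    log = request_log[client_ip]
    # Evict timestamps that fell out of the sliding window
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS:
        return False  # too many requests: reject (e.g., with HTTP 429)
    log.append(now)
    return True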
This is where search engines build their "inverted index" - a fancy term for "every word points to every page where it appears."
# Building an inverted index
inverted_index = {}

for doc_id, document in documents.items():
    words = tokenize(document.content)
    for position, word in enumerate(words):
        if word not in inverted_index:
            inverted_index[word] = []
        inverted_index[word].append({
            'doc_id': doc_id,
            'position': position,
            'context': get_context(words, position)
        })

# Now searching is instant:
# "python" → [doc1, doc5, doc89, doc234, ...]
Mind-blowing fact: Google's index is estimated at 100+ petabytes. That's 100,000,000 GB. Your entire life's photos? Maybe 500 GB.
This is where I spent most of my kitchen time. And it's not just PageRank anymore.
Google doesn't use a single algorithm. It's a committee of algorithms voting. Here are the big players:
The "does this page actually talk about what you're searching for" algorithm.
import math

def calculate_tfidf(term, document, all_documents):
    # Term Frequency: how often does the word appear in this document?
    tf = document.count(term) / len(document)

    # Inverse Document Frequency: is this word rare/special across the corpus?
    docs_with_term = sum(1 for doc in all_documents if term in doc)
    idf = math.log(len(all_documents) / (1 + docs_with_term))

    return tf * idf

# Example (documents are tokenized into word lists):
doc = "python tutorial python programming python guide".split()
all_docs = [doc, "javascript tutorial".split(), "java guide".split()]
score = calculate_tfidf("python", doc, all_docs)
print(f"Score: {score}")  # High because "python" appears often but not everywhere
Translation: Common words like "the" score low. Specific words like "Django" score high. The word "python" in a Python tutorial? Very high.
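You can check that claim with the function above. A quick toy run (the documents are made up for the example):
# "the" appears in every document; "django" in only one
docs = [
    "the django tutorial for the web".split(),
    "the python guide".split(),
    "the java handbook".split(),
]

print(calculate_tfidf("the", docs[0], docs))     # ~ -0.10: penalized, it's everywhere
print(calculate_tfidf("django", docs[0], docs))  # ~ +0.07: rare and specific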
TF-IDF has a problem: a page that says "python" 100 times doesn't deserve to rank 100x higher than one that says it 10 times. BM25 fixes that with diminishing returns.
def bm25_score(term, document, all_documents, k1=1.5, b=0.75):
    # Term frequency with diminishing returns
    tf = document.count(term)
    doc_length = len(document)
    avg_doc_length = sum(len(d) for d in all_documents) / len(all_documents)

    # IDF component
    docs_with_term = sum(1 for doc in all_documents if term in doc)
    idf = math.log((len(all_documents) - docs_with_term + 0.5) / (docs_with_term + 0.5))

    # The magic formula: extra occurrences of a term help less and less
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * (doc_length / avg_doc_length))
    return idf * (numerator / denominator)
This is what Elasticsearch uses by default. It's beautiful because term frequency saturates (the tenth "python" adds far less than the second), long documents don't get an unfair edge (that's the b length-normalization term), and you can tune both behaviors with just two knobs, k1 and b.
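You can watch the saturation directly. A toy check using the function above (the corpus is synthetic; "python" appears in 2 of 5 background documents, so its IDF stays positive):
# Background corpus: five 100-word documents, "python" in two of them
corpus = [
    ("python tutorial " + "filler " * 98).split(),
    ("python guide " + "filler " * 98).split(),
    ("java " * 100).split(),
    ("rust " * 100).split(),
    ("go " * 100).split(),
]

# Score 100-word documents that repeat "python" 1, 2, 10, and 100 times
for n in (1, 2, 10, 100):
    doc = ["python"] * n + ["filler"] * (100 - n)
    print(n, round(bm25_score("python", doc, corpus), 3))

# 1   0.336
# 2   0.481
# 10  0.731
# 100 0.829  ← 100x the occurrences, barely 2.5x the score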
The next big player: PageRank, the "who vouches for this page" algorithm.
def calculate_pagerank(graph, damping=0.85, iterations=100):
    """
    graph = {page_id: [list_of_pages_it_links_to]}
    """
    num_pages = len(graph)
    pagerank = {page: 1.0 / num_pages for page in graph}

    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Base probability: the random surfer teleports to any page
            rank = (1 - damping) / num_pages
            # Add rank flowing in from pages that link here
            for other_page, links in graph.items():
                if page in links:
                    rank += damping * (pagerank[other_page] / len(links))
            new_ranks[page] = rank
        pagerank = new_ranks

    return pagerank

# Example graph
web_graph = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['C']
}

ranks = calculate_pagerank(web_graph)
for page, score in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(f"{page}: {score:.4f}")
Output:
C: 0.3942 ← Most linked-to page wins
A: 0.3726 ← C pours all of its rank into A
B: 0.1958
D: 0.0375 ← Nothing links to D
Google considers many more link-related signals than raw PageRank (anchor text, link quality, how topically related the linking page is), but the core intuition stands: links are votes.
Here's a simplified but functional search engine:
from dataclasses import dataclass
from typing import List, Dict
import math

@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str
    score: float

class SimpleSearchEngine:
    def __init__(self):
        self.index = {}
        self.documents = {}
        self.pagerank = {}

    def add_document(self, doc_id: str, title: str, content: str, links: List[str]):
        """Index a document"""
        self.documents[doc_id] = {
            'title': title,
            'content': content,
            'links': links
        }
        # Build inverted index
        words = self._tokenize(content + " " + title)
        for word in words:
            if word not in self.index:
                self.index[word] = []
            self.index[word].append(doc_id)

    def calculate_pagerank(self):
        """Calculate PageRank for all documents"""
        graph = {doc_id: doc['links'] for doc_id, doc in self.documents.items()}
        # Use the pagerank function from above
        self.pagerank = calculate_pagerank(graph)

    def search(self, query: str, top_k: int = 10) -> List[SearchResult]:
        """Search and rank results"""
        words = self._tokenize(query)

        # Find candidate documents
        candidates = set()
        for word in words:
            if word in self.index:
                candidates.update(self.index[word])

        # Score each candidate
        scored_results = []
        for doc_id in candidates:
            doc = self.documents[doc_id]

            # Combine multiple signals
            text_score = self._calculate_bm25(words, doc)
            popularity_score = self.pagerank.get(doc_id, 0)

            # Weighted combination
            final_score = (
                0.7 * text_score +        # Content relevance
                0.3 * popularity_score    # Popularity
            )

            scored_results.append(SearchResult(
                url=doc_id,
                title=doc['title'],
                snippet=self._create_snippet(doc['content'], words),
                score=final_score
            ))

        # Sort by score and return top K
        scored_results.sort(key=lambda x: x.score, reverse=True)
        return scored_results[:top_k]

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization"""
        return text.lower().split()

    def _calculate_bm25(self, query_terms: List[str], document: Dict) -> float:
        """Simplified BM25 scoring"""
        content = document['content'] + " " + document['title']
        score = 0
        for term in query_terms:
            tf = content.lower().count(term)
            if tf > 0:
                # Simplified BM25: log-dampened term frequency
                score += math.log(1 + tf)
        return score

    def _create_snippet(self, content: str, query_terms: List[str]) -> str:
        """Create a snippet around the first query-term match"""
        words = content.split()
        for i, word in enumerate(words):
            if word.lower() in query_terms:
                start = max(0, i - 10)
                end = min(len(words), i + 10)
                return "..." + " ".join(words[start:end]) + "..."
        return content[:100] + "..."
# Usage
engine = SimpleSearchEngine()

# Add some documents
engine.add_document(
    "python-tutorial",
    "Python Tutorial for Beginners",
    "Learn Python programming from scratch. Python is easy to learn.",
    links=["django-guide"]
)
engine.add_document(
    "django-guide",
    "Django Web Framework Guide",
    "Django is a Python web framework. Build web apps with Django.",
    links=["python-tutorial"]
)
engine.add_document(
    "javascript-intro",
    "JavaScript Introduction",
    "JavaScript is a programming language for web browsers.",
    links=[]
)

# Calculate PageRank
engine.calculate_pagerank()

# Search
results = engine.search("python programming")
for i, result in enumerate(results, 1):
    print(f"{i}. {result.title}")
    print(f"   URL: {result.url}")
    print(f"   Score: {result.score:.4f}")
    print(f"   {result.snippet}")
    print()
Output:
1. Python Tutorial for Beginners
   URL: python-tutorial
   Score: 1.5556
   ...Learn Python programming from scratch. Python is easy to learn...

2. Django Web Framework Guide
   URL: django-guide
   Score: 0.5852
   ...Django is a Python web framework. Build web apps with Django...

3. JavaScript Introduction
   URL: javascript-intro
   Score: 0.5002
   ...JavaScript is a programming language for web browsers...
Just because everyone links to a page doesn't mean it answers your question. Good search engines balance both:
Final Score = α × Content_Match + β × Popularity + γ × Freshness + ...
Tweak those Greek letters (α, β, γ) and you change everything.
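To make that concrete, here's a toy version (the signal scores are invented for the example):
# Two pages with opposite strengths
pages = {
    "deep-dive":  {"content": 0.9, "popularity": 0.2, "freshness": 0.1},
    "viral-post": {"content": 0.3, "popularity": 0.9, "freshness": 0.8},
}

def final_score(p, alpha, beta, gamma):
    return alpha * p["content"] + beta * p["popularity"] + gamma * p["freshness"]

# Content-heavy weights: the deep dive wins
print(max(pages, key=lambda k: final_score(pages[k], 0.7, 0.2, 0.1)))  # deep-dive
# Popularity-heavy weights: the viral post wins
print(max(pages, key=lambda k: final_score(pages[k], 0.2, 0.6, 0.2)))  # viral-post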
The word "bank" means different things in:
Modern search engines use semantic search (embeddings, BERT) to understand context. That's why searching "best restaurants nearby" works even though no page says "best restaurants nearby [your exact location]."
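Under the hood, semantic search compares vectors instead of strings. A sketch with made-up 3-dimensional embeddings (real models use hundreds of dimensions, but the similarity math is the same):
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: related words point in similar directions
vectors = {
    "restaurant": [0.9, 0.1, 0.2],
    "diner":      [0.8, 0.2, 0.3],
    "database":   [0.1, 0.9, 0.7],
}

print(cosine_similarity(vectors["restaurant"], vectors["diner"]))     # ~0.98: near-synonyms
print(cosine_similarity(vectors["restaurant"], vectors["database"]))  # ~0.30: unrelated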
Google updates its algorithm 500-600 times per year. Why? Because spammers adapt, user behavior shifts, and the web itself never stops changing.
New pages have no PageRank. How do they rank? Google leans on other signals while links accumulate: content relevance, site-level trust, and early user engagement.
# Don't reinvent the wheel
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.0
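Once it's running, indexing and searching are plain HTTP calls. A sketch with the requests library (assumes the local, security-disabled setup above; the index name "articles" is arbitrary):
import requests

# Index a document (refresh=wait_for makes it searchable immediately)
requests.post(
    "http://localhost:9200/articles/_doc?refresh=wait_for",
    json={"title": "Python Tutorial", "content": "Learn Python programming from scratch"},
)

# Search it: Elasticsearch scores with BM25 by default
resp = requests.post(
    "http://localhost:9200/articles/_search",
    json={"query": {"match": {"content": "python"}}},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])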
1. Use BM25 as your baseline. It's the industry standard for good reason.
2. Add domain-specific signals. E-commerce? Factor in price, reviews, and stock. News site? Prioritize freshness.
3. A/B test everything. Users click your #1 result roughly 28% of the time; if that drops to 20%, investigate.
4. Write for humans, optimize for robots. Technical SEO matters:
<!-- Good -->
<title>Python Tutorial for Beginners | Learn Programming</title>
<meta name="description" content="Step-by-step Python tutorial...">
<!-- Bad -->
<title>Page 1</title>
Here's how I visualize ranking now:
Query: "python tutorial"
┌─────────────────────────────────────┐
│     Candidate Documents (1000s)     │
└─────────────────┬───────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
   ┌────▼─────┐       ┌─────▼────┐
   │ Content  │       │ Signals  │
   │ Matching │       │  (200+)  │
   └────┬─────┘       └─────┬────┘
        │                   │
        │   ┌─────────────┐ │
        └──▶│  ML Ranker  │◀┘
            │ (RankBrain) │
            └──────┬──────┘
                   │
           ┌───────▼───────┐
           │ Top 10 Results│
           └───────────────┘
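That funnel is easy to mimic in miniature: retrieve candidates with something cheap, then spend expensive scoring only on the survivors. A sketch (the index shape matches the inverted word → doc_ids mapping from earlier; score_fn stands in for whatever ranker you plug in):
def retrieve(query_terms, index, limit=1000):
    # Stage 1 (cheap): any document containing any query term
    candidates = set()
    for term in query_terms:
        candidates.update(index.get(term, []))
    return list(candidates)[:limit]

def rerank(candidates, score_fn, top_k=10):
    # Stage 2 (expensive): run the real scorer only on the shortlist
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]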
Weekend project: Build a tiny search engine for your blog or docs site.
Starter code (10-minute version):
# pip install whoosh
import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

# 1. Define schema
schema = Schema(title=TEXT(stored=True), content=TEXT)

# 2. Create index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)

# 3. Add documents
writer = ix.writer()
writer.add_document(
    title="Python Tutorial",
    content="Learn Python programming basics and advanced concepts"
)
writer.add_document(
    title="JavaScript Guide",
    content="Modern JavaScript development techniques"
)
writer.commit()

# 4. Search
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("python")
    results = searcher.search(query)
    for result in results:
        print(result['title'])