Python-T Point: Can you build a fast, scalable semantic search with Pinecone and FastAPI?
Yes, and you don't need a team of ML engineers. With semantic search using Pinecone and FastAPI, you can index unstructured text, serve low-latency queries, and deploy to production in hours. Most implementations treat embeddings as opaque vectors without considering performance trade-offs. This becomes a problem when recall drops at scale or latency spikes under load. Fix it by designing the system with data structure and query behavior in mind.
An embedding is a fixed-length vector that maps semantic meaning into a continuous space, enabling similarity search via geometric distance. The transformation is performed by a pre-trained transformer model like all-MiniLM-L6-v2 from Sentence Transformers, which maps variable-length text into a 384-dimensional vector space.
The model tokenizes input text, processes it through transformer layers, then applies mean pooling over the final hidden states to generate a single vector. Because the training objective includes contrastive learning on sentence pairs, semantically similar phrases, such as "How do I reset a password?" and "Forgot my login", are embedded close together.
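The pooling step can be illustrated without loading a model. The sketch below applies masked mean pooling to a toy list of per-token vectors. The function name `mean_pool` and the tiny 3-dimensional vectors are illustrative assumptions, not the library's internals.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padding positions.

    token_vectors: list of equal-length float lists, one per token.
    attention_mask: list of 1 (real token) or 0 (padding), same length.
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy hidden states: two real tokens followed by one padding token.
hidden = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

However many tokens the input has, the output has the same length as a single token vector, which is why variable-length text ends up as a fixed-size embedding.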
Distance in this space correlates with semantic similarity. Cosine similarity, which measures angular difference, is typically used instead of Euclidean distance because it's invariant to vector magnitude.
from sentence_transformers import SentenceTransformer

# Load a lightweight but effective model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embedding for a query
sentence = "How to deploy FastAPI on Kubernetes"
embedding = model.encode(sentence)
print(type(embedding), embedding.shape)
<class 'numpy.ndarray'> (384,)
The output is a 384-dimensional numpy array. These embeddings must be computed once per document and stored for search. Query embeddings are generated on-demand and compared against indexed vectors.
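To make the magnitude-invariance of cosine similarity concrete, here is a small standard-library sketch. The vectors are made-up three-dimensional examples rather than real embeddings, and the helper names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [0.2, 0.5, 0.1]
scaled = [10 * x for x in v]   # same direction, 10x the magnitude
other = [0.5, -0.2, 0.4]       # a different direction

print(round(cosine_similarity(v, scaled), 6))  # 1.0
print(euclidean_distance(v, scaled) > 1.0)     # True
print(cosine_similarity(v, other) < 0.5)       # True
```

Scaling a vector leaves its cosine similarity to the original at 1.0 while the Euclidean distance grows, which is the property the text relies on.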
"Semantic search isn't about keywords; it's about intent. The vector space learns what users mean, not just what they type."
Traditional databases are not optimized for high-dimensional vector similarity search. A full scan over 1 million vectors at 384 floats per vector requires ~1.5 GB of data movement and O(n) comparisons, far too slow for interactive use.
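The arithmetic behind that figure, and what an exact scan looks like, can be sketched as follows. This assumes 4-byte float32 storage. The tiny random corpus and the helper names `normalize` and `brute_force_top_k` are illustrative only.

```python
import heapq
import math
import random

NUM_VECTORS = 1_000_000
DIMENSION = 384
BYTES_PER_FLOAT32 = 4

scan_bytes = NUM_VECTORS * DIMENSION * BYTES_PER_FLOAT32
print(f"{scan_bytes / 1e9:.3f} GB per full scan")  # 1.536 GB per full scan


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def brute_force_top_k(query, vectors, k):
    """Exact search: score every stored vector, then keep the k best."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), vec_id)
        for vec_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored)


# A small random corpus; every query touches all of it, which is the O(n) cost.
rng = random.Random(0)
corpus = {
    f"doc_{i}": normalize([rng.uniform(-1, 1) for _ in range(8)])
    for i in range(1000)
}
top = brute_force_top_k(corpus["doc_42"], corpus, k=3)
print(top[0][1])  # doc_42
```

The scan is exact, but its cost grows linearly with the corpus, which is the motivation for approximate indexes.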
Pinecone uses approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World) to achieve search in roughly O(log n) time. HNSW builds a multi-layer graph structure that allows fast navigation to nearby vectors, trading a small reduction in recall for orders-of-magnitude lower latency.
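The core idea of graph-based search can be shown with a greedy walk over a single-layer proximity graph. This is a deliberately simplified sketch, not HNSW itself and not Pinecone's implementation: it has no layer hierarchy and no candidate list, and the toy graph (points on a line with a few long-range links) is an assumption chosen so the walk is easy to follow.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(graph, points, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops at a local minimum: a node none of whose neighbors is closer.
    Returns (node, number_of_distance_computations).
    """
    current = entry
    current_dist = distance(points[current], query)
    evaluations = 1
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = distance(points[neighbor], query)
            evaluations += 1
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, evaluations
        current, current_dist = best, best_dist


# 64 points on a line; each node links to its adjacent nodes (short edges)
# and to the nodes 8 steps away (long edges that act as shortcuts).
points = {i: [float(i)] for i in range(64)}
graph = {
    i: [j for j in (i - 1, i + 1, i - 8, i + 8) if 0 <= j < 64]
    for i in range(64)
}

nearest, evaluations = greedy_search(graph, points, entry=0, query=[50.3])
print(nearest)                    # 50
print(evaluations < len(points))  # True
```

The long edges let the walk cross most of the space in a few hops before the short edges refine the result, so fewer points are compared than in a full scan. On real data a pure greedy walk can stop at a local minimum, which is one reason production indexes trade some recall for speed.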
Distances are computed using cosine similarity, dot product, or Euclidean distance, depending on index configuration. The service exposes its API over HTTPS, with both REST and gRPC clients available, and each vector is stored alongside metadata for retrieval.
Install the Pinecone client:
$ pip install pinecone-client
Collecting pinecone-client
  Downloading pinecone_client-3.1.0-py3-none-any.whl (48 kB)
...
Successfully installed pinecone-client-3.1.0
Initialize and create an index:
from pinecone import Pinecone, PodSpec

# Initialize connection
# (pinecone-client 3.x replaces the older pinecone.init() call with a client object)
pc = Pinecone(api_key="your-api-key")

# Create index if it doesn't exist
if 'semantic-search' not in pc.list_indexes().names():
    pc.create_index(
        name='semantic-search',
        dimension=384,  # Match embedding size
        metric='cosine',
        spec=PodSpec(environment="us-west1-gcp")
    )

The dimension must exactly match the embedding size (384 for all-MiniLM-L6-v2). The metric should be cosine for sentence embeddings, as angular similarity reflects semantic alignment better than magnitude-sensitive metrics.

To index content, generate embeddings and upsert them as tuples of (id, vector, metadata):

index = pc.Index('semantic-search')

documents = [
    {
        "id": "doc_1",
        "text": "How to deploy FastAPI with Docker",
        "url": "/guides/fastapi-docker"
    },
    {
        "id": "doc_2",
        "text": "Kubernetes secrets management best practices",
        "url": "/guides/k8s-secrets"
    }
]

# Generate and upsert vectors
vectors = []
for doc in documents:
    vector = model.encode(doc["text"]).tolist()
    vectors.append((doc["id"], vector, {"text": doc["text"], "url": doc["url"]}))

index.upsert(vectors=vectors)
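The upsert operation inserts new vectors or overwrites existing ones by ID. The call returns the number of vectors it accepted; newly written vectors can take a moment to become visible to queries, so a read issued immediately after a write may not reflect it yet.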
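The insert-or-overwrite behavior can be modeled with a plain dictionary. `InMemoryIndex` below is a hypothetical stand-in written for illustration; it is not part of the Pinecone client and only mimics the by-ID semantics described above.

```python
class InMemoryIndex:
    """A hypothetical stand-in for a vector index, keyed by vector ID."""

    def __init__(self):
        self._records = {}

    def upsert(self, vectors):
        # Each item is (id, values, metadata); a repeated ID replaces the old record.
        for vec_id, values, metadata in vectors:
            self._records[vec_id] = {
                "values": list(values),
                "metadata": dict(metadata),
            }
        return {"upserted_count": len(vectors)}

    def fetch(self, vec_id):
        return self._records[vec_id]

    def count(self):
        return len(self._records)


toy_index = InMemoryIndex()
toy_index.upsert([
    ("doc_1", [0.1, 0.2], {"text": "first version"}),
    ("doc_2", [0.3, 0.4], {"text": "another document"}),
])
# Upserting doc_1 again overwrites it instead of adding a third record.
toy_index.upsert([("doc_1", [0.9, 0.8], {"text": "second version"})])

print(toy_index.count())                             # 2
print(toy_index.fetch("doc_1")["metadata"]["text"])  # second version
```

Because writes are keyed by ID, re-running an indexing job is idempotent as long as document IDs are stable.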
print(index.describe_index_stats())
{'dimension': 384, 'index_fullness': 0.0, 'namespaces': {'': {'vector_count': 2}}, 'total_vector_count': 2}
The index now contains two vectors. Metadata is stored alongside each vector and can be filtered on during queries. Avoid storing large fields in metadata, since it increases transfer size and query latency.
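One way to act on both points is to build compact metadata at indexing time and pass a filter at query time. The helpers `build_metadata` and `build_query_kwargs` and the 200-character snippet limit are assumptions made for this sketch; `$eq` is Pinecone's equality operator for metadata filters.

```python
def build_metadata(doc, max_text_chars=200):
    """Keep metadata small: a truncated snippet plus the fields needed for display."""
    return {"text": doc["text"][:max_text_chars], "url": doc["url"]}


def build_query_kwargs(query_vector, top_k=5, url=None):
    """Assemble keyword arguments for index.query, optionally filtering on url."""
    kwargs = {"vector": query_vector, "top_k": top_k, "include_metadata": True}
    if url is not None:
        # Restrict matches to vectors whose metadata field "url" equals this value.
        kwargs["filter"] = {"url": {"$eq": url}}
    return kwargs


long_doc = {"text": "x" * 5000, "url": "/guides/fastapi-docker"}
metadata = build_metadata(long_doc)
print(len(metadata["text"]))  # 200

kwargs = build_query_kwargs([0.0] * 384, top_k=3, url="/guides/fastapi-docker")
print(kwargs["filter"])  # {'url': {'$eq': '/guides/fastapi-docker'}}
```

The resulting dictionary would be passed as `index.query(**kwargs)`. The full document text can stay in a primary data store and be looked up by ID after the search returns.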
A production search endpoint must respond in under 200ms. This requires minimizing blocking operations, leveraging async I/O, and reusing embeddings where possible.
FastAPI supports this through Pydantic request validation and async route handlers. The endpoint accepts a query string, encodes it, searches Pinecone, and returns ranked results.
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/search")
async def semantic_search(request: SearchRequest):
    # model and index are the objects created in the earlier steps.
    # Both calls below are blocking, so they run in a worker thread
    # to keep the event loop free for other requests.

    # Step 1: Encode the query
    query_vector = (await run_in_threadpool(model.encode, request.query)).tolist()

    # Step 2: Query Pinecone
    result = await run_in_threadpool(
        index.query,
        vector=query_vector,
        top_k=request.top_k,
        include_metadata=True
    )

    # Step 3: Format response
    matches = []
    for match in result['matches']:
        matches.append({
            "id": match['id'],
            "score": match['score'],
            "text": match['metadata']['text'],
            "url": match['metadata']['url']
        })
    return {"results": matches}

# Run with: uvicorn main:app --reload
Start the server:
$ uvicorn main:app --reload
INFO: Uvicorn running on http://127.0.0.1:8000
INFO: Application startup complete.
INFO: reloading active
Query the endpoint:
$ curl -X POST http://127.0.0.1:8000/search \
    -H "Content-Type: application/json" \
    -d '{"query": "how to deploy a Python API"}'
{
  "results": [
    {
      "id": "doc_1",
      "score": 0.876,
      "text": "How to deploy FastAPI with Docker",
      "url": "/guides/fastapi-docker"
    }
  ]
}
The response includes cosine similarity scores. Higher values indicate greater relevance. Metadata filtering and namespace isolation can be added later for multi-tenancy or domain-specific routing.
Approximately 20% of user queries repeat within short intervals. Cache results using Redis to avoid recomputing embeddings and reduce Pinecone call volume.
import json

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

@app.post("/search")
async def semantic_search(request: SearchRequest):
    cache_key = f"search:{request.query}:{request.top_k}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # ... (compute result)

    # Cache for 10 minutes
    r.setex(cache_key, 600, json.dumps({"results": matches}))
    return {"results": matches}
With caching, repeated queries drop from ~150ms to ~10ms. The embedding computation accounts for most of the saved latency, as the model inference is the slowest step in the chain.
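Cache hit rates depend on how keys are built. The sketch below normalizes case and whitespace before hashing, so trivially different spellings of a query share one entry, and pairs that with a minimal in-process TTL cache. `make_cache_key` and `TTLCache` are illustrative names; the class only imitates the expiry behavior of Redis `SETEX` and is not a replacement for it.

```python
import hashlib
import time


def make_cache_key(query, top_k):
    """Normalize case and whitespace so trivially different queries share a key."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"search:{digest}:{top_k}"


class TTLCache:
    """Minimal in-process cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def set(self, key, value, ttl_seconds):
        self._entries[key] = (self._clock() + ttl_seconds, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value


# A controllable clock makes the expiry behavior easy to demonstrate.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])

key = make_cache_key("  Deploy   FastAPI ", 5)
cache.set(key, {"results": []}, ttl_seconds=600)

print(make_cache_key("deploy fastapi", 5) == key)  # True
print(cache.get(key))                              # {'results': []}
now[0] = 601.0
print(cache.get(key))                              # None
```

Hashing the normalized query also keeps keys at a fixed length regardless of how long the user's input is.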
Correctness matters. Use recall@k, defined here as the fraction of queries for which at least one relevant result appears in the top k results (a definition sometimes called hit rate@k).
Construct a test set of query-ground truth pairs:
test_cases = [
    {"query": "deploy FastAPI", "relevant_ids": ["doc_1"]},
    {"query": "manage secrets in Kubernetes", "relevant_ids": ["doc_2"]}
]
Compute recall@5:
def evaluate_recall(test_cases, top_k=5):
    hits = 0
    for case in test_cases:
        result = index.query(
            vector=model.encode(case["query"]).tolist(),
            top_k=top_k
        )
        returned_ids = {match['id'] for match in result['matches']}
        # Count the query as a hit if any relevant ID was returned
        if any(rid in returned_ids for rid in case['relevant_ids']):
            hits += 1
    return hits / len(test_cases)

print(f"Recall@5: {evaluate_recall(test_cases):.2f}")
Recall@5: 1.00
A score of 1.00 means every test query had at least one relevant document in its top 5. Two test cases say little on their own, so expand the test set to hundreds of labeled queries for meaningful benchmarking. For production systems, aim for recall@5 ≥ 0.90.
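When queries have several relevant documents, the "at least one hit" measure and the stricter "fraction of relevant items retrieved" measure diverge. The sketch below computes both from plain ID lists, so it runs without an index. The function names and the sample IDs are made up for illustration.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant ID in the top k."""
    hits = sum(
        1 for ids, rel in zip(results, relevant) if set(ids[:k]) & set(rel)
    )
    return hits / len(results)


def recall_at_k(results, relevant, k):
    """Mean fraction of each query's relevant IDs that appear in the top k."""
    per_query = [
        len(set(ids[:k]) & set(rel)) / len(rel)
        for ids, rel in zip(results, relevant)
    ]
    return sum(per_query) / len(per_query)


# Ranked IDs returned for two queries, and the labeled relevant IDs for each.
results = [["doc_1", "doc_3"], ["doc_4", "doc_2"]]
relevant = [["doc_1", "doc_5"], ["doc_2"]]

print(hit_rate_at_k(results, relevant, k=2))  # 1.0
print(recall_at_k(results, relevant, k=2))    # 0.75
```

The first query retrieves only one of its two relevant documents, so the stricter measure drops to 0.75 while the hit rate stays at 1.0. Which one to report depends on whether users need one good answer or full coverage.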
Building semantic search with Pinecone and FastAPI is not integration work; it's systems design. The performance and accuracy depend on understanding each component's role: embedding models for semantic representation, vector databases for efficient similarity search, and API frameworks for low-latency delivery.
The stack is accessible, but success requires attention to detail. Model choice affects embedding quality and compute cost. Index parameters determine recall and speed. Caching reduces latency variance. These aren't incidental; they define the user experience. Handle them deliberately, and you'll ship a search system that works, not just one that runs.
Yes, but only for low-traffic applications. The free tier supports up to 100MB of storage and limited queries per second. For higher load, upgrade to a paid plan with dedicated pods.
For multilingual support, use paraphrase-multilingual-MiniLM-L12-v2 from Sentence Transformers. It supports 50+ languages and maintains strong cross-lingual similarity.
Re-encode the updated document and call upsert() with the same ID. Pinecone will overwrite the old vector. For bulk updates, batch the upserts to reduce latency.
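Batching can be as simple as slicing the list of vectors into fixed-size chunks and upserting each chunk in turn. In this sketch the batch size of 100 is an assumption to tune, `chunked` is an illustrative helper, and `fake_upsert` stands in for `index.upsert(vectors=batch)` so the example runs without a Pinecone account.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sent_batches = []


def fake_upsert(batch):
    # Stand-in for index.upsert(vectors=batch); records what would be sent.
    sent_batches.append(batch)


# 250 placeholder (id, vector, metadata) tuples.
pending = [(f"doc_{i}", [0.0] * 4, {"url": f"/docs/{i}"}) for i in range(250)]

for batch in chunked(pending, batch_size=100):
    fake_upsert(batch)

print([len(batch) for batch in sent_batches])  # [100, 100, 50]
```

Fewer, larger requests amortize network overhead, while a bounded batch size keeps each request payload within the service's limits.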