Over the past month, I've been systematically scanning open source repositories for security vulnerabilities. Not hypothetical ones — real, exploitable bugs that I've reported through bounty programs.
After scanning 100+ repos across ML frameworks, web servers, and developer tools, the same patterns keep showing up. Here's what I found, with real code examples.
I use a combination of Semgrep rules, custom Python scripts, and manual code review. The workflow:

1. Run Semgrep with security rulesets to catch the obvious issues.
2. Run custom pattern matching for ML-specific sinks (deserialization, archive extraction, dynamic execution).
3. Manually trace user input from entry points to dangerous sinks.

If you want to automate step 2, I built a Security Scanner Skill for Claude Code that runs 50+ checks in one pass.
## Pattern 1: Zip Slip

Found in: 23% of repos that handle file uploads

The classic: extracting a zip or tar archive without validating that entry paths stay inside the target directory.
```python
import zipfile

# VULNERABLE — real pattern from a top ML framework
def extract_archive(archive_path, dest_dir):
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)  # No path validation!
```
An attacker crafts a zip with entries like `../../etc/cron.d/backdoor`. The fix:
```python
import os
import zipfile

def safe_extract(archive_path, dest_dir):
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            # Compare against the prefix plus a separator, otherwise
            # "/data-evil" would pass a check against "/data".
            if not target.startswith(dest_root + os.sep):
                raise ValueError(f"Path traversal detected: {info.filename}")
        zf.extractall(dest_root)
```
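A quick way to sanity-check the fix: build a malicious archive in memory and confirm the guard trips. This is a minimal sketch; `safe_extract` here repeats the logic above so the snippet runs standalone.

```python
import io
import os
import tempfile
import zipfile

def safe_extract(archive, dest_dir):
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            if not target.startswith(dest_root + os.sep):
                raise ValueError(f"Path traversal detected: {info.filename}")
        zf.extractall(dest_root)

# Craft an in-memory zip whose entry escapes the destination directory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("../evil.txt", "pwned")

blocked = False
with tempfile.TemporaryDirectory() as dest:
    try:
        safe_extract(buf, dest)
    except ValueError as exc:
        blocked = True
        print(exc)  # Path traversal detected: ../evil.txt
```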
I found two distinct zip slip vulnerabilities in a single ML experiment tracking framework — both in functions that extract model artifacts. Combined bounty value: $3,000.
## Pattern 2: Pickle Deserialization

Found in: 41% of ML repos

This is the big one in ML. `pickle.load()` executes arbitrary Python code during deserialization. Everyone knows this. Nobody fixes it.
```python
import pickle

# VULNERABLE — seen in multiple model serving frameworks
def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)  # RCE if attacker controls the file
```
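To see why this is remote code execution, note that pickle lets any object name a callable to invoke at load time via `__reduce__`. A harmless demonstration, using `os.getcwd` where an attacker would put something destructive:

```python
import os
import pickle

class Payload:
    def __reduce__(self):
        # pickle will call os.getcwd() at load time; an attacker would
        # substitute any importable callable plus arguments here.
        return (os.getcwd, ())

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the callable runs during deserialization
print(result)
```

No `Payload` class needs to exist on the victim's side; the pickle stream itself carries the instruction to import and call the target.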
I discovered 31 distinct techniques to bypass the most popular ML model security scanner. The scanner blocks os.system and subprocess, but misses:
- `http.client.HTTPConnection` (Python 2/3 naming gap: `httplib` is blocked, `http.client` isn't)
- `multiprocessing.Pool` (spawns worker processes via `fork()`, but only `subprocess` is blocked)
- `smtplib.SMTP` (wraps `socket`, which IS blocked, but the wrapper isn't)
- `tempfile.mkstemp` (creates persistent files on disk during deserialization)

The 31 bypass techniques fall into 4 categories: module blocklist gaps, Python 2/3 naming inconsistencies, wrapper modules, and format-specific skips.
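The defensive takeaway: blocklists lose. If you must accept pickle at all, allow-list the exact classes you expect by overriding `Unpickler.find_class`. A minimal sketch, with an illustrative allow-list:

```python
import io
import pickle

# Illustrative allow-list: only these (module, name) globals may load.
ALLOWED = {("collections", "OrderedDict"), ("builtins", "dict")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse any global not on the explicit allow-list.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Safer still is avoiding pickle entirely where you can: safetensors for model weights, JSON for configs.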
## Pattern 3: SQL Injection

Found in: 18% of repos with database integrations

Even with ORMs, developers drop to raw SQL for complex queries. When they do, they concatenate user input directly.
```python
# VULNERABLE — real pattern from a data platform
def search_documents(query, db_type):
    if db_type == "postgres":
        sql = f"SELECT * FROM docs WHERE content LIKE '%{query}%'"
        cursor.execute(sql)
```
I found SQL injection and NoSQL injection in a popular LLM framework's database connectors — the query builder trusted user input for column names and filter values.
The fix is always parameterized queries:
```python
cursor.execute("SELECT * FROM docs WHERE content LIKE %s", [f"%{query}%"])
```
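The difference is easy to demonstrate with stdlib `sqlite3` (which uses `?` placeholders rather than `%s`; the principle is identical). The attacker input breaks out of the string literal in the concatenated version but stays inert in the parameterized one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (content TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)", [("public note",), ("secret token",)])

query = "%' OR '1'='1"  # attacker-controlled input

# Concatenation: the quote breaks out and the clause matches every row.
rows_bad = conn.execute(
    f"SELECT * FROM docs WHERE content LIKE '%{query}%'"
).fetchall()

# Parameterized: the whole string is treated as one LIKE pattern.
rows_ok = conn.execute(
    "SELECT * FROM docs WHERE content LIKE ?", [f"%{query}%"]
).fetchall()

print(len(rows_bad), len(rows_ok))  # prints "2 0"
```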
## Pattern 4: SSRF

Found in: 29% of repos with HTTP client functionality

Any feature that fetches a URL based on user input is an SSRF candidate. ML frameworks are especially vulnerable because they download models, datasets, and configs from URLs.
```python
# VULNERABLE
@app.route('/fetch-model')
def fetch_model():
    url = request.args.get('url')
    response = requests.get(url)  # SSRF — attacker can hit internal services
    return response.content
```
Cloud metadata endpoints (`169.254.169.254`) are the classic target. But the real damage comes from hitting internal APIs that assume network-level trust.
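One reasonable first layer, sketched here with stdlib only: resolve the host yourself and refuse private, loopback, link-local, and reserved ranges before fetching. (On its own this does not stop DNS rebinding; for that, connect to the IP you validated rather than re-resolving.)

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Blocks 10/8, 172.16/12, 192.168/16, 127/8, 169.254/16, etc.
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_safe_url("http://169.254.169.254/latest/meta-data/"))  # False
```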
## Pattern 5: exec/eval on Untrusted Input

Found in: 15% of repos (but 34% of ML/notebook tools)

The worst pattern. I found a critical RCE in a popular ML UI framework where a single unauthenticated POST request writes arbitrary Python code to a file and `exec()` runs it within 50ms.
```python
# VULNERABLE — real pattern from an ML UI framework
@app.post("/run-code/")
async def run_code(request):
    code = (await request.json())["code"]
    exec(code)  # Full RCE, no auth required
```
This one was particularly bad because the endpoint had no authentication. Any network-adjacent attacker gets full code execution on the server.
## Pattern 6: Path Traversal in File Serving

Found in: 21% of web-facing repos

Serving files based on user-supplied paths without proper sanitization:
```python
# VULNERABLE
@app.route('/files/<path:filename>')
def serve_file(filename):
    return send_file(os.path.join(UPLOAD_DIR, filename))
```
An attacker requests `/files/../../../etc/passwd`. The fix is `os.path.realpath()` plus a prefix check, the same as the zip slip fix.
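A sketch of that fix as a framework-agnostic helper (the `/srv/uploads` location is illustrative). In the Flask handler you would return `send_file(resolve_upload_path(filename))`; Flask's `send_from_directory` performs an equivalent containment check for you:

```python
import os

UPLOAD_DIR = os.path.realpath("/srv/uploads")  # assumed upload location

def resolve_upload_path(filename: str) -> str:
    """Return the real path for filename, or raise if it escapes UPLOAD_DIR."""
    target = os.path.realpath(os.path.join(UPLOAD_DIR, filename))
    # Trailing separator so "/srv/uploads-evil" can't slip past the check.
    if not target.startswith(UPLOAD_DIR + os.sep):
        raise PermissionError(f"Path traversal detected: {filename}")
    return target
```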
## Pattern 7: Missing Authentication

Found in: 12% of repos (but nearly universal in "local-first" tools that get deployed to servers)

Tools designed for local development often expose admin endpoints without auth. When someone deploys them on a server (which always happens), those endpoints become attack surface.
```python
# VULNERABLE — "it's just a local tool"
@app.route('/admin/delete-all', methods=['POST'])
def delete_all():
    db.drop_all()
    return "Done"
```
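Even a single shared token beats nothing. A minimal framework-agnostic sketch (`ADMIN_TOKEN` and the decorator are illustrative; `hmac.compare_digest` avoids timing leaks in the comparison):

```python
import functools
import hmac
import os

ADMIN_TOKEN = os.environ.get("ADMIN_TOKEN", "change-me")  # assumed config

def require_token(handler):
    @functools.wraps(handler)
    def wrapper(token: str, *args, **kwargs):
        # Constant-time comparison of the presented token.
        if not hmac.compare_digest(token, ADMIN_TOKEN):
            raise PermissionError("invalid admin token")
        return handler(*args, **kwargs)
    return wrapper

@require_token
def delete_all():
    return "Done"  # db.drop_all() would go here
```

In a real web app the token would arrive in a header and the guard would return a 401 instead of raising.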
## The Numbers

After 100+ repos:
| Pattern | Prevalence | Avg Bounty | Difficulty to Find |
|---|---|---|---|
| Pickle deserialization | 41% of ML repos | $500-2,000 | Easy (grep for `pickle.load`) |
| SSRF | 29% of HTTP repos | $500-1,500 | Medium |
| Zip slip | 23% of upload repos | $1,000-1,500 | Easy (grep for `extractall`) |
| Path traversal | 21% of file-serving repos | $500-1,000 | Easy |
| SQL injection | 18% of DB repos | $500-3,000 | Medium |
| eval/exec | 15% overall | $1,000-3,000 | Easy (grep for `eval`/`exec`) |
| Missing auth | 12% overall | $500-1,500 | Hard (need to understand the intended access model) |
## How to Replicate This

The fastest way to start:
1. Semgrep for the obvious stuff:

   ```bash
   semgrep --config=p/python-security --config=p/owasp-top-ten .
   ```
2. Custom grep for ML-specific patterns:

   ```bash
   # Dangerous deserialization
   grep -rn "pickle\.load\|joblib\.load\|torch\.load\|yaml\.load" --include="*.py" .

   # Dangerous execution
   grep -rn "eval(\|exec(\|compile(" --include="*.py" .

   # Archive extraction without validation
   grep -rn "extractall\|extract(" --include="*.py" .
   ```
3. Trace user input to sinks. This is the step that catches what automated tools miss. Follow every `request.args`, `request.json`, and function parameter from an API endpoint, and see where it ends up.
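Step 2 above can be pushed past grep with a small AST walk, which skips matches inside strings and comments. A minimal sketch that flags calls to dangerous sinks by name (the sink list is illustrative, not exhaustive, and this does no data-flow analysis):

```python
import ast

SINKS = {"eval", "exec", "pickle.load", "pickle.loads", "yaml.load", "torch.load"}

def call_name(node: ast.Call) -> str:
    """Render a call target like 'pickle.load' or 'eval' (best effort)."""
    func = node.func
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    if isinstance(func, ast.Name):
        return func.id
    return ""

def find_sinks(source: str, filename: str = "<memory>"):
    hits = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Call) and call_name(node) in SINKS:
            hits.append((filename, node.lineno, call_name(node)))
    return hits

code = "import pickle\nmodel = pickle.load(open(p, 'rb'))\n"
print(find_sinks(code))  # [('<memory>', 2, 'pickle.load')]
```

Pointing this at every `.py` file in a repo gets you a triage list in seconds; the manual work is then deciding which hits are reachable from untrusted input.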
I automated most of this into a Claude Code skill that chains Semgrep scanning with custom pattern matching and generates structured reports. It catches patterns 1-6 automatically and flags pattern 7 for manual review.
## Lessons Learned

- **ML repos are softer targets than web frameworks.** Web devs have 20 years of OWASP education. ML engineers are focused on model accuracy, not input validation.
- **Scanner bypasses are everywhere.** The most popular ML model scanner has 31 distinct bypass techniques. Don't trust a single tool; layer your defenses.
- **"Local-only" tools always end up on servers.** If it has a web interface, assume it will be network-accessible. Add auth.
- **The money is in specificity.** A generic "you have XSS" report gets ignored. A working PoC that demonstrates data exfiltration gets paid.
- **Volume beats perfection.** Scanning 100 repos with basic patterns found more real bugs than deep-diving 5 repos with advanced techniques.
If you're building developer tools, I also built a Dashboard Builder skill for creating monitoring dashboards and an API Connector for integrating with vulnerability databases and security APIs.
All vulnerabilities mentioned have been responsibly disclosed through appropriate bounty programs before publication.