I Scanned 100 Open Source Repos for Security Vulnerabilities — Here Are the 7 Patterns That Keep Repeating

Tags: security, opensource, python, machinelearning
By manja316


Over the past month, I've been systematically scanning open source repositories for security vulnerabilities. Not hypothetical ones — real, exploitable bugs that I've reported through bounty programs.

After scanning 100+ repos across ML frameworks, web servers, and developer tools, the same patterns keep showing up. Here's what I found, with real code examples.

The Setup

I use a combination of Semgrep rules, custom Python scripts, and manual code review. The workflow:

  1. Clone the repo
  2. Run automated scanners (Semgrep + custom rules for ML-specific bugs)
  3. Trace data flows manually from user input to dangerous sinks
  4. Build a proof-of-concept
  5. Verify on the latest release

If you want to automate step 2, I built a Security Scanner Skill for Claude Code that runs 50+ checks in one pass.

Pattern 1: Zip Slip in Archive Extraction

Found in: 23% of repos that handle file uploads

The classic: extracting a zip/tar without validating that file paths stay inside the target directory.

# VULNERABLE — real pattern from a top ML framework
def extract_archive(archive_path, dest_dir):
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)  # No path validation!

An attacker crafts a zip with entries like ../../etc/cron.d/backdoor. The fix:

import os
import zipfile

def safe_extract(archive_path, dest_dir):
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            # Compare against root + separator: a bare startswith(dest_root)
            # would let a sibling like "/data/extract-evil" pass for "/data/extract"
            if not target.startswith(dest_root + os.sep):
                raise ValueError(f"Path traversal detected: {info.filename}")
        zf.extractall(dest_root)

I found two distinct zip slip vulnerabilities in a single ML experiment tracking framework — both in functions that extract model artifacts. Combined bounty value: $3,000.
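To see the pattern end to end, here's a minimal PoC sketch: build a zip in memory whose entry name escapes the destination, then show that a realpath-based containment check catches it. The helper name `contains_traversal` is mine for illustration, not from any framework.

```python
import io
import os
import zipfile

def contains_traversal(archive, dest_dir):
    """Return True if any entry would resolve outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    for info in archive.infolist():
        target = os.path.realpath(os.path.join(dest_root, info.filename))
        if not target.startswith(dest_root + os.sep):
            return True
    return False

# Craft a malicious archive in memory; no disk I/O needed for the demo.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("../../etc/cron.d/backdoor", "* * * * * root /tmp/x\n")
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    print(contains_traversal(zf, "/srv/extract"))  # True
```

Running the check before extraction (rather than after) matters: once `extractall` has run, the cron file already exists.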

Pattern 2: Pickle/Joblib Deserialization Without Sandboxing

Found in: 41% of ML repos

This is the big one in ML. pickle.load() executes arbitrary Python code. Everyone knows this. Nobody fixes it.

# VULNERABLE — seen in multiple model serving frameworks
def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)  # RCE if attacker controls the file

I discovered 31 distinct techniques to bypass the most popular ML model security scanner. The scanner blocks os.system and subprocess, but misses:

  • http.client.HTTPConnection (Python 2/3 naming gap — httplib is blocked, http.client isn't)
  • multiprocessing.Pool (spawns worker processes via fork(), but only subprocess is blocked)
  • smtplib.SMTP (wraps socket which IS blocked, but the wrapper isn't)
  • tempfile.mkstemp (creates persistent files on disk during deserialization)

31 bypass techniques across 4 categories: module blocklist gaps, Python 2/3 naming inconsistencies, wrapper modules, and format-specific skips.
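The underlying mechanics are simple enough to demo harmlessly. Any object can define `__reduce__`, and `pickle.load()` calls whatever callable it returns during deserialization. This payload just calls `print`, but any importable function works; it's a textbook illustration, not one of the 31 bypasses.

```python
import io
import pickle

class Payload:
    def __reduce__(self):
        # On unpickling, pickle calls print(...) — swap in any callable.
        return (print, ("code ran during pickle.load",))

blob = pickle.dumps(Payload())

# "Loading the model" executes the payload; the loaded object is
# just print's return value, None.
result = pickle.load(io.BytesIO(blob))
print(result is None)  # True
```

This is why blocklist-based scanners are a losing game: the attacker needs any one callable with a useful side effect, and the standard library is full of them.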

Pattern 3: SQL Injection in ORM Query Builders

Found in: 18% of repos with database integrations

Even with ORMs, developers drop to raw SQL for complex queries. When they do, they concatenate user input directly.

# VULNERABLE — real pattern from a data platform
def search_documents(query, db_type):
    if db_type == "postgres":
        sql = f"SELECT * FROM docs WHERE content LIKE '%{query}%'"
        cursor.execute(sql)

I found SQL injection and NoSQL injection in a popular LLM framework's database connectors — the query builder trusted user input for column names and filter values.

The fix is always parameterized queries:

cursor.execute("SELECT * FROM docs WHERE content LIKE %s", [f"%{query}%"])
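Here's the behavioral difference, sketched with the stdlib sqlite3 driver (SQLite uses `?` placeholders; Postgres drivers like psycopg2 use `%s`, as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (content TEXT)")
conn.execute("INSERT INTO docs VALUES ('public note')")
conn.execute("INSERT INTO docs VALUES ('secret token')")

# Classic payload: closes the quote, ORs in a tautology, comments out the rest.
query = "' OR 1=1 --"

# Vulnerable: string interpolation. The tautology matches every row.
leaked = conn.execute(
    f"SELECT * FROM docs WHERE content LIKE '%{query}%'"
).fetchall()
print(len(leaked))  # 2 — both rows leak

# Parameterized: the driver binds the value as data, never as SQL.
safe = conn.execute(
    "SELECT * FROM docs WHERE content LIKE ?", [f"%{query}%"]
).fetchall()
print(len(safe))  # 0 — no row contains the literal payload
```

Note that parameters only cover values. Column names and sort directions, the exact gap in that LLM framework's query builder, must be validated against an allowlist instead.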

Pattern 4: SSRF Through URL Parameters

Found in: 29% of repos with HTTP client functionality

Any feature that fetches a URL based on user input is an SSRF candidate. ML frameworks are especially vulnerable because they download models, datasets, and configs from URLs.

# VULNERABLE
@app.route('/fetch-model')
def fetch_model():
    url = request.args.get('url')
    response = requests.get(url)  # SSRF — attacker can hit internal services
    return response.content

Cloud metadata endpoints (169.254.169.254) are the classic target. But the real damage comes from hitting internal APIs that assume network-level trust.
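A first-line mitigation is to resolve the hostname and reject private, loopback, and link-local destinations before fetching. A minimal sketch with the stdlib (function name is mine); note it's not a complete defense, since DNS rebinding can change the answer between this check and the actual request, so production code should also pin the resolved IP when connecting:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    """Reject URLs whose host resolves to private/loopback/link-local space."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        return False
    for family, _, _, _, sockaddr in infos:
        try:
            ip = ipaddress.ip_address(sockaddr[0])
        except ValueError:
            return False
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_safe_url("http://169.254.169.254/latest/meta-data/"))  # False
```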

Pattern 5: eval()/exec() on User-Controlled Input

Found in: 15% of repos (but 34% of ML/notebook tools)

The worst pattern. I found a critical RCE in a popular ML UI framework where a single unauthenticated POST request writes arbitrary Python code to a file and exec() runs it within 50ms.

# VULNERABLE — real pattern from an ML UI framework
@app.post("/run-code/")
async def run_code(request):
    code = (await request.json())["code"]
    exec(code)  # Full RCE, no auth required

This one was particularly bad because the endpoint had no authentication. Any network-adjacent attacker gets full code execution on the server.
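When an endpoint genuinely only needs to accept data (hyperparameters, config dicts) rather than code, `ast.literal_eval` is the standard safe subset: it parses Python literals and rejects anything that would execute. This is a general mitigation sketch, not the framework's actual fix.

```python
import ast

# Parses literals only: strings, numbers, tuples, lists, dicts, sets, booleans.
config = ast.literal_eval("{'lr': 0.01, 'epochs': 10}")
print(config["epochs"])  # 10

# Anything with a call, attribute access, or import is rejected outright.
try:
    ast.literal_eval("__import__('os').system('id')")
except ValueError:
    print("rejected")
```

If you truly need to run user-supplied code, the answer is an isolated sandbox (separate container or VM with no credentials), not a blocklist around `exec`.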

Pattern 6: Path Traversal in File Serving

Found in: 21% of web-facing repos

Serving files based on user-supplied paths without proper sanitization:

# VULNERABLE
@app.route('/files/<path:filename>')
def serve_file(filename):
    return send_file(os.path.join(UPLOAD_DIR, filename))

An attacker requests /files/../../../etc/passwd. The fix is os.path.realpath() + prefix check, same as the zip slip fix.
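That realpath-plus-prefix idea factors into a small reusable helper (names are mine for illustration):

```python
import os

def resolve_under(root, user_path):
    """Resolve user_path under root; return None if it escapes."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, user_path))
    # Compare against root + separator so "/srv/uploads-evil" can't pass.
    if target == root or target.startswith(root + os.sep):
        return target
    return None

print(resolve_under("/srv/uploads", "../../../etc/passwd"))  # None
```

In the Flask route above, you'd call this on `filename` and return 404 when it yields None, rather than revealing that traversal was detected.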

Pattern 7: Missing Authentication on Sensitive Endpoints

Found in: 12% of repos (but nearly universal in "local-first" tools that get deployed to servers)

Tools designed for local development often expose admin endpoints without auth. When someone deploys them on a server (which always happens), those endpoints become attack surface.

# VULNERABLE — "it's just a local tool"
@app.route('/admin/delete-all', methods=['POST'])
def delete_all():
    db.drop_all()
    return "Done"
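Even a single shared token beats nothing for a "local" tool. A framework-agnostic sketch (the token handling and names are mine, not from any specific project):

```python
import hmac

# In a real deployment, load this from an environment variable or
# secrets manager, never a hardcoded literal like this demo value.
ADMIN_TOKEN = "change-me"

def is_authorized(auth_header):
    """Constant-time check of an 'Authorization: Bearer <token>' header."""
    expected = f"Bearer {ADMIN_TOKEN}"
    # hmac.compare_digest avoids leaking the token via timing differences.
    return hmac.compare_digest(auth_header or "", expected)

print(is_authorized("Bearer change-me"))  # True
print(is_authorized("Bearer guess"))      # False
print(is_authorized(None))                # False
```

Wire it in as a check at the top of every destructive route (or as middleware), so forgetting a decorator on one endpoint fails closed rather than open.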

The Numbers

After 100+ repos:

| Pattern | Prevalence | Avg Bounty | Difficulty to Find |
| --- | --- | --- | --- |
| Pickle deserialization | 41% of ML repos | $500-2,000 | Easy (grep for pickle.load) |
| SSRF | 29% of HTTP repos | $500-1,500 | Medium |
| Zip slip | 23% of upload repos | $1,000-1,500 | Easy (grep for extractall) |
| Path traversal | 21% of file-serving repos | $500-1,000 | Easy |
| SQL injection | 18% of DB repos | $500-3,000 | Medium |
| eval/exec | 15% overall | $1,000-3,000 | Easy (grep for eval/exec) |
| Missing auth | 12% overall | $500-1,500 | Hard (requires understanding the intended access model) |

How to Scan Your Own Projects

The fastest way to start:

1. Semgrep for the obvious stuff:

semgrep --config=p/python-security --config=p/owasp-top-ten .

2. Custom grep for ML-specific patterns:

# Dangerous deserialization
grep -rn "pickle\.load\|joblib\.load\|torch\.load\|yaml\.load" --include="*.py" .

# Dangerous execution
grep -rn "eval(\|exec(\|compile(" --include="*.py" .

# Archive extraction without validation
grep -rn "extractall\|extract(" --include="*.py" .

3. Trace user input to sinks. This is the step that catches what automated tools miss. Follow every request.args, request.json, function parameter from an API endpoint, and see where it ends up.

I automated most of this into a Claude Code skill that chains Semgrep scanning with custom pattern matching and generates structured reports. It catches patterns 1-6 automatically and flags pattern 7 for manual review.

What I Learned

  1. ML repos are softer targets than web frameworks. Web devs have 20 years of OWASP education. ML engineers are focused on model accuracy, not input validation.

  2. Scanner bypasses are everywhere. The most popular ML model scanner has 31 distinct bypass techniques. Don't trust a single tool — layer your defenses.

  3. "Local-only" tools always end up on servers. If it has a web interface, assume it will be network-accessible. Add auth.

  4. The money is in specificity. A generic "you have XSS" report gets ignored. A working PoC that demonstrates data exfiltration gets paid.

  5. Volume beats perfection. Scanning 100 repos with basic patterns found more real bugs than deep-diving 5 repos with advanced techniques.

If you're building developer tools, I also built a Dashboard Builder skill for creating monitoring dashboards and an API Connector for integrating with vulnerability databases and security APIs.


All vulnerabilities mentioned have been responsibly disclosed through appropriate bounty programs before publication.