01 / 08

COMP 312 — Open Source Computing · Spring 2026

c0rtex

layered defenses against indirect prompt injection

MIT License

Ollama + Qwen3.5

Python / Flask

~30 Tools

Defense-in-Depth

02 / 08

// what is c0rtex

Local. Private. Free.

Personal AI assistant running entirely on local hardware — no API costs
Python wrapper for Ollama + Qwen3.5, chat via Matrix
Define a SOUL (system prompt), schedule tasks with cron
~30 tools: file ops, system monitoring, web browsing, CTF helpers, 3rd-party integrations
All tools guardrailed — scoped dirs, auto backups, no raw shell

USER

→

c0rtex loop

→

Ollama

→

tool calls

→

Ollama

→

response

★ Class contribution: the base project was a pre-existing hobby. PR #4 on main is what we built for COMP 312.

03 / 08

// the threat

Why This Matters

Direct Injection

User tells the AI to misbehave

A model-alignment problem. Scoped to what a user can type. Easier to detect and handle.

⚠ Indirect Injection — the real threat

Malicious instructions in web content

c0rtex browses the web autonomously. A poisoned page can silently instruct it to overwrite its SOUL, delete files, or exfiltrate data — without the user ever seeing it.

04 / 08

// defense layers — PR #4

5 Layers of Defense

Jakub — your part goes here. Send it over.

01

Content isolation wrapper

[ jakub fills this in ]

02

System prompt hardening

[ jakub fills this in ]

03

Keyword blocklist

[ jakub fills this in ]

04

Command whitelist

[ jakub fills this in ]

05

Security logging

[ jakub fills this in ]

05 / 08

// evaluation setup

How We Tested It

6 attack pages — each disguised as a modular synth article
Tested across 3 model sizes: qwen3.5:2b, 4b, 9b
Run once with full mitigations, once with all layers stripped
Each page plants unique marker strings — grading script checks if the model parrots them back

PAGE 00

Control

No injection — clean baseline

PAGE 01

Overt

"Ignore previous instructions"

PAGE 02

Hidden CSS/HTML

Same-color text, zero font, offscreen, comments

PAGE 03

Social Engineering

Fake editor's note, polite exfil requests

PAGE 04

Blocklist Evasion

Synonyms, whitespace, unicode, non-English

PAGE 05

Tool Abuse

Read /etc/passwd, write files, overwrite SOUL

06 / 08

// results

The Data

Overall defense rate by model

2b: 88% · 4b: 100% · 9b: 94% — all improved significantly with mitigations active

Defense rate by attack class

Hidden CSS hurt the 2b (67%). Social engineering fooled the 9b (50%). Blocklist bypass + tool abuse held across all sizes.

Status heatmap: test × model

3 failures across 54 tests. Content isolation wrapper stopped the majority of attacks.

Status breakdown per model

4b: clean sweep. 2b and 9b: isolated failures only.

07 / 08

// key findings + limitations

What We Learned

Defense-in-depth works — no single layer is enough, but stacking them measurably raised the bar
Bigger ≠ safer — the 9b model fell for social engineering that the 4b blocked. More capable models are more persuadable.
The content isolation wrapper did the most work — a simple tag outperformed more complex layers
Social engineering is the hardest to defend — no trigger words, just persuasion. The most realistic real-world threat. The blocklist can't catch what it can't pattern-match
All defenses ultimately depend on model behavior — no cryptographic enforcement Future: semantic filtering via secondary LLM, stronger content sandboxing

08 / 08

// open source

It's on GitHub.

The entire project is available as an open-source repository.