Azure Foundry Local: What It Is, Why It’s Different, and When It Matters

So what is Foundry Local?

Foundry Local is Microsoft’s way of letting apps run AI models directly on your device. No cloud, no Azure account, no token costs. Once a model is downloaded, it runs fully offline on Windows, macOS with Apple Silicon, and Android.

At first glance, it looks similar to tools like Ollama or LM Studio. They all run models locally and expose APIs. But the real difference is how it’s packaged.

Ollama and similar tools run as separate services. Users install them, then your app talks to them over localhost.

Foundry Local flips that idea. It’s an SDK you bundle inside your own app. You ship it with your installer. The runtime is small, around 20 MB, and your app becomes self-contained. It downloads models when needed, caches them locally, and automatically uses the right hardware, whether that’s NVIDIA, AMD, Apple Silicon, or even NPUs.

In simple terms, instead of asking users to install an AI runtime, your app is the runtime.
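The download-and-cache behavior described above is a familiar pattern, and it helps to see it spelled out. This is an illustrative sketch only, not Foundry Local's actual internals; the function name and cache directory are made up for the example:

```python
from pathlib import Path
from typing import Callable

def ensure_model(name: str, fetch: Callable[[str], bytes],
                 cache_dir: Path = Path.home() / ".myapp" / "models") -> Path:
    """Return the local path of a model, downloading it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / name
    if not path.exists():
        path.write_bytes(fetch(name))  # first run: download and cache
    return path                        # later runs: reuse the cached copy
```

The first call pays the download cost; every call after that is a local file read. That is the user experience Foundry Local is aiming for, just handled by the bundled runtime instead of your own code.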


Why that packaging actually matters

If you’ve ever depended on something like Ollama in a real product, you already know the pain.

Some users won’t install it. Others install the wrong version. Ports conflict. IT blocks background services. Suddenly you’re debugging someone else’s setup instead of your own app.

Foundry Local removes all of that. Everything lives inside your application. No external dependency, no background service, no guessing what environment the user has.

That’s really the core idea. It’s built for companies shipping software, not for people experimenting with models.


How it’s used

There are two main ways this shows up.

One is on devices, which is what most people will touch. You use SDKs in common languages and embed AI directly into your app.

The other is a more enterprise setup running on Azure Local with Kubernetes for edge environments like factories or hospitals.
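For the on-device path, Microsoft publishes SDKs, and the Python one (the foundry-local-sdk package) follows the pattern below. Treat this as a hedged sketch: the FoundryLocalManager API and the model alias come from the published docs, exact signatures may differ by version, and actually running it requires the Foundry Local runtime on the machine.

```python
def ask_local_model(prompt: str, alias: str = "phi-3.5-mini") -> str:
    # Imports live inside the function because both packages are optional
    # here; nothing below runs without Foundry Local installed.
    import openai                                  # OpenAI-compatible client
    from foundry_local import FoundryLocalManager  # the bundled SDK

    # Starting the manager downloads the model on first use, caches it,
    # and selects the best available hardware (CPU, GPU, or NPU).
    manager = FoundryLocalManager(alias)
    client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
    resp = client.chat.completions.create(
        model=manager.get_model_info(alias).id,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Note the shape: the app talks to a local OpenAI-compatible endpoint that the SDK itself starts, so there is no separately installed service to depend on.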

One important detail. Foundry Local is designed for single-user scenarios. One app, one user, one model at a time. It’s not trying to be a shared AI server.

If you need high concurrency, something like vLLM is still the better choice.


Where it sits compared to other tools

To make sense of it, it helps to see the roles each tool plays.

llama.cpp is the core inference engine that kick-started running LLMs locally. Fast, simple, and widely used.

Ollama makes it easy to download and run models quickly. It’s the easiest entry point for most developers.

LM Studio is more of a user-friendly interface for exploring models.

vLLM is built for scale and handling multiple users at once.

Foundry Local sits in a different spot. It’s not about running or serving models. It’s about shipping them inside applications.


The important part most people miss: model formats

Each system is built around a specific format.

GGUF is what most local tools use. It’s simple, portable, and heavily optimized for running models efficiently on CPUs and GPUs.

ONNX, which Foundry Local uses, is different. It doesn’t just store weights. It stores the full computation graph. Basically, it describes exactly how the model runs.

That makes it hardware-agnostic. You can run the same model across different devices and let the runtime figure out whether to use CPU, GPU, or NPU.
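To make that concrete, here is what loading one ONNX model with ONNX Runtime looks like. The API calls (InferenceSession, get_available_providers) are real ONNX Runtime, but the model path is a placeholder and the import is kept inside the function since onnxruntime may not be installed:

```python
def run_onnx(model_path: str, inputs: dict):
    import onnxruntime as ort  # optional dependency, imported lazily

    # The same .onnx file works everywhere; the providers list decides
    # whether it executes on an NPU, a GPU, or falls back to the CPU.
    session = ort.InferenceSession(
        model_path, providers=ort.get_available_providers()
    )
    return session.run(None, inputs)  # None = return all model outputs
```

The point is what's missing: there is no per-vendor branch in the application code. The same call works on a machine with a Qualcomm NPU and on one with nothing but a CPU.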

There’s also MLX, which is optimized specifically for Apple Silicon and performs very well there, but doesn’t really exist outside that ecosystem.

So the tradeoff is pretty clear. GGUF gives you the biggest ecosystem. ONNX gives you the most flexibility across hardware. MLX gives you peak performance on Apple devices.


Why Microsoft is doing this now

This part is actually the real story.

Hardware is changing. CPUs aren’t getting dramatically better for this kind of workload. GPUs are great but not always available on enterprise machines.

Meanwhile, NPUs are showing up everywhere. Intel, AMD, Qualcomm, Apple. New laptops increasingly have dedicated AI hardware.

The problem is each vendor has its own way of using that hardware. Without a common layer, developers would have to write separate code for each one.

That doesn’t scale.

This is where ONNX Runtime comes in. It acts as a bridge. One model, one API, and it runs on whatever hardware is available.

Foundry Local is essentially Microsoft building a developer-friendly layer on top of that idea.
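The "one model, one API, any hardware" bridge ultimately comes down to provider selection. Here is a toy sketch of that idea (my own illustration, not Foundry Local's code) using real ONNX Runtime execution-provider names: try accelerators in priority order, then fall back to the CPU, which is always present.

```python
# Real ONNX Runtime execution-provider names, ordered by preference.
PREFERRED = [
    "QNNExecutionProvider",     # Qualcomm NPU
    "CUDAExecutionProvider",    # NVIDIA GPU
    "CoreMLExecutionProvider",  # Apple Silicon
    "CPUExecutionProvider",     # universal fallback
]

def pick_provider(available: list[str]) -> str:
    """Return the best supported provider on this machine."""
    for provider in PREFERRED:
        if provider in available:
            return provider
    return "CPUExecutionProvider"
```

Every vendor only has to ship one execution provider, and every app only has to target one runtime. That is the leverage Microsoft is after.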


So does it actually matter?

It depends on what you’re doing.

If you’re building a real application and want AI built in, this matters a lot. It solves distribution, compatibility, and hardware acceleration in one go.

If you’re experimenting or running models for yourself, it probably doesn’t. Tools like Ollama are still faster to get started with and have way more ready-to-use models.

If you’re building something that needs to serve many users, Foundry Local isn’t the right fit yet. That’s still a job for vLLM or similar systems.


The simple way to think about it

Each tool has a clear role.

Ollama is for running models.
vLLM is for serving models.
Foundry Local is for shipping models inside apps.

That’s really it.

The bigger picture is where things get interesting. If NPUs become standard in everyday devices, then whoever controls the layer that connects apps to that hardware becomes very important. Microsoft is betting that layer will be ONNX Runtime, and Foundry Local is how developers interact with it.

Whether that bet pays off depends on how many real apps start using it. But the direction is already clear.