Azure Foundry Local: What It Is, Why It’s Different, and When It Matters

So what is Foundry Local?

Foundry Local is Microsoft’s way of letting apps run AI models directly on your device. No cloud, no Azure account, no token costs. It runs fully offline on Windows, macOS with Apple Silicon, and Android.

At first glance, it looks similar to tools like Ollama or LM Studio. They all run models locally and expose APIs. But the real difference is how it’s packaged.

Ollama and similar tools run as separate services. Users install them, then your app talks to them over localhost.

Foundry Local flips that idea. It’s an SDK you bundle inside your own app. You ship it with your installer. The runtime is small, around 20 MB, and your app becomes self-contained. It downloads models when needed, caches them locally, and automatically uses the right hardware, whether that’s NVIDIA, AMD, Apple Silicon, or even NPUs.

In simple terms, instead of asking users to install an AI runtime, your app is the runtime.


Why that packaging actually matters

If you’ve ever depended on something like Ollama in a real product, you already know the pain.

Some users won’t install it. Others install the wrong version. Ports conflict. IT blocks background services. Suddenly you’re debugging someone else’s setup instead of your own app.

Foundry Local removes all of that. Everything lives inside your application. No external dependency, no background service, no guessing what environment the user has.

That’s really the core idea. It’s built for companies shipping software, not for people experimenting with models.


How it’s used

There are two main ways this shows up.

One is on devices, which is what most people will touch. You use SDKs in common languages and embed AI directly into your app.

The other is a more enterprise setup running on Azure Local with Kubernetes for edge environments like factories or hospitals.

One important detail. Foundry Local is designed for single-user scenarios. One app, one user, one model at a time. It’s not trying to be a shared AI server.

If you need high concurrency, something like vLLM is still the better choice.
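
To make the first, on-device path concrete, here is a minimal sketch of what embedding can look like from Python. Foundry Local exposes an OpenAI-compatible endpoint on the local machine, so a standard OpenAI client can talk to it; the base URL, port, and model alias below are placeholders, since in a real app the Foundry Local SDK downloads the model and hands you the actual endpoint and model id.

from openai import OpenAI

# Placeholder values: the Foundry Local runtime assigns the real port and model id.
client = OpenAI(
    base_url="http://localhost:5273/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # the local endpoint does not check keys
)

response = client.chat.completions.create(
    model="phi-3.5-mini",  # placeholder alias for whatever model your app ships with
    messages=[{"role": "user", "content": "Summarize this note in one sentence."}],
)
print(response.choices[0].message.content)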


Where it sits compared to other tools

To make sense of it, it helps to see the roles each tool plays.

llama.cpp is the core engine that started local LLMs. Fast, simple, and widely used.

Ollama makes it easy to download and run models quickly. It’s the easiest entry point for most developers.

LM Studio is more of a user-friendly interface for exploring models.

vLLM is built for scale and handling multiple users at once.

Foundry Local sits in a different spot. It’s not about running or serving models. It’s about shipping them inside applications.


The important part most people miss: model formats

Each system is built around a specific format.

GGUF is what most local tools use. It’s simple, portable, and heavily optimized for running models efficiently on CPUs and GPUs.

ONNX, which Foundry Local uses, is different. It doesn’t just store weights. It stores the full computation graph. Basically, it describes exactly how the model runs.

That makes it hardware-agnostic. You can run the same model across different devices and let the runtime figure out whether to use CPU, GPU, or NPU.

There’s also MLX, which is optimized specifically for Apple Silicon and performs very well there, but doesn’t really exist outside that ecosystem.

So the tradeoff is pretty clear. GGUF gives you the biggest ecosystem. ONNX gives you the most flexibility across hardware. MLX gives you peak performance on Apple devices.


Why Microsoft is doing this now

This part is actually the real story.

Hardware is changing. CPUs aren’t getting dramatically better for this kind of workload. GPUs are great but not always available on enterprise machines.

Meanwhile, NPUs are showing up everywhere. Intel, AMD, Qualcomm, Apple. New laptops increasingly have dedicated AI hardware.

The problem is each vendor has its own way of using that hardware. Without a common layer, developers would have to write separate code for each one.

That doesn’t scale.

This is where ONNX Runtime comes in. It acts as a bridge. One model, one API, and it runs on whatever hardware is available.
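
A small example of that bridge in practice, assuming a generic model.onnx on disk: with ONNX Runtime you state which execution providers you prefer, and the runtime picks the best one the machine actually has.

import onnxruntime as ort

# Preference order: NVIDIA GPU, then Apple Silicon, then plain CPU as the fallback.
preferred = ["CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider"]
available = ort.get_available_providers()

session = ort.InferenceSession(
    "model.onnx",  # any ONNX model file
    providers=[p for p in preferred if p in available],
)
print(session.get_providers())  # shows which provider was actually selected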

Foundry Local is essentially Microsoft building a developer-friendly layer on top of that idea.


So does it actually matter?

It depends on what you’re doing.

If you’re building a real application and want AI built in, this matters a lot. It solves distribution, compatibility, and hardware acceleration in one go.

If you’re experimenting or running models for yourself, it probably doesn’t. Tools like Ollama are still faster to get started with and have way more ready-to-use models.

If you’re building something that needs to serve many users, Foundry Local isn’t the right fit yet. That’s still a job for vLLM or similar systems.


The simple way to think about it

Each tool has a clear role.

  • Ollama is for running models.
  • vLLM is for serving models.
  • Foundry Local is for shipping models inside apps.

That's really it.

The bigger picture is where things get interesting. If NPUs become standard in everyday devices, then whoever controls the layer that connects apps to that hardware becomes very important. Microsoft is betting that layer will be ONNX Runtime, and Foundry Local is how developers interact with it.

Whether that bet pays off depends on how many real apps start using it. But the direction is already clear.

Deploying JAIS AI: Docker vs Native Performance Analysis with Python Implementation

Building a high-performance Arabic-English AI deployment solution with benchmarking


The JAIS (Jebel Jais) AI model represents a breakthrough in bilingual Arabic-English language processing, developed by Inception AI, MBZUAI, and Cerebras Systems. This post details the implementation of a production-ready deployment solution with comprehensive performance analysis comparing Docker containerization versus native Metal GPU acceleration.

In this project, I used the quantized model published at mradermacher/jais-family-30b-16k-chat-i1-GGUF. mradermacher is a recognized quantization specialist in the community, and this version was chosen because:

  • iMatrix Quantization: Advanced i1-Q4_K_M provides superior quality vs static quantization. Research shows that weighted/imatrix quants offer significantly better model quality than classical static quants at the same quantization level
  • GGUF Format: Optimized for llama.cpp inference with Metal GPU acceleration
  • Balanced Performance: Q4_K_M offers the ideal speed/quality/size ratio (25.97 GiB)
  • Production Ready: Pre-quantized and extensively tested for deployment
  • Community Trusted: mradermacher is known for creating high-quality quantizations with automated processes and extensive testing
  • Superior Multilingual Performance: Studies indicate that English imatrix datasets show better results even for non-English inference, as most base models are primarily trained on English
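
The repository's Quick Start pulls this file with a shell script; a rough Python equivalent using huggingface_hub could look like the sketch below (the repo id and filename match the model used here, the local directory is an assumption).

from huggingface_hub import hf_hub_download

# Pulls the ~26 GiB quantized file into a local models/ directory.
path = hf_hub_download(
    repo_id="mradermacher/jais-family-30b-16k-chat-i1-GGUF",
    filename="jais-family-30b-16k-chat.i1-Q4_K_M.gguf",
    local_dir="models",
)
print(f"Model downloaded to {path}")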

Solution Architecture

The deployment solution consists of several key components designed for maximum flexibility and performance:

Project Structure

jais-ai-docker/
├── run.sh                      # Main server launcher
├── test.sh                     # Comprehensive test suite  
├── build.sh                    # Build system (Docker/Native)
├── cleanup.sh                  # Project cleanup utilities
├── Dockerfile                  # ARM64 optimized container
├── src/
│   ├── app.py                  # Flask API server
│   ├── model_loader.py         # GGUF model loader with auto-detection
│   └── requirements.txt        # Python dependencies
├── config/
│   └── performance_config.json # Performance presets
└── models/
    └── jais-family-30b-16k-chat.i1-Q4_K_M.gguf  # Quantized model

Python Implementation Overview

Flask API Server

The core server implements a robust Flask application with proper error handling and environment detection:

import os
import time
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger(__name__)

# Configuration with environment variable support
MODEL_PATH = os.environ.get("MODEL_PATH", "/app/models/jais-family-30b-16k-chat.i1-Q4_K_M.gguf")
CONFIG_PATH = os.environ.get("CONFIG_PATH", "/app/config/performance_config.json")

# model_loaded, jais_loader and model_load_time are module-level globals set
# during model initialization at startup (see src/app.py for the full setup).

@app.route('/chat', methods=['POST'])
def chat():
    """Main chat endpoint with comprehensive error handling."""
    if not model_loaded:
        return jsonify({"error": "Model not loaded"}), 503

    try:
        data = request.get_json(silent=True) or {}  # tolerate missing/invalid JSON bodies
        message = data.get('message', '')
        max_tokens = data.get('max_tokens', 100)

        # Generate response with timing
        start_time = time.time()
        response_data = jais_loader.generate_response(message, max_tokens=max_tokens)
        generation_time = time.time() - start_time

        # Add performance metrics
        response_data['generation_time_seconds'] = round(generation_time, 3)
        response_data['model_load_time_seconds'] = round(model_load_time, 3)

        return jsonify(response_data)

    except Exception as e:
        logger.error(f"Error in chat endpoint: {e}")
        return jsonify({"error": str(e)}), 500

Key Features:

  • Environment Variable Configuration: Flexible path configuration for different deployment modes
  • Performance Metrics: Built-in timing for load time and generation speed
  • Error Handling: Comprehensive exception handling with proper HTTP status codes
  • Health Checks: Monitoring endpoint for deployment orchestration

Complete Flask implementation: src/app.py
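
For a quick sanity check of the endpoint, a small client call is enough. Host and port are assumptions to match your run configuration; apart from the two timing keys added above, the response fields depend on what generate_response returns.

import requests

resp = requests.post(
    "http://localhost:8080/chat",   # assumed host/port
    json={"message": "ما هي عاصمة الإمارات؟", "max_tokens": 100},
    timeout=300,
)
resp.raise_for_status()
data = resp.json()
print(f"Generated in {data['generation_time_seconds']}s")
print(data)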

Smart Model Loader

The model loader implements intelligent environment detection and optimal configuration:

import os
import platform
from typing import Any, Dict


class JaisModelLoader:
    """
    Optimized model loader for mradermacher Jais AI GGUF models with proper error handling
    and resource management.
    """

    def _detect_runtime_environment(self) -> str:
        """Auto-detect the runtime environment and return optimal performance mode."""
        # Treat any Linux-style environment as the container profile:
        # /.dockerenv only exists inside Docker, while /proc/1/cgroup exists on every
        # Linux host, so anything that is not native macOS falls through to 'docker'.
        if os.path.exists('/.dockerenv') or os.path.exists('/proc/1/cgroup'):
            return 'docker'

        # Check if running natively on Apple Silicon with the GGML_METAL variable set
        if (platform.system() == 'Darwin' and
            platform.machine() == 'arm64' and
            os.environ.get('GGML_METAL') == '1'):
            return 'native_metal'

        return 'docker'  # Default fallback

    def _get_performance_preset(self) -> Dict[str, Any]:
        """Get optimized settings based on detected environment."""
        presets = {
            'native_metal': {
                'n_threads': 12,
                'n_ctx': 4096,
                'n_gpu_layers': -1,  # Offload all layers to the Metal GPU
                'n_batch': 128,
                'use_metal': True
            },
            'docker': {
                'n_threads': 8,
                'n_ctx': 2048,
                'n_gpu_layers': 0,   # CPU only
                'n_batch': 64,
                'use_metal': False
            }
        }

        return presets.get(self.performance_mode, presets['docker'])

Key Innovations:

  • Automatic Environment Detection: Distinguishes between Docker and native execution
  • Performance Presets: Optimized configurations for each environment
  • Resource Management: Intelligent GPU/CPU allocation based on available hardware
  • Metal GPU Support: Full utilization of Apple Silicon capabilities

Complete model loader implementation: src/model_loader.py
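
To show how those presets translate into an actual model load, here is a hedged sketch using llama-cpp-python. The parameter names are real Llama constructor arguments, but the exact wiring is my assumption; the full loader lives in src/model_loader.py.

from llama_cpp import Llama

preset = {'n_threads': 12, 'n_ctx': 4096, 'n_gpu_layers': -1, 'n_batch': 128}  # native_metal preset

llm = Llama(
    model_path="models/jais-family-30b-16k-chat.i1-Q4_K_M.gguf",
    n_ctx=preset['n_ctx'],
    n_threads=preset['n_threads'],
    n_gpu_layers=preset['n_gpu_layers'],  # -1 offloads every layer to the Metal GPU
    n_batch=preset['n_batch'],
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "مرحبا، عرّف نفسك بجملة واحدة."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])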

Comprehensive Testing Framework

The testing framework provides automated performance benchmarking across deployment modes:

# Automated test execution
./test.sh performance  # Performance benchmarking
./test.sh full         # Complete functional testing
./test.sh quick        # Essential functionality tests

The test suite automatically detects running services and performs comprehensive evaluation with detailed metrics collection for tokens per second, response times, and system resource usage.

Complete test suite: test.sh
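
If you want to reproduce the tokens-per-second numbers without the shell scripts, a rough Python loop like the following works against the running server. The port and the token-count field name are assumptions; the script falls back to a whitespace estimate when no count is returned.

import time
import requests

PROMPTS = ["مرحبا، كيف حالك؟", "Explain quantization in one short paragraph."]

for prompt in PROMPTS:
    start = time.time()
    data = requests.post(
        "http://localhost:8080/chat",   # assumed port
        json={"message": prompt, "max_tokens": 100},
        timeout=600,
    ).json()
    elapsed = time.time() - start
    tokens = data.get("tokens_generated") or len(str(data.get("response", "")).split())
    print(f"{prompt[:25]!r}: ~{tokens / elapsed:.2f} tok/s ({elapsed:.1f}s total)")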

Performance Test Results and Analysis

Comprehensive benchmarking was conducted comparing Docker containerization versus native Metal GPU acceleration:

Test Environment

  • Hardware: Apple M4 Max
  • Model: JAIS 30B (Q4_K_M quantized, 25.97 GiB)
  • Tests: 5 different scenarios across languages and complexity levels

Performance Comparison Results

Test Scenario         | Docker (tok/s) | Native Metal (tok/s) | Speedup | Performance Gain
----------------------|----------------|----------------------|---------|-----------------
Arabic Greeting       | 3.53           | 12.58                | 3.56x   | +256%
Creative Writing      | 3.93           | 13.06                | 3.32x   | +232%
Technical Explanation | 4.08           | 12.98                | 3.18x   | +218%
Simple Greeting       | 2.54           | 10.24                | 4.03x   | +303%
Arabic Question       | 4.44           | 13.24                | 2.98x   | +198%

Average Performance Summary:

  • Docker CPU-only: 3.70 tokens/second
  • Native Metal GPU: 12.42 tokens/second
  • Overall Improvement: +235% performance gain

Configuration Analysis

Aspect           | Docker Container | Native Metal
-----------------|------------------|---------------------------
GPU Acceleration | CPU-only         | Metal GPU (all 49 layers)
Threads          | 8                | 12
Context Window   | 2,048 tokens     | 4,096 tokens
Batch Size       | 64               | 128
Memory Usage     | 26.6 GB CPU      | 26.6 GB GPU + 0.3 GB CPU
Load Time        | ~5.2 seconds     | ~7.7 seconds

Testing Methodology

The testing approach followed controlled environment principles:

# Build and deploy Docker version
./build.sh docker --clean
./run.sh docker

# Run performance benchmarks
./test.sh performance

# Switch to native and repeat
docker stop jais-ai
./run.sh native
./test.sh performance

Test Design Principles:

  • Controlled Environment: Same hardware, same model, same prompts
  • Multiple Iterations: Each test repeated for consistency
  • Comprehensive Metrics: Token generation speed, total response time, memory usage
  • Language Diversity: Tests in both Arabic and English
  • Complexity Variation: From simple greetings to complex explanations

Key Findings and Recommendations

Performance Findings

  1. Native Metal provides 3.36x average speedup over Docker CPU-only
  2. Consistent performance gains across all test scenarios (2.98x – 4.03x)
  3. Metal GPU acceleration utilizes Apple Silicon effectively
  4. Docker offers portability with acceptable performance trade-offs

Deployment Recommendations

Use Native Metal When:

  • Maximum performance is critical
  • Interactive applications requiring low latency
  • Development and testing environments
  • Apple Silicon hardware available

Use Docker When:

  • Deploying to production servers
  • Cross-platform consistency required
  • Container orchestration needed
  • GPU resources unavailable

Technical Insights

  • Model Quantization: Q4_K_M provides optimal balance of speed/quality/size
  • Environment Detection: Automatic configuration prevents manual tuning
  • Resource Utilization: Full GPU offloading maximizes Apple Silicon capabilities
  • Production Readiness: Both deployments pass comprehensive functional tests

Repository and Resources

Complete Source Code: GitHub Repository

The repository includes:

  • Full Python implementation with detailed comments
  • Comprehensive test suite and benchmarking tools
  • Docker configuration and build scripts
  • Performance analysis reports and metrics
  • Deployment documentation and setup guides
  • Configuration presets for different environments

Quick Start

git clone https://github.com/sarmadjari/jais-ai-docker
cd jais-ai-docker
./scripts/model_download.sh  # Download the model
./run.sh                     # Interactive mode selection

Conclusion

This implementation demonstrates effective deployment of large language models with optimal performance characteristics. The combination of intelligent environment detection, automated performance optimization, and comprehensive testing provides a robust foundation for production AI deployments.

The 3.36x performance improvement achieved through Metal GPU acceleration showcases the importance of hardware-optimized deployments, while Docker containerization ensures portability and scalability for diverse production environments.

The complete solution serves as a practical reference for deploying bilingual AI models with production-grade performance monitoring and testing capabilities.

This is just a start; I will keep tuning the setup and, as time allows, updating the documentation.