Deploying JAIS AI: Docker vs Native Performance Analysis with Python Implementation

Building a high-performance Arabic-English AI deployment solution with benchmarking


The JAIS (Jebel Jais) AI model represents a breakthrough in bilingual Arabic-English language processing, developed by Inception AI, MBZUAI, and Cerebras Systems. This post details the implementation of a production-ready deployment solution with comprehensive performance analysis comparing Docker containerization versus native Metal GPU acceleration.

In this project, I used the quantized model published as mradermacher/jais-family-30b-16k-chat-i1-GGUF; mradermacher is a recognized quantization specialist in the community. This quantized version was chosen because:

  • iMatrix Quantization: Advanced i1-Q4_K_M provides superior quality vs static quantization. Research shows that weighted/imatrix quants offer significantly better model quality than classical static quants at the same quantization level
  • GGUF Format: Optimized for llama.cpp inference with Metal GPU acceleration
  • Balanced Performance: Q4_K_M offers the ideal speed/quality/size ratio (25.97 GiB)
  • Production Ready: Pre-quantized and extensively tested for deployment
  • Community Trusted: mradermacher is known for creating high-quality quantizations with automated processes and extensive testing
  • Superior Multilingual Performance: Studies indicate that English imatrix datasets show better results even for non-English inference, as most base models are primarily trained on English

Solution Architecture

The deployment solution consists of several key components designed for maximum flexibility and performance:

Project Structure

jais-ai-docker/
├── run.sh                      # Main server launcher
├── test.sh                     # Comprehensive test suite  
├── build.sh                    # Build system (Docker/Native)
├── cleanup.sh                  # Project cleanup utilities
├── Dockerfile                  # ARM64 optimized container
├── src/
│   ├── app.py                  # Flask API server
│   ├── model_loader.py         # GGUF model loader with auto-detection
│   └── requirements.txt        # Python dependencies
├── config/
│   └── performance_config.json # Performance presets
└── models/
    └── jais-family-30b-16k-chat.i1-Q4_K_M.gguf  # Quantized model

Python Implementation Overview

Flask API Server

The core server implements a robust Flask application with proper error handling and environment detection:

# Configuration with environment variable support
MODEL_PATH = os.environ.get("MODEL_PATH", "/app/models/jais-family-30b-16k-chat.i1-Q4_K_M.gguf")
CONFIG_PATH = os.environ.get("CONFIG_PATH", "/app/config/performance_config.json")

@app.route('/chat', methods=['POST'])
def chat():
    """Main chat endpoint with comprehensive error handling."""
    if not model_loaded:
        return jsonify({"error": "Model not loaded"}), 503
    
    try:
        data = request.json
        message = data.get('message', '')
        max_tokens = data.get('max_tokens', 100)
        
        # Generate response with timing
        start_time = time.time()
        response_data = jais_loader.generate_response(message, max_tokens=max_tokens)
        generation_time = time.time() - start_time
        
        # Add performance metrics
        response_data['generation_time_seconds'] = round(generation_time, 3)
        response_data['model_load_time_seconds'] = round(model_load_time, 3)
        
        return jsonify(response_data)
        
    except Exception as e:
        logger.error(f"Error in chat endpoint: {e}")
        return jsonify({"error": str(e)}), 500

Key Features:

  • Environment Variable Configuration: Flexible path configuration for different deployment modes
  • Performance Metrics: Built-in timing for load time and generation speed
  • Error Handling: Comprehensive exception handling with proper HTTP status codes
  • Health Checks: Monitoring endpoint for deployment orchestration
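
For example, the health endpoint can be sketched as follows (the response fields shown here are assumptions; model_loaded is the same global used in the chat handler above):

@app.route('/health', methods=['GET'])
def health():
    """Lightweight health check for monitoring and orchestration."""
    status_code = 200 if model_loaded else 503
    return jsonify({
        "status": "ok" if model_loaded else "loading",
        "model_loaded": model_loaded
    }), status_code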

Complete Flask implementation: src/app.py

Smart Model Loader

The model loader implements intelligent environment detection and optimal configuration:

class JaisModelLoader:
    """
    Optimized model loader for mradermacher Jais AI GGUF models with proper error handling
    and resource management.
    """
    
    def _detect_runtime_environment(self) -> str:
        """Auto-detect the runtime environment and return optimal performance mode."""
        # Check if running in Docker container
        if os.path.exists('/.dockerenv') or os.path.exists('/proc/1/cgroup'):
            return 'docker'
        
        # Check if running natively on macOS with GGML_METAL environment variable
        if (platform.system() == 'Darwin' and 
            platform.machine() == 'arm64' and 
            os.environ.get('GGML_METAL') == '1'):
            return 'native_metal'
        
        return 'docker'  # Default fallback

    def _get_performance_preset(self) -> Dict[str, Any]:
        """Get optimized settings based on detected environment."""
        presets = {
            'native_metal': {
                'n_threads': 12,
                'n_ctx': 4096,
                'n_gpu_layers': -1,  # All layers to GPU
                'n_batch': 128,
                'use_metal': True
            },
            'docker': {
                'n_threads': 8,
                'n_ctx': 2048,
                'n_gpu_layers': 0,   # CPU only
                'n_batch': 64,
                'use_metal': False
            }
        }
        
        return presets.get(self.performance_mode, presets['docker'])

Key Innovations:

  • Automatic Environment Detection: Distinguishes between Docker and native execution
  • Performance Presets: Optimized configurations for each environment
  • Resource Management: Intelligent GPU/CPU allocation based on available hardware
  • Metal GPU Support: Full utilization of Apple Silicon capabilities
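
To show how these presets feed into inference, here is a minimal sketch of loading the GGUF model with llama-cpp-python. The constructor arguments are standard llama_cpp.Llama parameters; the method and the self.model_path attribute are simplified stand-ins for the full loader in src/model_loader.py:

from llama_cpp import Llama

class JaisModelLoader:
    # ... continued from the methods above ...

    def load_model(self) -> Llama:
        """Instantiate the GGUF model with the preset for the detected environment."""
        preset = self._get_performance_preset()
        return Llama(
            model_path=self.model_path,           # path to the .gguf file (attribute name assumed)
            n_ctx=preset['n_ctx'],
            n_threads=preset['n_threads'],
            n_gpu_layers=preset['n_gpu_layers'],  # -1 offloads all layers to Metal; 0 keeps inference on CPU
            n_batch=preset['n_batch'],
            verbose=False
        )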

Complete model loader implementation: src/model_loader.py

Comprehensive Testing Framework

The testing framework provides automated performance benchmarking across deployment modes:

# Automated test execution
./test.sh performance  # Performance benchmarking
./test.sh full         # Complete functional testing
./test.sh quick        # Essential functionality tests

The test suite automatically detects running services and performs comprehensive evaluation with detailed metrics collection for tokens per second, response times, and system resource usage.
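
As a rough illustration of what one benchmark measurement looks like, a Python client could query the /chat endpoint and compute tokens per second from the server's own timing (the port and the tokens_generated field name are assumptions; generation_time_seconds comes from the Flask handler shown earlier):

import time
import requests

def benchmark_prompt(prompt: str, max_tokens: int = 100,
                     url: str = "http://localhost:5000/chat") -> dict:
    """Send one prompt to the running server and derive tokens per second."""
    start = time.time()
    resp = requests.post(url, json={"message": prompt, "max_tokens": max_tokens}, timeout=300)
    resp.raise_for_status()
    data = resp.json()
    total_time = time.time() - start
    generation_time = data.get("generation_time_seconds", total_time)
    tokens = data.get("tokens_generated", max_tokens)  # field name is an assumption
    return {
        "tokens_per_second": round(tokens / generation_time, 2),
        "total_time_seconds": round(total_time, 3)
    }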

Complete test suite: test.sh

Performance Test Results and Analysis

Comprehensive benchmarking was conducted comparing Docker containerization versus native Metal GPU acceleration:

Test Environment

  • Hardware: Apple M4 Max
  • Model: JAIS 30B (Q4_K_M quantized, 25.97 GiB)
  • Tests: 5 different scenarios across languages and complexity levels

Performance Comparison Results

Test Scenario           Docker (tok/s)   Native Metal (tok/s)   Speedup   Performance Gain
Arabic Greeting         3.53             12.58                  3.56x     +256%
Creative Writing        3.93             13.06                  3.32x     +232%
Technical Explanation   4.08             12.98                  3.18x     +218%
Simple Greeting         2.54             10.24                  4.03x     +303%
Arabic Question         4.44             13.24                  2.98x     +198%

Average Performance Summary:

  • Docker CPU-only: 3.70 tokens/second
  • Native Metal GPU: 12.42 tokens/second
  • Overall Improvement: +235% performance gain
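
These summary figures follow directly from the per-scenario results above; a quick sanity check in Python:

docker_tps = [3.53, 3.93, 4.08, 2.54, 4.44]       # Docker CPU-only results (tok/s)
native_tps = [12.58, 13.06, 12.98, 10.24, 13.24]  # Native Metal results (tok/s)

avg_docker = round(sum(docker_tps) / len(docker_tps), 2)  # 3.70 tok/s
avg_native = round(sum(native_tps) / len(native_tps), 2)  # 12.42 tok/s
speedup = avg_native / avg_docker                         # ≈ 3.36x, i.e. roughly a +235% gain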

Configuration Analysis

Aspect             Docker Container   Native Metal
GPU Acceleration   CPU-only           Metal GPU (all 49 layers)
Threads            8                  12
Context Window     2,048 tokens       4,096 tokens
Batch Size         64                 128
Memory Usage       26.6 GB CPU        26.6 GB GPU + 0.3 GB CPU
Load Time          ~5.2 seconds       ~7.7 seconds

Testing Methodology

The testing approach followed controlled environment principles:

# Build and deploy Docker version
./build.sh docker --clean
./run.sh docker

# Run performance benchmarks
./test.sh performance

# Switch to native and repeat
docker stop jais-ai
./run.sh native
./test.sh performance

Test Design Principles:

  • Controlled Environment: Same hardware, same model, same prompts
  • Multiple Iterations: Each test repeated for consistency
  • Comprehensive Metrics: Token generation speed, total response time, memory usage
  • Language Diversity: Tests in both Arabic and English
  • Complexity Variation: From simple greetings to complex explanations

Key Findings and Recommendations

Performance Findings

  1. Native Metal provides 3.36x average speedup over Docker CPU-only
  2. Consistent performance gains across all test scenarios (2.98x – 4.03x)
  3. Metal GPU acceleration utilizes Apple Silicon effectively
  4. Docker offers portability with acceptable performance trade-offs

Deployment Recommendations

Use Native Metal When:

  • Maximum performance is critical
  • Interactive applications requiring low latency
  • Development and testing environments
  • Apple Silicon hardware available

Use Docker When:

  • Deploying to production servers
  • Cross-platform consistency required
  • Container orchestration needed
  • GPU resources unavailable

Technical Insights

  • Model Quantization: Q4_K_M provides optimal balance of speed/quality/size
  • Environment Detection: Automatic configuration prevents manual tuning
  • Resource Utilization: Full GPU offloading maximizes Apple Silicon capabilities
  • Production Readiness: Both deployments pass comprehensive functional tests

Repository and Resources

Complete Source Code: GitHub Repository

The repository includes:

  • Full Python implementation with detailed comments
  • Comprehensive test suite and benchmarking tools
  • Docker configuration and build scripts
  • Performance analysis reports and metrics
  • Deployment documentation and setup guides
  • Configuration presets for different environments

Quick Start

git clone https://github.com/sarmadjari/jais-ai-docker
cd jais-ai-docker
./scripts/model_download.sh  # Download the model
./run.sh                     # Interactive mode selection

Conclusion

This implementation demonstrates effective deployment of large language models with optimal performance characteristics. The combination of intelligent environment detection, automated performance optimization, and comprehensive testing provides a robust foundation for production AI deployments.

The 3.36x performance improvement achieved through Metal GPU acceleration showcases the importance of hardware-optimized deployments, while Docker containerization ensures portability and scalability for diverse production environments.

The complete solution serves as a practical reference for deploying bilingual AI models with production-grade performance monitoring and testing capabilities.

This is just a start; I will keep tuning and hopefully update the documentation as I find time in the future.

Creating a Clean Python Development Environment using Docker and Visual Studio Code

Python

Python is a high-level, dynamically-typed programming language that has taken the software development industry by storm. It’s known for its simplicity, readability, and vast library ecosystem. Python has become the language of choice for many in web development, data science, artificial intelligence, scientific computing, and more. Its versatile nature makes it ideal for both beginners and experienced developers.

Docker

Docker is a revolutionary tool that allows developers to create, deploy, and run applications in containers. Containers can be thought of as lightweight, stand-alone packages that contain everything needed to run an application, including the code, runtime, libraries, and system tools. Docker ensures that an application runs consistently across different environments, eliminating the infamous “it works on my machine” problem. It simplifies the process of setting up, distributing, and scaling applications, making it an invaluable tool for modern development.

Visual Studio Code

Visual Studio Code (VS Code) is a powerful, open-source code editor developed by Microsoft. It provides a lightweight yet feature-rich environment that supports a multitude of programming languages, including Python. With a vast ecosystem of extensions, integrated Git support, debugging capabilities, and an intuitive interface, VS Code has quickly become the editor of choice for many developers around the world.

Why Combine Python, Docker, and Visual Studio Code?

You might be wondering why one would want to combine Python, Docker, and Visual Studio Code. The answer lies in the fusion of simplicity, consistency, and efficiency. By using Docker, you can ensure that your Python application runs the same way, irrespective of where it’s deployed. This means no more headaches about dependency issues or system incompatibilities. On the other hand, VS Code provides a seamless development experience, with features that play nicely with both Python and Docker. Combining these three tools gives you a streamlined, consistent, and efficient development workflow.

Steps to Set Up Your Dev Environment:

  1. Install Prerequisites:
    • Install Docker and ensure it’s running.
    • Download and install Visual Studio Code.
    • Install the ‘Python’ and ‘Docker’ extensions from the Visual Studio Code marketplace.
  2. Set Up Docker:
    • Create a new directory for your project.
    • Inside this directory, create a file named Dockerfile.
    • In the Dockerfile, start with the following content:
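
      For example, a minimal Dockerfile along these lines will do (the base image tag is an assumption, and /usr/src/app matches the mount point used in the docker run command below):

      FROM python:3.11-slim
      WORKDIR /usr/src/app
      COPY requirements.txt .
      RUN pip install --no-cache-dir -r requirements.txt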

    • Create a requirements.txt file in the same directory, listing any Python libraries your project depends on. For example:

      numpy
      pandas

      or you can specify the library version:

      tensorflow==2.3.1
      uvicorn==0.12.2
      fastapi==0.63.0

  3. Build the Docker Container Image:
    • In VS Code, open the folder containing your Dockerfile and other project files.
    • Use the Docker extension to build your Docker image by right-clicking the Dockerfile and selecting ‘Build Image’, or run the command:
      docker build -t mypythonenv .



    • Run the container, mounting your working directory (the folder containing your Python code) into the container:
      docker run -it --rm -v C:\Users\Sarmad\Projects\MyPythonProject:/usr/src/app mypythonenv



  4. Attach the Running Docker Container:
    • To attach the running Python container to Visual Studio Code so you can run and debug your code, click the Docker icon, right-click the running container (called “mypythonenv” in our example), and choose to attach it to Visual Studio Code.


    • Visual Studio Code now has access to the Python environment running inside the Docker container, and the container has access to the Python code files that were mounted in the docker run command.


  5. Run the Python Code:
    • To run our “hello-world.py” code, click the Run and Debug icon, then the blue “Run and Debug” button, and select “Python File”.


    • The Python code will now be running inside your container.


  6. Clean Up & Share:
    • Once done with development, you can push your Docker image to a registry (like Docker Hub) or your own private registry for sharing or deployment.

By following these steps, you’ll have a Python development environment that’s clean, consistent, and easy to use.

Happy coding!