More organizations are looking at running GenAI models on their own infrastructure, whether to meet data residency requirements, reduce cloud spend, or simply maintain control over sensitive workloads. This article walks through how to do that on Windows Server 2025 using Microsoft Foundry Local: what it is, how it’s structured, and how to get your first model running.
Foundry Local is supported on Windows Server 2025. At install time it automatically selects the right model variant for your hardware: CUDA for NVIDIA GPUs, an NPU variant for Qualcomm, and a CPU fallback when no accelerator is present. Supported acceleration hardware includes NVIDIA GPUs (2000 series or newer), AMD GPUs (6000 series or newer), AMD NPUs, Qualcomm NPUs, and Intel iGPUs.
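The selection logic can be pictured roughly like this. This is an illustrative sketch only: the function name is ours, and the real detection also covers AMD and Intel hardware.

```python
# Simplified sketch of Foundry Local's documented preference order:
# GPU accelerator first, NPU next, CPU as the universal fallback.
# (Not Foundry internals; real detection covers more hardware.)
def select_variant(has_nvidia_gpu: bool, has_qualcomm_npu: bool) -> str:
    if has_nvidia_gpu:
        return "cuda"         # NVIDIA GPU -> CUDA model variant
    if has_qualcomm_npu:
        return "npu"          # Qualcomm NPU variant
    return "generic-cpu"      # no accelerator -> CPU fallback

print(select_variant(False, False))  # -> generic-cpu
```

On a GPU-less VM, as in the walkthrough later in this article, the fallback branch is what you will see in practice.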
Windows Server 2025 as a Local AI Platform
Windows Server 2025 introduced several capabilities that make it a legitimate platform for AI workloads: GPU partitioning (GPU-P), Discrete Device Assignment (DDA) for passing physical GPUs directly into VMs, and Hyper-V scaling up to 2,048 vCPUs per Gen 2 VM. These aren’t marginal improvements; they matter when you’re trying to run inference workloads on shared infrastructure without rebuilding your virtualization stack.
That said, Windows Server handles the OS and virtualization layer. For actual GenAI inference, you need an inference engine on top of that, which is what Foundry Local provides.
How Foundry Local Works
Foundry Local is built around three layered components.
ONNX Runtime is the inference engine underneath. It’s a high-performance runtime that supports deep neural networks, traditional ML models, and generative AI. Its key advantage is hardware abstraction: it integrates with TensorRT on NVIDIA, OpenVINO on Intel, and DirectML on Windows, so the same deployment works across different accelerator configurations without hardware-specific code.
Model Cache stores downloaded models locally so they’re available for inference immediately. You manage it through the Foundry CLI or the REST API. The cache location is configurable, which matters on servers where the OS drive has limited space.
Foundry Local Service sits on top of both. It exposes an OpenAI-compatible REST server, so any tool or SDK that works with OpenAI endpoints will work here with minimal changes. The endpoint is dynamically allocated when the service starts; find it with foundry service status.
Getting Started
Foundry Local isn’t installed by default, but winget makes it straightforward. Run the following in PowerShell or Windows Terminal:
# Install Foundry Local
winget install Microsoft.FoundryLocal
# Upgrade to a newer version when available
winget upgrade --id Microsoft.FoundryLocal
# Start the service
foundry service start
# Check status and find the active endpoint
foundry service status
# List available models from the Foundry catalog
foundry model list
The first time you run foundry model list, it downloads execution providers for your hardware. You’ll see a progress bar – this only happens once.

foundry model list — first-run download of hardware execution providers
Once the catalog is loaded, pull down a model and run it:
foundry model download phi-4-mini
foundry model run phi-4-mini

Downloading and running phi-4-mini
On this machine, an Azure VM without a GPU, Foundry selected the generic-cpu variant automatically. Inference runs directly on the CPU, which is fine for evaluation. Phi-4-mini is useful for verifying that the service works end-to-end, though it has a high hallucination rate and isn’t suitable for production use cases where accuracy matters.
Once the model is loaded, you get an interactive prompt for direct testing and a live REST endpoint for your applications.

Interactive mode and REST endpoint ready for use
The REST interface follows the OpenAI API convention. Key things to know:
- Endpoint: It changes each time the service starts. Find it with foundry service status or the /openai/status endpoint; don’t hardcode it.
- Usage: Send standard HTTP requests to run models and retrieve results. Any OpenAI-compatible SDK works out of the box.
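As a concrete sketch, here is what an OpenAI-style chat completion request against the local endpoint looks like, using only the Python standard library. The base URL and port below are placeholders (look up the real endpoint with foundry service status each time the service starts), and the helper function name is ours.

```python
import json
import urllib.request

# Placeholder: the real port is allocated dynamically at service start.
BASE_URL = "http://localhost:5273/v1"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a standard OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("phi-4-mini", "Say hello")
print(req.full_url)

# Sending it requires the Foundry Local service to be running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the wire format is the standard OpenAI one, swapping in the official OpenAI SDK with base_url pointed at the local endpoint works the same way.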
The Foundry team has also published a browser-based WebUI for managing models without the CLI: FoundryWebUI on GitHub. It’s IIS-compatible and a good option if you prefer a visual interface.
Managing the Model Cache
A few commands worth knowing for day-to-day model management:
# List models currently in cache
foundry cache list
# Remove a specific model
foundry cache remove <model-name>
# Change the cache directory
foundry cache cd <path>
Model Lifecycle
Models move through five stages in Foundry Local:
Download: Pulls the model from the Foundry catalog to local disk. One-time operation per model version.
Load: Moves the model into memory for inference. A TTL (time-to-live) controls how long it stays loaded, default is 600 seconds.
Run: Executes inference for incoming requests. This is where CPU or GPU resources are consumed.
Unload: Removes the model from memory when the TTL expires. It remains on disk and reloads on demand.
Delete: Removes the model from the local cache entirely to reclaim disk space.
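The load/unload portion of that lifecycle can be sketched as a small state machine. This is a toy model, not Foundry Local internals: the class and method names are ours, and we shrink the TTL from the 600-second default to a fraction of a second so the demo finishes quickly.

```python
import time

# Toy sketch of TTL-based load/unload (not Foundry internals).
class ModelSlot:
    def __init__(self, ttl_seconds: float = 600.0):
        self.ttl = ttl_seconds      # Foundry Local's default TTL is 600 s
        self.loaded = False
        self.last_used = 0.0

    def run(self, prompt: str) -> str:
        if not self.loaded:         # Load: reload on demand from disk
            self.loaded = True
        self.last_used = time.monotonic()
        return f"inference({prompt})"   # Run: consumes CPU/GPU here

    def tick(self) -> None:
        # Unload: evict from memory once the TTL has expired;
        # the model stays on disk and reloads on the next run().
        if self.loaded and time.monotonic() - self.last_used > self.ttl:
            self.loaded = False

slot = ModelSlot(ttl_seconds=0.1)
slot.run("hello")
assert slot.loaded          # in memory right after a request
time.sleep(0.2)
slot.tick()
assert not slot.loaded      # evicted after the TTL, still on disk
```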
Scenarios for On-Premises AI
Running AI inference on-premises makes sense for several concrete reasons, even for organizations already invested in cloud AI:
- Data residency. Finance, healthcare, and government organizations often operate under regulations that require sensitive data to stay within specific borders or facilities. Running inference on-premises means that data, including the payloads sent to the model, never leaves the datacenter.
- Low latency. For real-time applications like factory automation, edge equipment, or high-frequency systems, the round-trip to a cloud endpoint is often unacceptable. Local inference eliminates that delay.
- Disconnected environments. Ships, remote industrial sites, and air-gapped facilities can’t depend on cloud connectivity. Once models are cached locally, Foundry Local runs with no external dependencies.
- Control and auditability. Some organizations require full ownership of the infrastructure and software stack, particularly when working with proprietary or fine-tuned models they’re unwilling to process outside their own environment.
Limitations: What Foundry Local Is Not
It’s worth being direct: Foundry Local is designed for single-user or developer scenarios. It processes inference requests sequentially, one at a time, which creates a hard ceiling on concurrent load.
The root cause is the absence of continuous batching. Without it, every request is treated as an isolated operation regardless of how many arrive simultaneously. GPU utilization stays low, queue depth grows linearly with concurrent users, and latency for anyone waiting in the queue is entirely dependent on when the previous request finishes.
Under increasing load, this shows up in two ways:
- Throughput drops as requests pile up while processing remains strictly sequential.
- Latency grows rapidly, making the service feel slow to every user beyond the first.
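A back-of-the-envelope model makes the linear growth concrete. With strictly sequential processing and no batching, the k-th concurrent request waits for the k-1 requests ahead of it; the numbers below are illustrative, not measurements.

```python
# With no continuous batching, requests complete one at a time,
# so the k-th concurrent request finishes after k service times.
def queue_latencies(service_time_s: float, concurrent_requests: int) -> list[float]:
    """Completion time of each request under strictly sequential processing."""
    return [service_time_s * k for k in range(1, concurrent_requests + 1)]

# Example: 10 users, 2 s of inference each -> the last user waits 20 s.
lat = queue_latencies(2.0, 10)
print(lat[0], lat[-1])  # -> 2.0 20.0
```

A continuous-batching server amortizes that queue across the accelerator instead, which is exactly the gap the alternatives below are meant to fill.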
Microsoft doesn’t position Foundry Local as a multi-user inference server, and it isn’t one. For prototyping, model evaluation, and single-user integrations it works well. For anything serving multiple users or applications at scale, you’ll need a different solution.
Alternatives for High-Throughput On-Premises Workloads
If the requirement is AI at scale with everything staying on-premises, there are two viable paths:
- Dedicated AI platforms such as Red Hat OpenShift AI provide a managed, scalable environment for deploying ML models on-premises. They handle GPU virtualization, resource scheduling, and model lifecycle management at an enterprise level.
- Custom inference services built on vLLM. vLLM has become the standard framework for high-throughput LLM inference. Its PagedAttention mechanism significantly improves GPU memory utilization and handles concurrent requests far more efficiently than standard runtimes, making it practical to build a scalable self-hosted inference service. The operational overhead is real, but so is the performance headroom.
Foundry Local is the right starting point for evaluating models and building on Windows Server. When you outgrow it, these are the natural next steps.
from StarWind Blog https://ift.tt/5Z3HEgt via IFTTT