
Azure AI Confidential Inferencing: Technical Deep-Dive

Generative AI powered by Large Language Models (LLMs) has revolutionized the way we interact with technology. Through chatbots, co-pilots, and agents, AI is amplifying human productivity across sectors such as healthcare, finance, government, and cybersecurity. Microsoft’s AI platform has been at the forefront of this revolution, supporting state-of-the-art AI models and enabling organizations to differentiate their business by giving developers the means to deploy AI applications at scale.

 

At Microsoft, we recognize the trust that consumers and enterprises place in our cloud platform as they integrate our AI services into their workflows. We believe all use of AI must be grounded in the principles of responsible AI – fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. Microsoft’s commitment to these principles is reflected in Azure AI’s strict data security and privacy policy, and in the suite of responsible AI tools supported in Azure AI, such as fairness assessments and tools for improving the interpretability of models. Whether you’re using Microsoft 365 Copilot, a Copilot+ PC, or building your own copilot, you can trust that Microsoft’s responsible AI principles extend to your data as part of your AI transformation. For example, your data is never shared with other customers or used to train our foundational models.

 

Confidential computing is a set of hardware-based technologies that help protect data throughout its lifecycle, including when data is in use. This complements existing methods to protect data at rest on disk and in transit on the network. Confidential computing uses hardware-based Trusted Execution Environments (TEEs) to isolate workloads that process customer data from all other software running on the system, including other tenants’ workloads and even our own infrastructure and administrators. Crucially, thanks to remote attestation, users of services hosted in TEEs can verify that their data is only processed for the intended purpose.

 

We foresee that all cloud computing will eventually be confidential. Our vision is to transform the Azure cloud into the Azure confidential cloud, empowering customers to achieve the highest levels of privacy and security for all their workloads. Over the last decade, we have worked closely with hardware partners such as Intel, AMD, Arm and NVIDIA to integrate confidential computing into all modern hardware including CPUs and GPUs. We have taken a full stack approach across infrastructure, containers, and services. We have the most comprehensive IaaS, PaaS and developer offerings including Confidential VMs, Confidential Containers on ACI and AKS, Microsoft Azure Attestation and Azure Key Vault managed HSMs, Azure Confidential Ledger, and SQL Server Always Encrypted.

 

Our approach is rooted in hardware-based TEEs, in industry standards such as EAT, SCITT and TDISP which we have helped define, and in open source hardware (e.g., the Caliptra root of trust) and software (e.g. the OpenHCL paravisor for confidential VMs). In fact, as part of our Secure Future Initiative, we have committed to protect Azure’s own infrastructure and services using confidential computing.

 

Azure AI Confidential Inferencing

The Azure OpenAI Service team just announced the upcoming preview of confidential inferencing, our first step towards confidential AI as a service (you can sign up for the preview here). While it is already possible to build an inference service with Confidential GPU VMs (which are moving to general availability alongside this announcement), most application developers prefer to use model-as-a-service APIs for their convenience, scalability, and cost efficiency. Our goal with confidential inferencing is to provide those benefits with the following additional security and privacy goals:

  • End-to-end prompt protection. Clients submit encrypted prompts that can only be decrypted within inferencing TEEs (spanning both CPU and GPU), where they are protected from unauthorized access or tampering even by Microsoft. All intermediate services (frontends, load balancers, etc.) only see the encrypted prompts and completions.
  • Stateless processing. User prompts are used only for inferencing within TEEs. The prompts and completions are not stored, logged, or used for any other purpose such as debugging or training.
  • User anonymity. Users can remain anonymous while interacting with AI models.
  • Remote verifiability. Users can independently and cryptographically verify our privacy claims using evidence rooted in hardware.
  • Transparency. All artifacts that govern or have access to prompts and completions are recorded on a tamper-proof, verifiable transparency ledger. External auditors can review any version of these artifacts and report any vulnerability to our Microsoft Bug Bounty program.

These goals are a significant leap forward for the industry: they provide verifiable technical evidence that data is only processed for the intended purposes (on top of the legal protection our data privacy policies already provide), greatly reducing the need for users to trust our infrastructure and operators. The hardware isolation of TEEs also makes it harder for hackers to steal data even if they compromise our infrastructure or admin accounts. Lastly, since our technical evidence is universally verifiable, developers can build AI applications that provide the same privacy guarantees to their users. Throughout the rest of this blog, we explain how Microsoft plans to implement and operationalize these confidential inferencing requirements.

 

How Confidential Inferencing Works

Architecture of Azure AI confidential inferencing

 

Architecture

Confidential inferencing provides end-to-end verifiable protection of prompts using the following building blocks:

  • Inference runs in Azure Confidential GPU VMs created with an integrity-protected disk image, which includes a container runtime to load the various containers required for inference.
  • The node agent in the VM enforces a policy over deployments that verifies the integrity and transparency of containers launched in the TEE.
  • An immutable, append-only transparency ledger records the container hashes and policies that have been deployed to the service, with additional auditing information when available such as pointers to container registries, SBOMs, sources, CI/CD logs, etc.
  • Oblivious HTTP (OHTTP) is used to encrypt the prompt from the client to the TEE, ensuring our untrusted services between the client and the TEE (TLS termination, load balancing, DoS protection, authentication, billing) only see encrypted prompts and completions.
  • A confidential and transparent key management service (KMS) generates and periodically rotates OHTTP keys. It releases private keys to confidential GPU VMs after verifying that they meet the transparent key release policy for confidential inferencing. Clients get the current set of OHTTP public keys and verify associated evidence that keys are managed by the trustworthy KMS before sending the encrypted request.
  • The client application may optionally use an OHTTP proxy outside of Azure to provide stronger unlinkability between clients and inference requests.

 

Attested Oblivious HTTP

The simplest way to achieve end-to-end confidentiality is for the client to encrypt each prompt with a public key that has been generated and attested by the inference TEE. Usually, this can be achieved by creating a direct transport layer security (TLS) session from the client to an inference TEE. But there are several operational constraints that make this impractical for large scale AI services. For example, efficiency and elasticity require smart layer 7 load balancing, with TLS sessions terminating in the load balancer. Therefore, we opted to use application-level encryption to protect the prompt as it travels through untrusted frontend and load balancing layers.

 

Oblivious HTTP (OHTTP, RFC 9458) is a standard protocol that achieves this goal: a client serializes and seals the real inference request (including the prompt) with HPKE (RFC 9180), a standard message sealing algorithm that uses Diffie-Hellman public shares to represent the recipient’s identity, and sends it as an encapsulated request (visible to the untrusted TLS terminator, load balancer, ingress controllers, etc.). Even though all clients use the same public key, each HPKE sealing operation generates a fresh client share, so requests are encrypted independently of each other. Requests can be served by any of the TEEs that have been granted access to the corresponding private key.
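
To make the sealing step concrete, here is a minimal sketch of the idea behind HPKE base mode, written against the Python cryptography package. It combines a fresh ephemeral X25519 share with HKDF and AES-GCM; it deliberately omits the full RFC 9180 key schedule and the OHTTP message framing, so it illustrates the mechanism rather than the exact wire format used by the service.

```python
# Simplified illustration of HPKE-style sealing (NOT the full RFC 9180 key schedule).
# Each request generates a fresh ephemeral key pair, so ciphertexts are independent
# even though every client encrypts to the same recipient public key.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey, X25519PublicKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def seal(recipient_public: X25519PublicKey, plaintext: bytes, aad: bytes = b"") -> tuple[bytes, bytes]:
    """Encrypt `plaintext` to the recipient; returns (encapsulated_share, ciphertext)."""
    eph = X25519PrivateKey.generate()                       # fresh client share per request
    shared = eph.exchange(recipient_public)                 # Diffie-Hellman shared secret
    key = HKDF(algorithm=hashes.SHA256(), length=32,
               salt=None, info=b"ohttp-demo").derive(shared)
    nonce = os.urandom(12)
    ct = AESGCM(key).encrypt(nonce, plaintext, aad)
    enc = eph.public_key().public_bytes_raw()               # sent alongside the ciphertext
    return enc, nonce + ct

def open_(recipient_private: X25519PrivateKey, enc: bytes, sealed: bytes, aad: bytes = b"") -> bytes:
    """Inverse of `seal`, run only inside a TEE that holds the private key."""
    shared = recipient_private.exchange(X25519PublicKey.from_public_bytes(enc))
    key = HKDF(algorithm=hashes.SHA256(), length=32,
               salt=None, info=b"ohttp-demo").derive(shared)
    nonce, ct = sealed[:12], sealed[12:]
    return AESGCM(key).decrypt(nonce, ct, aad)
```

Because seal generates a new ephemeral share on every call, two encryptions of the same prompt under the same recipient key are unlinkable ciphertexts, which is what lets untrusted frontends handle traffic without learning anything about it.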

 

To submit a confidential inferencing request, a client obtains the current HPKE public key from the KMS, along with hardware attestation evidence proving the key was securely generated and transparency evidence binding the key to the current secure key release policy of the inference service (which defines the required attestation attributes of a TEE to be granted access to the private key). Clients verify this evidence before sending their HPKE-sealed inference request with OHTTP.
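
The end-to-end client flow might then look roughly like the sketch below. The endpoint URLs, JSON field names, and attestation claim names are all assumptions made for illustration, not the actual Azure API surface, and a production client must verify the MAA token signature rather than merely inspecting claims.

```python
# Hypothetical client-side flow; reuses seal() from the previous sketch.
import jwt        # PyJWT, used here only to inspect claims
import requests
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PublicKey

KEY_URL = "https://example-inference.azure.com/ohttp-keys"   # hypothetical endpoint
SCORE_URL = "https://example-inference.azure.com/score"      # hypothetical endpoint
EXPECTED_KMS_MEASUREMENT = "<reference measurement published on the transparency ledger>"

def fetch_and_verify_key() -> dict:
    bundle = requests.get(KEY_URL, timeout=10).json()
    # A real client must verify the token signature against MAA's signing keys;
    # we only inspect claims here to keep the sketch short.
    claims = jwt.decode(bundle["maa_token"], options={"verify_signature": False})
    if claims["kms_measurement"] != EXPECTED_KMS_MEASUREMENT:        # claim name is an assumption
        raise ValueError("KMS attestation does not match the expected measurement")
    if bundle["key_release_policy_hash"] != claims["policy_hash"]:   # field names are assumptions
        raise ValueError("public key is not bound to the expected key release policy")
    return bundle

def submit_prompt(prompt: bytes) -> bytes:
    bundle = fetch_and_verify_key()
    recipient = X25519PublicKey.from_public_bytes(bytes.fromhex(bundle["public_key"]))
    enc, sealed = seal(recipient, prompt)                    # seal() from the previous sketch
    body = bytes([bundle["key_id"]]) + enc + sealed          # simplified encapsulation
    resp = requests.post(SCORE_URL, data=body, timeout=60,
                         headers={"content-type": "message/ohttp-req"})
    return resp.content   # completion is still sealed; decrypted locally with the HPKE response keys
```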

 

Inbound requests are processed by Azure ML’s load balancers and routers, which authenticate and route them to one of the Confidential GPU VMs currently available to serve the request. Within the TEE, our OHTTP gateway decrypts the request before passing it to the main inference container. If the gateway sees a request encrypted with a key identifier it hasn't cached yet, it must obtain the private key from the KMS. To this end, it gets an attestation token from the Microsoft Azure Attestation (MAA) service and presents it to the KMS. If the attestation token meets the key release policy bound to the key, the gateway gets back the HPKE private key wrapped under the attested vTPM key. When the OHTTP gateway receives a completion from the inferencing containers, it encrypts the completion using a previously established HPKE context and sends the encrypted completion to the client, which can locally decrypt it.
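
On the service side, the key release step can be sketched as follows. The MAA and KMS URLs, request bodies, and the use of RSA-OAEP for key wrapping are illustrative assumptions; the point is that the private HPKE key only ever leaves the KMS wrapped under a key bound to an attested VM.

```python
# Sketch of the gateway-side key release flow (URLs and JSON fields are assumptions).
import requests
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

MAA_URL = "https://example.attest.azure.net/attest/SevSnpVm"   # hypothetical MAA endpoint
KMS_URL = "https://example-kms.azure.com/keys/release"         # hypothetical KMS endpoint

def fetch_private_hpke_key(key_id: str, wrapping_key: rsa.RSAPrivateKey,
                           hardware_evidence: dict) -> bytes:
    # 1. Exchange hardware evidence (SEV-SNP report + vTPM quote) for an MAA token.
    maa_token = requests.post(MAA_URL, json=hardware_evidence, timeout=10).json()["token"]
    # 2. Present the token to the KMS, which checks it against the key release policy.
    wrapped = requests.post(KMS_URL, json={"kid": key_id, "maa_token": maa_token},
                            timeout=10).json()["wrapped_key"]
    # 3. Unwrap with the VM-resident wrapping key (RSA-OAEP shown for illustration).
    return wrapping_key.decrypt(
        bytes.fromhex(wrapped),
        padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))
```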

 

Azure Confidential GPU VMs

Internal architecture of confidential GPU VMs with H100 Tensor Core GPUs

 

In confidential inferencing, all services that require access to prompts in cleartext are hosted in Azure Confidential GPU VMs. These VMs combine SEV-SNP capabilities in 4th Generation AMD EPYC processors and confidential computing primitives in NVIDIA H100 Tensor Core GPUs to create a unified Trusted Execution Environment (TEE) across the CPU and GPU. These VMs enable deployment of high-performance AI workloads while significantly reducing trust in Azure infrastructure and admins. In a Confidential GPU VM, all code and data (including keys, prompts, and completions) remain encrypted in CPU memory and while transferred between the CPU and GPU over the PCIe bus. The data is decrypted only within the CPU package and the on-package High-Bandwidth Memory (HBM) in the GPU, where it remains protected even from privileged access by hardware firewalls.

 

Azure Confidential GPU VMs support a two-phase attestation protocol. When a Confidential GPU VM starts, it boots into a layer of Microsoft-provided firmware known as the Hardware Compatibility Layer (see the recent OpenHCL blog for details). The HCL is measured by the Platform Security Processor (PSP), the hardware root of trust in the AMD EPYC processors. The measurement is included in SEV-SNP attestation reports signed by the PSP using a processor- and firmware-specific VCEK key. The HCL implements a virtual TPM (vTPM) and captures measurements of early boot components, including the initrd and the kernel, into the vTPM. These measurements are available in the vTPM attestation report, which can be presented alongside the SEV-SNP attestation report to attestation services such as MAA.

 

When the GPU driver within the VM is loaded, it establishes trust with the GPU using SPDM based attestation and key exchange. The driver obtains an attestation report from the GPU’s hardware root-of-trust containing measurements of GPU firmware, driver micro-code, and GPU configuration. This report is signed using a per-boot attestation key rooted in a unique per-device key provisioned by NVIDIA during manufacturing. After authenticating the report, the driver and the GPU utilize keys derived from the SPDM session to encrypt all subsequent code and data transfers between the driver and the GPU.

 

Applications within the VM can independently attest the assigned GPU using a local GPU verifier. The verifier validates the attestation reports, checks the measurements in the report against reference integrity measurements (RIMs) obtained from NVIDIA’s RIM and OCSP services, and enables the GPU for compute offload. When the VM is destroyed or shut down, all content in the VM’s memory is scrubbed. Similarly, all sensitive state in the GPU is scrubbed when the GPU is reset.

 

Hardened VM Images

Confidential inferencing will further reduce trust in service administrators by using a purpose-built, hardened VM image. In addition to the OS and GPU driver, the VM image contains a minimal set of components required to host inference, including a hardened container runtime to run containerized workloads. The root partition in the image is integrity-protected using dm-verity, which constructs a Merkle tree over all blocks in the root partition and stores the Merkle tree in a separate partition in the image. During boot, a PCR of the vTPM is extended with the root of this Merkle tree, which is later verified by the KMS before releasing the HPKE private key. All subsequent reads from the root partition are checked against the Merkle tree. This ensures that the entire contents of the root partition are attested and any attempt to tamper with the root partition is detected.
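
The following sketch shows the hash-tree construction that dm-verity relies on, reduced to its essentials (the real on-disk format, salt handling, and block-device integration differ):

```python
# Illustration of the dm-verity idea: a Merkle tree over fixed-size blocks of the
# root partition, whose root hash is extended into a vTPM PCR and checked by the KMS.
import hashlib

BLOCK_SIZE = 4096

def merkle_root(partition_image: bytes, fanout: int = 128) -> bytes:
    # Leaf layer: hash every 4 KiB block of the partition image.
    level = [hashlib.sha256(partition_image[i:i + BLOCK_SIZE]).digest()
             for i in range(0, len(partition_image), BLOCK_SIZE)]
    # Interior layers: hash groups of child digests until a single root remains.
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + fanout])).digest()
                 for i in range(0, len(level), fanout)]
    return level[0]

# Any single-bit change in any block changes the root, so tampering with the root
# partition is detected when a read is checked against the attested tree.
```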

 

Container Execution Policies

Much like many modern services, confidential inferencing deploys models and containerized workloads in VMs orchestrated using Kubernetes. However, this places a significant amount of trust in Kubernetes service administrators, the control plane including the API server, services such as Ingress, and cloud services such as load balancers.

 

Confidential inferencing reduces trust in these infrastructure services with a container execution policy that restricts control plane actions to a precisely defined set of deployment commands. In particular, this policy defines the set of container images that can be deployed in an instance of the endpoint, along with each container’s configuration (e.g. command, environment variables, mounts, privileges). The policy is measured into a PCR of the Confidential VM’s vTPM (which is matched in the key release policy on the KMS against the expected policy hash for the deployment) and enforced by a hardened container runtime hosted within each instance. The runtime monitors commands from the Kubernetes control plane and ensures that only commands consistent with the attested policy are permitted. This prevents entities outside the TEEs from injecting malicious code or configuration.
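
Conceptually, the enforcement looks like the sketch below. The policy format shown is invented for illustration; the production policy language and the hardened runtime that evaluates it are different, but the decision being made is the same: reject any control-plane command that deviates from the measured policy.

```python
# Conceptual sketch of container execution policy enforcement (policy format is
# invented for illustration, not the production policy language).
import hashlib
import json

POLICY = {
    "containers": [
        {
            "image_digest": "sha256:<pinned inference container digest>",
            "command": ["python", "serve.py"],
            "env_allowlist": ["MODEL_NAME", "PORT"],
            "privileged": False,
        }
    ]
}
# The hash of the policy is measured into a vTPM PCR, so the KMS key release policy
# can require exactly this policy before releasing the HPKE private key.
POLICY_HASH = hashlib.sha256(json.dumps(POLICY, sort_keys=True).encode()).hexdigest()

def is_allowed(create_request: dict) -> bool:
    """Return True only if a control-plane 'create container' command matches the policy."""
    for c in POLICY["containers"]:
        if (create_request["image_digest"] == c["image_digest"]
                and create_request["command"] == c["command"]
                and not create_request.get("privileged", False)
                and set(create_request.get("env", {})) <= set(c["env_allowlist"])):
            return True
    return False
```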

 

Stateless Processing

Confidential inferencing adheres to the principle of stateless processing. Our services are carefully designed to use prompts only for inferencing, return the completion to the user, and discard the prompts when inferencing is complete. The prompts (or any sensitive data derived from prompts) will not be available to any other entity outside authorized TEEs.

 

Confidential inferencing minimizes side-effects of inferencing by hosting containers in a sandboxed environment. For example, inferencing containers are deployed with limited privileges. All traffic to and from the inferencing containers is routed through the OHTTP gateway, which limits outbound communication to other attested services. We also mitigate side-effects on the filesystem by mounting it in read-only mode with dm-verity (though some of the models use non-persistent scratch space created as a RAM disk).

 

Some benign side-effects are essential for running a high-performance and reliable inferencing service. For example, our billing service requires knowledge of the size (but not the content) of the completions, health and liveness probes are required for reliability, and caching some state in the inferencing service (e.g. the attention KV cache) or in hardware (e.g. the L3 cache) is necessary for competitive performance. All such side effects are implemented in attested and transparent code and are subject to independent review. We are also actively conducting research to understand and effectively mitigate any security risks arising through these side-effects.

 

Confidential and Transparent Keying

Clients of confidential inferencing get the public HPKE keys to encrypt their inference request from a confidential and transparent key management service (KMS). The KMS ensures that private HPKE keys are securely generated, stored, periodically rotated, and released only to Azure Confidential GPU VMs hosting a transparent software stack.

 

The release of private HPKE keys is governed by key release policies. When a Confidential GPU VM requests a private HPKE key, it presents an attestation token issued by MAA that includes measurements of its TPM PCRs. The KMS validates this attestation token against the key release policy and wraps the private HPKE key with a wrapping key generated by, and only accessible to, the Confidential GPU VM. Key wrapping protects the private HPKE key in transit and ensures that only attested VMs that meet the key release policy can unwrap the private key.
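
A sketch of the KMS-side decision follows. The claim names, policy shape, and use of RSA-OAEP are assumptions made for illustration; what matters is that the key is released only for attestations matching the policy, and only in wrapped form.

```python
# Sketch of the KMS-side check: validate MAA token measurements against the key
# release policy, then wrap the private HPKE key with the requesting VM's wrapping key.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

KEY_RELEASE_POLICY = {
    "vm_image_pcr": "<expected dm-verity root of the hardened VM image>",
    "container_policy_pcr": "<expected hash of the container execution policy>",
}

def release_key(maa_claims: dict, vm_wrapping_key: rsa.RSAPublicKey,
                private_hpke_key: bytes) -> bytes:
    # Only attested VMs whose PCR measurements satisfy the policy get the key.
    if maa_claims.get("vm_image_pcr") != KEY_RELEASE_POLICY["vm_image_pcr"]:
        raise PermissionError("VM image does not match key release policy")
    if maa_claims.get("container_policy_pcr") != KEY_RELEASE_POLICY["container_policy_pcr"]:
        raise PermissionError("container policy does not match key release policy")
    # Wrap the key so it is protected in transit and only the attested VM can unwrap it.
    return vm_wrapping_key.encrypt(
        private_hpke_key,
        padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))
```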

 

The KMS permits service administrators to make changes to key release policies e.g., when the Trusted Computing Base (TCB) requires servicing. However, all changes to the key release policies will be recorded in a transparency ledger. External auditors will be able to obtain a copy of the ledger, independently verify the entire history of key release policies, and hold service administrators accountable. When clients request the current public key, the KMS also returns evidence (attestation and transparency receipts) that the key was generated within and managed by the KMS, for the current key release policy. Clients of the endpoint (e.g., the OHTTP proxy) can verify this evidence before using the key for encrypting prompts.

 

Using a confidential KMS allows us to support complex confidential inferencing services composed of multiple micro-services, and models that require multiple nodes for inferencing. For example, an audio transcription service may consist of two micro-services: a pre-processing service that converts raw audio into a format that improves model efficiency, and a model that transcribes the resulting stream. Most language models rely on the Azure AI Content Safety service, consisting of an ensemble of models, to filter harmful content from prompts and completions. Each of these services can obtain service-specific HPKE keys from the KMS after attestation, and use these keys for securing all inter-service communication.

 

User Anonymity

In addition to protecting prompts, confidential inferencing can protect the identity of individual users of the inference service by routing their requests through an OHTTP proxy outside of Azure, thus hiding their IP addresses from Azure AI. Enterprise users can set up their own OHTTP proxy to authenticate users and inject a tenant-level authentication token into the request. This allows confidential inferencing to authenticate requests and perform accounting tasks such as billing without learning the identity of individual users.
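
A minimal sketch of such a relay is shown below, using Flask purely for illustration; the header names, token handling, and target URL are assumptions rather than a prescribed integration.

```python
# Minimal sketch of an enterprise-operated OHTTP relay (hypothetical header names and URL).
# It forwards the already-encrypted request body to Azure, injects a tenant-level token
# for authentication and billing, and never sees plaintext prompts or client identities.
import requests
from flask import Flask, Response, request

app = Flask(__name__)
TARGET = "https://example-inference.azure.com/score"   # hypothetical gateway URL
TENANT_TOKEN = "<tenant-level bearer token>"           # placeholder

@app.post("/ohttp-relay")
def relay():
    upstream = requests.post(
        TARGET,
        data=request.get_data(),                       # sealed OHTTP request, opaque to the relay
        headers={"content-type": "message/ohttp-req",
                 "authorization": f"Bearer {TENANT_TOKEN}"},
        timeout=60)
    # The client's IP address is not forwarded, so Azure only sees the relay's address.
    return Response(upstream.content, status=upstream.status_code,
                    content_type="message/ohttp-res")
```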

 

Transparency

Confidential inferencing is hosted in Confidential VMs with a hardened and fully attested TCB. As with any other software service, this TCB evolves over time due to upgrades and bug fixes. Some of these fixes may need to be applied urgently, e.g., to address a zero-day vulnerability. It is impractical to wait for all users to review and approve every upgrade before it is deployed, especially for a SaaS service shared by many users.

 

Our solution to this problem is to allow updates to the service code at any point, as long as the update is made transparent first (as explained in our recent CACM article) by adding it to a tamper-proof, verifiable transparency ledger. This provides two critical properties: first, all users of the service are served the same code and policies, so we cannot target specific customers with bad code without being caught. Second, every version we deploy is auditable by any user or third party. Although we aim to provide source-level transparency as much as possible (using reproducible builds or attested build environments), this is not always possible (for instance, some OpenAI models use proprietary inference code). In such cases, we may have to fall back to properties of the attested sandbox (e.g. limited network and disk I/O) to prove the code doesn't leak data. All claims registered on the ledger will be digitally signed to ensure authenticity and accountability. Incorrect claims in records can always be attributed to specific entities at Microsoft.  
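
As a simplified illustration of what an auditor's check might look like, the sketch below verifies a Merkle inclusion proof for a registered claim against a ledger root hash; the actual ledger's receipt format and countersignatures are richer than this.

```python
# Sketch of checking a Merkle inclusion receipt against a ledger root hash.
import hashlib

def verify_inclusion(claim: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    """`proof` is a list of (side, sibling_digest) pairs from leaf to root."""
    node = hashlib.sha256(claim).digest()
    for side, sibling in proof:
        pair = sibling + node if side == "left" else node + sibling
        node = hashlib.sha256(pair).digest()
    # The claim is on the ledger iff the recomputed root matches the published root.
    return node == root
```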

 

When an instance of confidential inferencing requires access to a private HPKE key from the KMS, it will be required to produce receipts from the ledger proving that the VM image and the container policy have been registered. Therefore, when users verify public keys from the KMS, they are guaranteed that the KMS will only release private keys to instances whose TCB is registered with the transparency ledger.

 

Roadmap and Resources

Confidential inferencing is a reaffirmation of Microsoft’s commitment to the Secure Future Initiative and our Responsible AI principles. It brings together state-of-the-art AI models and Azure infrastructure with cutting-edge confidential computing in Azure Confidential GPU VMs, based on AMD SEV-SNP and NVIDIA H100 Tensor Core GPUs, to deliver end-to-end, independently verifiable privacy.

 

We will continue to work closely with our hardware partners to deliver the full capabilities of confidential computing. We will make confidential inferencing more open and transparent as we expand the technology to support a broader range of models and other scenarios such as confidential Retrieval-Augmented Generation (RAG), confidential fine-tuning, and confidential model pre-training.

 

 

 

 
