Google Cloud, NVIDIA launch A5X to slash inference costs

Posted: 23 April 2026, 14:33 CET

Google Cloud and NVIDIA introduced A5X bare-metal instances on Vera Rubin NVL72 racks, aiming to cut inference cost per token by up to 10× and raise token throughput per megawatt 10×.

Google Cloud and NVIDIA announced A5X bare-metal instances at the Google Cloud Next conference. The instances run on NVIDIA Vera Rubin NVL72 rack-scale systems and are designed to reduce AI inference cost per token by up to 10× while increasing token throughput per megawatt by 10×.

A5X pairs NVIDIA ConnectX-9 SuperNICs with Google Virgo networking to provide the bandwidth needed to link thousands of processors. The configuration scales to 80,000 Rubin GPUs in a single-site cluster and to 960,000 GPUs across a multisite deployment. That scale targets large-scale inference and agentic workloads that require tight synchronization across many parallel processors.
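For a rough sense of those cluster sizes, the figures can be converted into rack counts. The sketch below is a back-of-the-envelope calculation, not part of the announcement: it assumes each NVL72 rack holds 72 Rubin GPUs, and the function names are illustrative.

```python
# Back-of-the-envelope sizing from the announced GPU counts.
# Assumption (not stated in the announcement): 72 GPUs per NVL72 rack.
GPUS_PER_RACK = 72

SINGLE_SITE_GPUS = 80_000   # announced single-site maximum
MULTISITE_GPUS = 960_000    # announced multisite maximum


def racks_needed(gpus: int, per_rack: int = GPUS_PER_RACK) -> int:
    """Return the number of whole racks needed to host `gpus` GPUs."""
    return -(-gpus // per_rack)  # integer ceiling division


def site_equivalents(total_gpus: int, per_site_gpus: int) -> float:
    """Return how many maximum-size sites `total_gpus` corresponds to."""
    return total_gpus / per_site_gpus


if __name__ == "__main__":
    print("Single-site racks:", racks_needed(SINGLE_SITE_GPUS))
    print("Multisite racks:", racks_needed(MULTISITE_GPUS))
    print("Site equivalents:", site_equivalents(MULTISITE_GPUS, SINGLE_SITE_GPUS))
```

Under that assumption, a single-site cluster at the stated maximum would span roughly 1,100 racks, and the multisite figure equals twelve maximum-size sites. The companies did not say how a multisite deployment would actually be divided.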

For data governance, Google Distributed Cloud will offer preview access to Google Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs, allowing organizations to run models inside their controlled environments next to sensitive data stores. The architecture supports NVIDIA Confidential Computing, which encrypts training data and prompts at the hardware level so cloud operators cannot view or modify protected data.

Google is previewing Confidential G4 VMs equipped with NVIDIA RTX PRO 6000 Blackwell GPUs to extend cryptographic protections to multi-tenant public cloud environments. The companies described this as the first cloud-based confidential computing offering for NVIDIA Blackwell GPUs.

NVIDIA Nemotron 3 Super is available on the Gemini Enterprise Agent Platform to support reasoning and multimodal models used in agentic systems. Google Cloud and NVIDIA also introduced Managed Training Clusters on the platform, with a managed reinforcement learning API, built with NVIDIA NeMo RL, that automates cluster sizing, failure recovery and job execution during long training cycles.

Customers and partners are testing parts of the stack. CrowdStrike runs NVIDIA NeMo libraries, including NeMo Data Designer and NeMo Megatron Bridge, to generate synthetic data and fine-tune cybersecurity models on Managed Training Clusters with Blackwell GPUs. OpenAI uses GB300 and GB200 NVL72 systems for large-scale inference. Thinking Machines Lab runs its Tinker API on A4X Max VMs. Snap moved data pipelines to GPU-accelerated Spark on Google Cloud to reduce costs for large-scale A/B testing. Schrödinger uses NVIDIA-accelerated computing on Google Cloud to shorten drug discovery simulations from weeks to hours.

Deployment options range from full NVL72 racks to fractional G4 VMs offering one-eighth of a GPU so customers can match compute to specific workloads. NVIDIA Omniverse libraries and the open-source NVIDIA Isaac Sim framework are available through the Google Cloud Marketplace for building digital twins and training robotics pipelines. NVIDIA NIM microservices, including the Cosmos Reason 2 model, can be deployed to Google Vertex AI and Google Kubernetes Engine to support vision-based agents.

The developer community for the joint platform has grown quickly, with Google and NVIDIA reporting more than 90,000 developers in one year. Startups including CodeRabbit and Factory use Nemotron-based models on Google Cloud for code review and autonomous software agents. Companies such as Aible, Mantis AI, Photoroom and Baseten build enterprise data, video intelligence and generative imagery solutions on the platform.

Mark Lohmeyer, vice president and general manager of AI and Computing Infrastructure at Google Cloud, commented: “At Google Cloud, we believe the next decade of AI will be shaped by customers’ ability to run their most demanding workloads on a truly integrated, AI-optimized infrastructure stack. By combining Google Cloud’s scalable infrastructure and managed AI services with NVIDIA’s industry-leading platforms, systems and software, we’re giving customers flexibility to train, tune, and serve everything from frontier and open models to agentic and physical AI workloads, while optimizing for performance, cost, and sustainability.”

The announcements at Google Cloud Next outlined a combined hardware, networking and software approach intended to support large-scale inference, confidential model deployments and production-ready agentic systems across regulated industries and industrial applications.
