
Nvidia's $700 million acquisition: what is the plan?

2024-05-08

Source: Semiconductor Industry Watch (ID: icbank), compiled from HPCwire.

Nvidia's market cap reached $2 trillion on the strength of GPU sales alone, and the company still has room to grow in software. It is hoping to fill a large gap in its software offering with the $700 million deal to buy Run:ai.


AI deployments are getting bigger, more complex, and spread across more GPUs and accelerators. Run:ai provides middleware to orchestrate and manage these deployments and to ensure that resources are not wasted.



This middleware includes tools to speed up workloads, manage resources, and ensure that errors do not take down an entire AI or HPC operation. It runs on the Kubernetes layer to virtualize AI workloads on GPUs.
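At the Kubernetes layer, GPUs are exposed as schedulable resources that a workload requests much like CPU or memory. A minimal pod spec might look like the sketch below; the image tag, pod name, and command are placeholders, not anything from Run:ai:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.03-py3   # example container image
      command: ["python", "train.py"]           # placeholder entrypoint
      resources:
        limits:
          nvidia.com/gpu: 2    # GPUs requested via the NVIDIA device plugin
```

Middleware like Run:ai sits above this mechanism, deciding which such requests run where and when, rather than leaving placement entirely to the default Kubernetes scheduler.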


Nvidia's GPUs are the hot product of the AI boom, and customers can buy them through all the major cloud providers. The Run:ai acquisition will help Nvidia build what amounts to its own cloud service without building a data center: Nvidia wants to create its own network of GPUs and DGX systems spanning all the major clouds, and Run:ai's middleware will give customers an important hook for reaching more GPUs, whether in the cloud or on-premises.


"Run:ai enables enterprise customers to manage and optimize their computing infrastructure, whether on-premises, in the cloud, or in a hybrid environment," Nvidia said in a blog post.


At the top of Nvidia's software stack is AI Enterprise, which includes programming, deployment, and other tools. It has 300 libraries and 600 models.


The stack includes the proprietary CUDA parallel programming framework, compilers, AI large language models, microservices, and other tools, including container toolkits, while Run:ai's middleware supports deployment of open-source large language models.




Nvidia GPUs are cloud-native, and Google, Amazon, and Oracle have powerful Kubernetes stacks. Nvidia already has its own container runtime and a Kubernetes device plugin for GPUs, but Run:ai will bring finer control to AI container management and orchestration. As a result, Nvidia can rely more on its own tools rather than entirely on the cloud providers' configurations.


The problem




Assigning multiple GPUs to AI tasks is still not a simple matter. Nvidia's GPUs sit in DGX server boxes deployed at all the major cloud providers.


Nvidia's Triton Inference Server automatically distributes inference workloads among multiple GPUs in a configuration, but there are problems. AI workloads also require Python code that points to the cloud operator, and only then will they execute on Nvidia GPUs in the cloud service.
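Triton's actual scheduling is internal to the server, but the basic idea of spreading inference requests across a pool of GPUs can be sketched in a few lines of Python. This is a simplified illustration of round-robin dispatch, not Triton's implementation:

```python
from itertools import cycle

class InferenceDispatcher:
    """Round-robin inference requests across a fixed pool of GPU workers.

    A toy stand-in for the kind of multi-GPU load balancing an
    inference server performs internally.
    """

    def __init__(self, gpu_ids):
        self._gpus = cycle(gpu_ids)   # endless rotation over the GPU pool

    def submit(self, request_id):
        gpu = next(self._gpus)        # pick the next GPU in the rotation
        return f"request {request_id} -> GPU {gpu}"

dispatcher = InferenceDispatcher([0, 1, 2, 3])
assignments = [dispatcher.submit(i) for i in range(6)]
# GPUs 0-3 each take one request, then GPUs 0 and 1 take the overflow
```

Real schedulers weigh queue depth, batch size, and memory pressure rather than simple rotation, which is exactly the kind of decision-making middleware like Run:ai adds on top.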



Nvidia is buying Run:ai because it wants to reduce its dependence on cloud operators, another step in locking customers into its software stack. Customers can rent GPU time in the cloud and then turn to Nvidia for all their software needs.


At the same time, it fulfills Nvidia's primary need to deliver a complete software stack.



Preparing for the future of artificial intelligence


Right now, AI training and inference are mostly done on GPUs in data centers, but that will change in a few years.



Over time, artificial intelligence, and especially inference, will move from the data center to the edge. AI PCs are already being used for inference.


The current state of AI processing on power-hungry GPUs is not sustainable. It is the same problem cryptocurrencies face: large numbers of hungry GPUs running complex math operations at full speed in a race to mine results quickly.


Nvidia has tried to reduce the power consumption of its chips with Blackwell. But the company is also adding software: Run:ai will help coordinate workloads among GPUs and further connect to AI PCs and edge devices over the network.


AI processing will also be done at various waypoints, such as telecommunications chips, as data travels through wireless and wired networks. The more demanding AI workloads will remain on servers with GPUs, while less demanding workloads will be offloaded to the edge.



Companies including Rescale have partnered with others to keep high-priority tasks on GPUs in the cloud while low-priority tasks are sent to low-end chips elsewhere. Run:ai's orchestration can manage this through a combination of speed, energy efficiency, and resource utilization.
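The placement decision described here is, at its core, a priority filter: work above some bar goes to cloud GPUs, the rest to cheaper silicon. A schematic sketch, where the threshold, tier names, and task names are all invented for illustration:

```python
def place_task(priority, threshold=5):
    """Route high-priority work to cloud GPUs, the rest to edge chips.

    A schematic version of priority-based placement; the threshold
    and tier names are invented, not from any vendor's API.
    """
    return "cloud-gpu" if priority >= threshold else "edge-chip"

# hypothetical tasks with priority scores
tasks = {"train-llm": 9, "nightly-report": 2, "batch-embed": 5}
placements = {name: place_task(p) for name, p in tasks.items()}
```

A production orchestrator would fold in cost, latency budgets, and current utilization, but the shape of the decision is the same.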


Run:ai stack


A small mistake can paralyze an entire AI operation. Run:ai's stack has three operational layers designed to prevent such incidents and provide safe, efficient deployment.


At the bottom is the AI cluster engine, which ensures that GPUs are fully utilized and running efficiently.


The engine provides granular insight into the entire AI stack, including the compute nodes and the users and workloads running on them. Companies can prioritize specific tasks and make sure idle resources are utilized.


If a GPU looks busy, Run:ai will reallocate resources. It can also allocate GPU quotas per user or partition resources within a single GPU to ensure proper allocation.
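Quota-based sharing of this kind can be sketched simply: each team is guaranteed its quota, and idle capacity is lent to whoever wants more. This is a hypothetical illustration of the idea, not Run:ai's code; the team names and numbers are made up:

```python
def allocate_gpus(total_gpus, quotas, demands):
    """Give each team up to its guaranteed quota first, then lend
    leftover GPUs to teams whose demand exceeds their quota.

    quotas and demands both map team name -> GPU count.
    """
    # Phase 1: everyone gets min(quota, demand), so no quota is violated
    alloc = {t: min(quotas.get(t, 0), demands[t]) for t in demands}
    spare = total_gpus - sum(alloc.values())
    # Phase 2: hand out spare GPUs, largest unmet demand first
    for team in sorted(demands, key=lambda t: demands[t] - alloc[t], reverse=True):
        extra = min(spare, demands[team] - alloc[team])
        alloc[team] += extra
        spare -= extra
    return alloc

# 8 GPUs: team-a is under quota, so team-b borrows the idle capacity
result = allocate_gpus(8, {"team-a": 4, "team-b": 4}, {"team-a": 1, "team-b": 7})
```

The key property is that borrowed GPUs can be reclaimed the moment the under-quota team's demand returns, which is what keeps guarantees and utilization compatible.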


The second layer, called the control plane engine, provides granular visibility into the resources used in the cluster engine, along with cluster-management tools to ensure that metrics are met. It also sets policies for access control, resource management, and workloads, and includes reporting tools.


The top layer includes APIs and development tools. The development tools also support open-source models.


In line with Nvidia's new GPUs


The biggest variable is whether Run:ai will take advantage of the RAS (reliability, availability, and serviceability) features in Nvidia's latest Blackwell GPUs. The Blackwell GPU was launched in March and includes finer-grained features to ensure the chip is performing as intended.


The GPU has on-chip software to flag healthy and unhealthy GPU nodes. "We're looking at the data trail of all these GPUs, monitoring thousands of data points per second to see how best to get the job done," Charlie Boyle, vice president and general manager of Nvidia's DGX systems division, said in an interview in March.


Run:ai could be more efficient if it could leverage Blackwell's metrics and telemetry. That kind of fine-grained reporting can go a long way toward ensuring that AI tasks run smoothly.
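If an orchestrator could read per-GPU health telemetry of the kind described above, steering work away from suspect nodes would be straightforward in principle. A hypothetical sketch follows; the field names, thresholds, and GPU ids are invented for illustration, not a real RAS schema:

```python
def schedulable_gpus(telemetry, max_ecc_errors=0, max_temp_c=90):
    """Keep only GPUs whose health telemetry looks clean.

    telemetry maps GPU id -> dict of health counters; the field
    names here are illustrative, not an actual telemetry format.
    """
    healthy = []
    for gpu_id, stats in sorted(telemetry.items()):
        if stats["ecc_errors"] <= max_ecc_errors and stats["temp_c"] <= max_temp_c:
            healthy.append(gpu_id)   # clean node: eligible for new work
    return healthy

telemetry = {
    "gpu0": {"ecc_errors": 0, "temp_c": 71},
    "gpu1": {"ecc_errors": 3, "temp_c": 68},   # flagged: memory errors
    "gpu2": {"ecc_errors": 0, "temp_c": 95},   # flagged: running hot
}
eligible = schedulable_gpus(telemetry)
```

Filtering before placement is cheaper than migrating a job off a failing node mid-run, which is why this kind of telemetry matters to an orchestration layer.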


Nvidia's acquisition history


Nvidia's revenue for the most recent quarter was $22.1 billion, up 265 percent from the same period last year. Data center revenue was $18.4 billion.


The company is generating software revenue through a subscription model and hopes it will eventually become a multibillion-dollar market. The Run:ai acquisition should help make that happen.


Nvidia made headlines with its failed acquisition of ARM, long before the company became a $2 trillion behemoth. The ARM bid was held up by antitrust and regulatory concerns; had it gone through, the chipmaker would have dominated both the CPU and GPU markets. ARM already dominates the mobile market and is moving into the server and PC markets.


In 2011, the chipmaker spent $367 million to acquire software-modem maker Icera, which turned out to be a flop. Nvidia eventually gave up its pursuit of the mobile phone market, and the Icera products were abandoned.











