SoftBank and Ampere Optimize AI Inference on CPUs

Source: SoftBank

SoftBank and Ampere begin joint verification of running SLM and MoE AI models on CPUs, aiming to reduce power consumption and increase efficiency for AI agents.

by Philip Lee

Tokyo, Japan - SoftBank Corp. and semiconductor designer Ampere Computing LLC announced joint testing to evaluate whether central processing units (CPUs) can handle artificial intelligence workloads more efficiently than the graphics processing units (GPUs) that currently dominate the sector.

The tests focused on using Ampere's CPUs for AI inference—the process of running a trained model to generate responses—alongside SoftBank's proprietary software for managing and distributing AI models.

According to the companies, technical evaluations conducted in a multi-node environment that combined CPU-only and hybrid CPU-GPU servers showed lower power consumption and more concurrent model executions than standard GPU-based configurations.

A customized version of llama.cpp, an open-source inference framework, was used in the testing.

The configuration reportedly enabled faster switching between AI models.
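For context, the sketch below shows what plain CPU-only inference looks like through llama.cpp's stock Python bindings (llama-cpp-python). The model file, thread count, and prompt are illustrative assumptions, and this is the off-the-shelf library rather than the customized build the companies tested.

```python
# Minimal CPU-only inference sketch using the stock llama-cpp-python
# bindings (not SoftBank/Ampere's customized build).
# The model path, thread count, and prompt are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/slm-q4.gguf",  # hypothetical quantized small language model
    n_gpu_layers=0,                   # keep every layer on the CPU
    n_threads=32,                     # match the CPU node's core count
)

out = llm("Summarize today's meeting notes in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```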

The evaluations centered on smaller AI architectures, such as Small Language Models (SLMs) and Mixture of Experts (MoE) models, which are candidates for autonomous systems and business process automation, where response speed and energy consumption are priorities.

Ryuji Wakikawa, vice president at SoftBank's Advanced Technology Research Institute, stated that the performance and power efficiency of Ampere’s Arm-based CPUs are important for the large-scale deployment of future AI agents.

The companies indicated plans to continue developing an inference platform that dynamically switches among multiple models while maintaining consistent output throughput.
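As an illustration of the idea only (SoftBank's platform design has not been published), such switching might keep several small models resident in a CPU node's memory and route each request by task. All model names and the routing scheme below are assumptions.

```python
# Illustrative sketch of per-request model switching on a CPU node.
# This is not SoftBank's software; names and routing logic are assumptions.
from llama_cpp import Llama

# Load several small models once. On a large-memory CPU server they can
# stay resident in RAM simultaneously, so "switching" is just a lookup.
models = {
    "chat":    Llama(model_path="models/chat-slm.gguf",    n_gpu_layers=0),
    "code":    Llama(model_path="models/code-slm.gguf",    n_gpu_layers=0),
    "summary": Llama(model_path="models/summary-slm.gguf", n_gpu_layers=0),
}

def infer(task: str, prompt: str) -> str:
    """Route a request to the model registered for its task."""
    out = models[task](prompt, max_tokens=128)
    return out["choices"][0]["text"]

print(infer("summary", "Condense: CPUs handled more concurrent models per watt."))
```

Because CPU servers typically offer far more system RAM than GPUs offer onboard memory, keeping several quantized SLMs loaded at once is plausible, which is one reason model switching can be fast in such a setup.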

Sean Varley, chief evangelist at Ampere, said the results suggest that CPUs could serve as a lower-cost alternative to GPUs for certain distributed AI workloads, particularly those with variable demand.
