Artificial intelligence (AI) is widely expected to transform every aspect of human activity. But to fully unleash that potential, the industry must confront a fundamental fact: the infrastructure that underpinned traditional computing can no longer meet the demands of future AI development.
The scale of this transformation is already staggering:
Training GPT-4 used more than 1 PB of data – roughly equivalent to 200 million songs, or enough music to play continuously for 1,000 years.
OpenAI serves one billion active users every month, and each user consumes ten thousand times the data of a traditional application.
By 2030, this AI revolution will drive infrastructure investment of over one trillion US dollars.
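The song analogy above can be sanity-checked with some back-of-the-envelope arithmetic. The per-song size (5 MB) and length (4 minutes) below are illustrative assumptions, not figures from the talk:

```python
# Rough check of the "1 PB ~ 200 million songs ~ 1,000 years of playback" claim.
# Assumed: 5 MB per compressed song, 4 minutes per song (not from the source).

PETABYTE_MB = 1_000_000_000   # 1 PB in megabytes (decimal units)
SONG_SIZE_MB = 5              # assumed average song file size
SONG_LENGTH_MIN = 4           # assumed average song length

songs = PETABYTE_MB / SONG_SIZE_MB
playback_years = songs * SONG_LENGTH_MIN / (60 * 24 * 365)

print(f"{songs:,.0f} songs")           # 200,000,000 songs
print(f"{playback_years:,.0f} years")  # ~1,522 years of continuous playback
```

Under these assumptions 1 PB does come out to about 200 million songs, and continuous playback lands in the same order of magnitude as the quoted 1,000 years.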
This explosive growth is pushing data center energy consumption from the megawatt level to the gigawatt level. The resulting constraints cannot be solved simply by adding more general-purpose servers: the entire industry must fundamentally rethink how computing infrastructure is architected, built, and deployed. Enterprises that navigate this transformation successfully will fully unleash the potential of AI; those that fail to keep pace risk being pushed out of the market.
In a SKYTalk keynote at the 62nd Design Automation Conference (DAC), held recently in San Francisco, Mohamed Awad, Senior Vice President and General Manager of Arm's Infrastructure Division, shared his insights on how to embrace infrastructure change and seize the trillion-dollar opportunity of AI.
Lessons from past technological transformations
Awad noted that there is a “blueprint” for navigating a technological shift of this magnitude. Over the past 30 years – from mobile computing to the automotive revolution to the rollout of the Internet of Things – every successful technological revolution has followed a similar path, and the enterprises that emerged as leaders share three characteristics:
Pursue technological leadership
Possess system-level thinking
Cultivate a powerful ecosystem
This pattern offers an important reference for the AI transformation. The mobile revolution was not just about faster processors; it required comprehensive innovation in energy efficiency, software stacks, and even manufacturing partnerships. Likewise, as the automotive industry moves toward autonomous driving and electrification, it must advance chip design, system architecture, and ecosystem collaboration in an integrated way.
Awad said, “To enable AI to truly achieve the grand goals we have set for it, what is actually needed is still the same path – technological leadership, systems designed from the bottom up, and a powerful ecosystem.”
The urgency of infrastructure evolution
The evolution of the data center demonstrates the industry’s ability to adapt quickly to AI demands. Before 2020, enterprises relied mainly on general-purpose servers with accelerators added via PCIe slots. By 2020, the focus had shifted to integrated servers with direct GPU-to-GPU interconnects. In 2023, we saw tightly coupled CPU-GPU integration. Today the industry is moving toward complete “AI factories” – entire server racks built from the chip level up for specific workload scenarios.
Leading technology companies are abandoning the “one-size-fits-all” approach of general-purpose architecture. NVIDIA’s Vera Rubin AI cluster, Amazon Web Services’ (AWS) AI UltraCluster, Google’s Cloud TPU racks, and Microsoft’s Azure AI racks are all custom systems designed for each company’s unique needs rather than general-purpose solutions.
“All the leading hyperscale cloud service providers are doing the same thing,” Awad explained: they build highly integrated systems starting at the chip level, letting their own system requirements drive innovation back down into the silicon.
This transformation reflects the broad consensus reached throughout the industry: the computing demands of AI must rely on infrastructure specifically designed for AI workloads, rather than solutions modified from general-purpose systems.
Performance proven at scale
AWS reports that over 50% of the new CPU capacity it has deployed in the past two years runs on its Arm-based Graviton processors. Moreover, key workloads including Amazon Redshift, Prime Day, Google Search, and Microsoft Teams now run on infrastructure built on technologies such as Arm Neoverse, delivering significant gains in performance and energy efficiency.
Awad further explained that these moves are not about cutting costs but about pursuing performance: enterprises build custom chips not because they are cheaper, but because they reach levels of performance and energy efficiency that general-purpose solutions cannot match in their specific data center environments.
Accelerate innovation through collaboration
Creating custom chips poses real challenges: high cost, design complexity, and long development cycles. The solution is to lower the barrier and accelerate innovation through a collaborative ecosystem. Pre-integrated compute subsystems such as Arm CSS (Compute Subsystems), shared design resources, and proven tool flows can significantly shorten the development cycle.
Industry examples already demonstrate the potential of ecosystem collaboration. By using pre-configured, pre-verified CSS in their designs, some partner projects have saved the equivalent of 80 engineers’ work per year and compressed development cycles from years to months. Awad noted that one project went from kickoff to a manufactured chip running Linux across 128 cores in just 13 months – an astonishing pace for leading-edge chip development.
The emerging chiplet ecosystem represents another major advance in industry collaboration. Initiatives such as the Arm Chiplet System Architecture (CSA) are defining common interfaces and protocols, with many Asia-Pacific partners participating to jointly develop standardized compute modules that can be combined as needed for different scenarios – a more flexible and cost-effective development path. In addition, ecosystem programs such as Arm Total Design connect foundries, design service providers, IP suppliers, and firmware partners to streamline the entire development process.
The synergy of software and hardware unleashes the potential of AI
Hardware innovation alone cannot truly unleash the potential of AI. Success also requires a strong software ecosystem – the product of more than 15 years of sustained investment: millions of developers, broad support from open source projects, and tens of thousands of vendors building compatible solutions.
Today’s leading AI infrastructure deployments rely on mature software stacks spanning Linux distributions, cloud-native technologies, enterprise SaaS applications, and AI/ML frameworks. This software maturity lets enterprises deploy new hardware architectures with confidence that their entire technology stack will run seamlessly.
“Hardware makes no sense without software,” Awad said, and the point is crucial. “When we talk about accelerators, devices and chips built for AI, people often ask me about the software side. Start-ups often come to me and say, ‘Hey, I’ve built this great hardware product.’ But when I ask, ‘How many people are developing software for it?’, the answers are often not so persuasive.”
Embrace infrastructure transformation
As AI continues to grow exponentially, the challenges faced by infrastructure will also become increasingly severe. Enterprises cannot achieve expansion merely by adding traditional servers. What they need are customized systems optimized for AI workloads, and they must also have the ability to operate efficiently on an unprecedented scale.
Enterprises and technologies that can successfully cope with this transformation often share common characteristics: they pursue breakthrough performance through technological leadership, adopt system-level holistic thinking rather than component-level thinking, and build collaborative ecosystems to accelerate innovation while reducing individual risks.
This infrastructure transformation is both a challenge and an opportunity. Enterprises that prepare now – by understanding these core principles and building the right technological foundation – stand the best chance of seizing the trillion-dollar market opportunity of AI. Those that cling to the old model may miss the greatest technological opportunity of our era.
Awad concluded, “The future belongs to those who are ready to create it.” The transformation of infrastructure has begun.