Liquid cooling’s moment comes courtesy of AI

by Pelican Press

Liquid cooling technology dates back more than a century, but its use in the data center is newer and growing more popular thanks to AI. In the last few years, the power demand for compute has ratcheted up, and liquid cooling has emerged as a way to cool servers that generate heat as they process data workloads.

As electricity passes through conductors, it generates heat. The higher the power draw, the more heat is produced and the greater the need to dissipate it. For data ingestion and model training, AI workloads rely on high-end GPUs for computation, and each new GPU generation is built to consume more electricity. In just a few years, the power consumption of Nvidia GPUs rose 75%, from 400 W each for the A100 to 700 W for the H100. The upcoming Blackwell B200 models Nvidia plans to ship this quarter are built to consume more than 1,000 W.
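To put those wattage figures in perspective, the short Python sketch below turns them into a percentage increase and an approximate heat load per server. It is an illustration only: the eight-GPU server size is a hypothetical value, the B200 figure is treated as a 1,000 W floor, and the calculation assumes that essentially all electrical power drawn by a GPU ends up as heat.

```python
# Per-GPU power figures quoted in the article, in watts.
# The B200 entry is a lower bound ("more than 1,000 W").
GPU_POWER_W = {"A100": 400, "H100": 700, "B200": 1000}

# Hypothetical server size, chosen only for illustration.
GPUS_PER_SERVER = 8


def percent_increase(old_w: float, new_w: float) -> float:
    """Return the percentage increase from old_w to new_w."""
    return (new_w - old_w) / old_w * 100.0


def server_gpu_heat_w(gpu_power_w: float, gpu_count: int = GPUS_PER_SERVER) -> float:
    """Approximate GPU heat load, assuming all electrical power becomes heat."""
    return gpu_power_w * gpu_count


if __name__ == "__main__":
    increase = percent_increase(GPU_POWER_W["A100"], GPU_POWER_W["H100"])
    print(f"A100 to H100: +{increase:.0f}% power per GPU")
    for name, watts in GPU_POWER_W.items():
        heat_kw = server_gpu_heat_w(watts) / 1000
        print(f"{name}: about {heat_kw:.1f} kW of GPU heat per {GPUS_PER_SERVER}-GPU server")
```

The per-server totals cover GPUs only. CPUs, memory, networking and power conversion losses would add to the heat a cooling system has to remove.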

How do vendors satiate tech’s ever-growing energy hunger? That will be the question going forward, according to Laura Musgrave, lead researcher on responsible AI at BJSS, a U.K.-based IT services and consulting firm.

Citing a report from the International Energy Agency (IEA), “Electricity 2024,” Musgrave said data center energy demands could double by 2026, driven by AI. The IT industry needs to plan for ways to mitigate this energy consumption.

“We’re going to need to look at all areas that contribute to the AI life cycle,” she said. “So we’re obviously thinking about things like the cooling systems [for] the hardware used to run AI.”

Unprecedented time for AI

There are several ways to use liquid to cool data center hardware. One is immersion cooling, which transfers heat from the equipment by submerging the hardware in a non-conductive solution. Another uses rear-door heat exchangers to dissipate heat through a liquid coolant at the back of the server. However, for AI workloads where the GPUs generate most of the heat, experts said direct-to-chip is the method gaining the most traction. Direct-to-chip uses a liquid coolant that runs to a cold plate where the GPU is sitting and removes heat to dissipate at a heat exchanger elsewhere in the system.
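As a rough illustration of what a direct-to-chip loop has to do, the Python sketch below estimates the coolant flow needed to carry away a given heat load using the standard heat-transfer relation Q = m_dot * c_p * delta_T. The coolant is assumed to be plain water with approximate room-temperature properties, and the 10 K temperature rise across the cold plate is a hypothetical design value, not a vendor specification. Real systems often use water-glycol mixtures or other fluids with different properties.

```python
# Approximate properties of water near room temperature (assumed coolant).
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_L = 1.0


def coolant_flow_l_per_min(heat_w: float, delta_t_k: float) -> float:
    """Flow needed to carry heat_w with a coolant temperature rise of delta_t_k.

    Uses Q = m_dot * c_p * delta_T, solved for the mass flow m_dot,
    then converts kg/s to liters per minute.
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


if __name__ == "__main__":
    # 700 W is the H100 figure cited in the article; 10 K is a hypothetical rise.
    flow = coolant_flow_l_per_min(heat_w=700, delta_t_k=10)
    print(f"About {flow:.2f} L/min of water per 700 W GPU at a 10 K rise")
```

Under these assumptions, the required flow scales linearly with GPU power, so higher-wattage chips need either more flow or a larger temperature rise across the cold plate.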

Different methods address different needs. The liquid cooling market, especially for AI chips, is evolving quickly, according to Vlad Galabov, director of cloud and data center research at global analyst firm Omdia.

“[No one vendor is] fighting for a small chunk of pie,” he said. “The pie is huge. It’s growing enormously fast.”

Bernie Malouin, CEO of direct-to-chip cooling startup JetCool, agreed that the market is growing quickly. Three years ago, about 7% of the data center market used liquid cooling; today it's around 22%, he said, citing Uptime Institute's "2024 Cooling Systems Survey: Direct Liquid Cooling Results."

“Everybody’s got a little piece of the pie. That whole pie keeps getting bigger and bigger as it goes forward,” Malouin said.

To date, there are over a dozen major liquid cooling vendors and several startups. But what they offer collectively is not enough to satisfy the market as it stands today, according to Joe Capes, CEO of LiquidStack, a liquid cooling vendor that offers both immersion and direct-to-chip cooling.

“The market is changing almost weekly,” Capes said. “I’ve never seen anything like this before in my career. I think what we can safely say is that the period that we’re going through and into for the next year or two is historic in nature.”

JetCool's microconvective cooling, an array of cooling jets aimed at hot spots, and its microchannel cooling, a method that uses channels to dissipate heat.

More than one answer

While AI workloads create business for liquid cooling vendors, those vendors still need to differentiate themselves, Galabov said.

“I do think that there is some tremendous opportunity — and opportunity for new people,” he said. “I think what needs to set you apart is the efficiency of your cooling.”

If the GPU uses more than 700 W, the cooling vendor must match that performance, Galabov said.

Accelsius developed NeuCool, which uses targeted cold plates on hot spots. JetCool uses a series of jets aimed at the hottest spots on the chip. LiquidStack offers both direct-to-chip and immersion cooling to fit customer needs. Several vendors are building their own cooling racks to fit customer use cases, including Dell, HPE and OVHcloud, as well as Lenovo with its long-running Neptune liquid cooling.

Although liquid cooling can alleviate some power consumption concerns, Capes remains concerned about projections for more data centers and more AI model training.

“The reality is that as much as liquid cooling is going to help alleviate the stress on the power grid … we’re seeing projections that over the next 10 years, the amount of power data centers consume can triple,” he said.

Don’t generate the heat in the first place

While liquid cooling vendors work to build systems to conserve energy, one vendor thinks a better solution is to simply not generate the heat in the first place.

Daanaa Resolution, based in Vancouver, British Columbia, is currently testing a power transaction unit (PTU) to optimize power conversion for more efficient energy transfer and less waste. As power comes into a building and goes through multiple form factors to get to the chip, energy is wasted through conversion and heat generation, according to Udi Daon, CEO of Daanaa.

“If you’re going to liquid cool, that means that you created the inefficiency to begin with that you need to cool,” Daon said.

The PTU will allow for more efficient transfer using a patented method of power conversion that doesn’t rely on magnetic conduction, which generates heat through coil excitation. Instead, the PTU works with the electric and magnetic fields simultaneously to convert power without heat generation.

However, Daon doesn’t see the PTU as a direct competitor to liquid cooling but as another way to create a more energy-efficient data center architecture moving forward.

“I don’t think any one technology is a silver bullet,” he said. “It’s always going to be a combination of technologies that need to work well together.”

From sustainability to sustainable

Reducing power consumption reduces a company’s carbon footprint. But the focus for liquid cooling today isn’t on sustainability. It’s on AI performance and finding ways to add more GPUs to an AI cluster.

The push-pull between power consumption for AI and environmental concerns was on stage recently at an AI conference in Washington, D.C. There, former Google CEO Eric Schmidt said, “We’re not going to hit the climate goals anyway because we’re not organized to do it. I’d rather bet on AI solving the problem than constraining it and having the problem.”

If data center providers are going to get back to sustainability, they must first address their need for more power and correct course, according to Steven Dickens, an analyst at the Futurum Group.

“We were on the right trajectory with data centers getting more and more efficient. And then we threw GPUs into them, and that curve’s gone the other way,” he said.

The only way to get back on track is by investing in and implementing technologies like liquid cooling, Dickens said.

Another step is consumer education. The IEA report proposed an AI energy rating system, similar to the government-backed Energy Star program, to indicate a certain level of efficiency, Musgrave said.

“Something like that would help consumers to make informed decisions about how and when to use AI and which AI tools to use,” she said. “It would also help AI developers to consider the energy demand of AI systems and tools as well.”

Energy consumption must be addressed, Daon agreed. He said companies and consumers also need to prioritize use cases because there is a limit to how much power can be generated at a given time.

“[In] some desert areas, people are going to be asking, ‘Do I want AC or AI?’” he said.

Adam Armstrong is a TechTarget Editorial news writer covering file and block storage hardware and private clouds. He previously worked at StorageReview.


