
The Supercomputing Efficiency Curve Bends In The Right Direction


Things get a little wonky at exascale and hyperscale. Issues that don't matter quite as much at enterprise scale, such as the cost or the performance per watt or the performance per dollar per watt for a system or a cluster, end up dominating the buying decisions.

The main reason is that powering and cooling large aggregations of machines quickly costs a lot more than buying the iron, and this makes some machines untenable financially and physically. For the HPC centers of the world, a lack of energy efficiency means not having enough budget to cover the costs of the system, and for the hyperscalers and public cloud builders, it means not being able to compete aggressively with on-premises gear sold to the masses and not being able to garner the high operating profits that the shareholders in the public clouds and hyperscalers expect.

It all comes down to the same thing. To get a handle on how energy efficiency in the HPC sector has changed over time and how future exascale-class systems are expected to improve upon this curve, AMD commissioned Jonathan Koomey, a research affiliate at Lawrence Berkeley National Laboratory for nearly three decades, a research fellow at Stanford University, and an advisor to the Rocky Mountain Institute for nearly two decades, to do some analysis along with Sam Naffziger, a Fellow at Hewlett Packard, Intel, and now AMD for many cumulative decades who has worked on processor designs and power optimization, and to crunch the numbers on HPC systems in the twice-yearly Top 500 supercomputer rankings.

The resulting paper, which was just published and which you can read here, ultimately focused on the performance and energy consumption of the top system. If you extract that data out, you find that there are really two performance curves over the past two decades – one where performance doubled every 0.97 years on average between 2002 and 2009 and another where performance doubled every 2.34 years – not the average of 1.35 years you would get if you just slapped a line down on the whole dataset from 2002 through 2022, where the expected 1.5 exaflops "Frontier" system being built by Cray and AMD for Oak Ridge National Laboratory will fit on the performance line. Here is what the data looks like:
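As an aside, a doubling time T translates into an annual improvement factor of 2^(1/T). Here is a minimal sketch that converts the doubling periods quoted above; the labels are ours, not the paper's:

```python
# A minimal sketch (not from the paper) converting doubling times into
# annual improvement factors via the standard relation growth = 2 ** (1 / T).
doubling_times_years = {
    "2002-2009 fit": 0.97,
    "2009-2022 fit": 2.34,
    "single 2002-2022 fit": 1.35,
}
for label, t in doubling_times_years.items():
    print(f"{label}: performance grows about {2 ** (1 / t):.2f}x per year")
```

In other words, the 2002 to 2009 stretch was roughly a 2X per year ride, while the more recent fit works out to only about 1.34X per year.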

Now, here is what the performance and efficiency curves look like just for the machines installed from 2009 through 2022, including the estimate for Frontier, which is expected to be the highest performing supercomputer in the world at the tail end of that dataset. Take a look:

The performance of the Frontier system at double precision floating point is known to be 1.5 exaflops, but the energy consumption of the system, and therefore the performance per watt, has some pretty big error bars on it right now because these figures have not been divulged. All that Cray has said is that the machine will have in excess of 100 cabinets. Cray has also said that the "Shasta" cabinets on which Frontier will be based have 64 blades per cabinet.

We think that Cray will be able to get two or four compute complexes, each with a single custom "Genoa" Epyc 7004 processor and four custom Radeon Instinct GPU accelerators, onto a single blade. As we showed with some very rough, back of the cocktail napkin math when talking about the "Aurora" A21 machine being built by Intel and Cray for Argonne National Laboratory and the future Frontier machine, our best guess is that the Frontier machine will have 25,600 nodes in total, with the CPU having 3.5 teraflops of its own FP64 oomph and the Radeon Instincts having 13.8 teraflops of FP64 performance each, and if you do the math, that works out to 1.5 exaflops.

Depending on how densely Cray can pack this, it could be 200 racks at 150 kilowatts or 100 racks at 300 kilowatts. Cray has only said it is more than 100 cabinets, but our point is that the cabinet density is not going to change the number of nodes or the performance or the performance per watt. At four nodes per blade, a low 150 watts for the CPU, and maybe 200 watts to 280 watts per GPU, you can play around and get under that 300 kilowatt threshold per rack for the raw compute with some room for memory, flash, and "Slingshot" interconnect. You could go half as dense with 200 racks, free up some space, and just try to cram maybe 150 kilowatts to 180 kilowatts per rack, with more headroom for storage and networking, and still be in the range of 30 megawatts to 36 megawatts for the whole machine.
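For those who want to check the arithmetic, here is a minimal sketch of that napkin math. The node count, the per-device FP64 numbers, and the CPU and GPU wattages are all our guesses from above, not disclosed specs, and the power figures cover raw compute only:

```python
# Back-of-the-napkin Frontier estimates; all inputs are guesses, not disclosed specs.
NODES = 25_600
CPU_FP64_TF = 3.5            # guessed custom "Genoa" Epyc FP64 teraflops per node
GPU_FP64_TF = 13.8           # guessed custom Radeon Instinct FP64 teraflops each
GPUS_PER_NODE = 4

peak_tf = NODES * (CPU_FP64_TF + GPUS_PER_NODE * GPU_FP64_TF)
print(f"Peak FP64: {peak_tf / 1_000_000:.2f} exaflops")   # ~1.50 exaflops

# Assumed power envelope: 150 W per CPU and 200 W to 280 W per GPU, compute only.
for gpu_watts in (200, 280):
    node_kw = (150 + GPUS_PER_NODE * gpu_watts) / 1000
    total_mw = NODES * node_kw / 1000
    print(f"\nGPU at {gpu_watts} W: {total_mw:.1f} MW of raw compute")
    for cabinets in (100, 200):
        nodes_per_cabinet = NODES // cabinets
        print(f"  {cabinets} cabinets -> {nodes_per_cabinet * node_kw:.0f} kW per cabinet")
```

The 30 megawatt to 36 megawatt figure above sits on top of this raw compute estimate once memory, storage, and the Slingshot network take their share.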

This is all conjecture, of course. So take it in that spirit.

What is interesting to ponder is how the performance curve is bending way up on the top supercomputer thanks to Frontier, but look at how hard it is to bend that performance per watt curve up. Perhaps Cray and AMD will do better than we all expect for Oak Ridge when it comes to energy efficiency.

After reading this paper, we had a little chat with Koomey and Naffziger.

Timothy Prickett Morgan: I think the trend lines for the Top 100 machines are reasonable and the top machine is certainly representative. I get the idea that you are putting a data point in the ground for Frontier and it is definitely starting to bend the curve back up the way we would like to see it. An all GPU machine, if somebody built an exascale one, would probably be on the order of 80 megawatts.

Sam Naffziger: AMD takes the energy efficiency of all our compute solutions seriously, and one of our goals is to deliver significant – and leadership – generational efficiency gains, measured as joules per computation or performance per watt, whatever your metric. That is why we partnered with Jon Koomey, to extract what efficiency trends have been occurring on the supercomputing front, and we found that there really was very little work done on this. The first thing to understand was: what are the trends? That then sets a baseline for what we want to achieve going forward. We want to do at least as well in efficiency gains despite the headwinds of Moore's Law challenges and other impediments.

Jonathan Koomey: So there are a couple of key points on the efficiency side. First, we are using Linpack, which we all know is ancient and not terribly representative of actual workloads. But still, one of the ways of getting faster processing is to do special purpose computing. For a specific workload, you design a computer – or a set of systems – to focus on just that workload. That is one way you can break out of the constraints that we have faced over the last 15 or 20 years. As long as that workload is sufficiently large and sufficiently homogeneous, you can design a device that will do much, much better than a general purpose computing machine for that specific job, and we can still do that.

Now, the other way to attack the problem is through co-design – and we talk about that in the paper as well. There is a reference in our paper to Rethinking Hardware-Software Codesign that controlled for the silicon process of the dies, and as part of that analysis the researchers found, by applying this co-design process – integrated analysis and optimization of software and hardware together – that they were able to get a factor of 5X to 7.6X improvement in efficiency beyond what a conventional design approach would lead to. So if you optimize the system, you might do much, much better than a simple compilation of existing processors using standard interconnects and other things. But again, it requires some knowledge of workloads. It requires a systematic means of optimization.

TPM: One thing that I have found interesting – and IBM has been banging this drum with its Bayesian optimization and cognitive computing efforts – is that the best kind of computing is to do only that which you actually need. Figuring out what part of the data to pick or what kind of ensembles to do is the hard bit. And if you do that, then you might not need an exascale class computer to get the same answer you would have otherwise. I think we need to get the right answer per watt, and that is how we should be thinking about it. It isn't just about exaflops. Answer per watt and time to answer are just as important.

Sam Naffziger: This goes back to Lisa Su's talk at Hot Chips last year. There are a number of these system level dimensions which accelerate time to answer – the cache hierarchy, the balancing of memory bandwidth, the node and internode connectivity, the CPU and GPU coherency – all of which are part of the Frontier design. And it is the combination of all of those that accelerates the algorithms. What you are talking about with IBM has to be the next step.

TPM: The beautiful thing about IBM's approaches is that they work on Frontier or any other system. It has nothing to do with the architecture. I happen to believe we are going to have to do all of these things. My concern over the past couple of months is that if we keep doing what we are doing, it will take 100 megawatts to 120 megawatts and $1.8 billion to create a 10 exaflops supercomputer. We can argue whether or not we should even worry about that number. But that is the curve we have been plotting, and there is no way any government agency is going to come up with that kind of money. So kudos to AMD and Cray and Oak Ridge for bending that curve down in the other direction, and to everyone else who will do innovation on this front.

Jonathan Koomey: What you are pointing to is that the focus on brute force is leading to diminishing returns and that we have to get a whole lot more clever in how we accomplish the tasks that we want to accomplish. And that means focusing on workloads and understanding exactly how to do them in the most efficient way. But I think this is pulling us back from a very simple view of performance in computing – it is forcing people to understand that there is a real difference. We shouldn't be talking only about this kind of brute force, general purpose compute capacity – it is a much more complicated question than what people have assumed so far.

TPM: Unfortunately, though, some HPC centers have to build machines that can do many things because they are so expensive that they can't be one-workload workhorses. It is great that a machine that can do modern HPC can also do AI, and it is great that you might be able to mix them together. But had it not been for that happy coincidence of AI needing GPU acceleration for training, and presumably for inference now too, some of these big HPC machines might not have gotten built.

Jonathan Koomey: I think that we are going to see a lot more innovation on the software side. The folks at the University of Washington are working on approximate computing for certain kinds of problems, like face recognition and other applications where you don't need to solve every single bit of a problem with every bit of data. You can do things that are approximate and that get you to your end point with much less compute and with roughly the same fidelity. It is that kind of change on the software side that I think is going to yield some big benefit.

Sam Naffziger: That is what we are seeing in the machine learning training space as we have moved from 32 bits down to 16 bits and now down to bfloat16. We are trading dynamic range versus precision to get good enough answers significantly more cheaply. There is a lot of interest in the HPC community around exploiting that bfloat16 format for actual high performance computing workloads and getting good enough precision relative to FP64, but getting an answer much more cheaply.
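As a rough illustration of that tradeoff, bfloat16 keeps float32's 8-bit exponent, and therefore its dynamic range, while keeping only 7 mantissa bits of precision. Here is a minimal sketch with a hypothetical helper that emulates bfloat16 by truncating a float32; it is not any particular library's implementation:

```python
import struct

def as_bfloat16(x: float) -> float:
    # Emulate bfloat16 by keeping only the top 16 bits of an IEEE float32:
    # same sign bit and 8-bit exponent, but just 7 mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF_0000))[0]

print(as_bfloat16(3.0e38))    # ~2.99e38 survives; FP16 would overflow above 65504
print(as_bfloat16(1.000001))  # prints 1.0 -- only two to three decimal digits survive
```

The exponent is what lets a very large value survive the round trip; whether seven bits of mantissa is enough precision for a given solver is exactly the workload-by-workload question the HPC community is wrestling with.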
