Category: Hardware
February 20th, 2007
Tera-Scale: What Would We Do with All These Cores and How Would We Feed Them?
Last week’s Tera-scale announcement at the International Solid State Circuits Conference (ISSCC) certainly created a lot of buzz in the press and on the Web. I have to admit being somewhat surprised by how extensively the story was picked up, not just in the technical press, but the popular press as well. From the many interviews I did, it was quite clear that people have an insatiable desire to know what their future computing devices will do and how soon they will do it. Fortunately, researchers at Intel and elsewhere have spent several years, not just thinking about the question, but actually building prototypes of those next-decade applications. Believe me when I say it’s much more credible to talk about a specific example than just blow some smoke and promise that whatever those applications are, they will be really cool.
Back to Recognition, Mining, and Synthesis
I first addressed the issue of why now is the time to create these ideas in my post Cool Codes in which I introduced the RMS categories. The important point is there is an entirely new breed of applications waiting to be invented that doesn’t simply benefit from Tera-scale performance, it requires it. Let me refresh you on RMS by talking about real-time motion capture and rendering and a few other examples to illustrate the idea.
Today, producing a Pixar-quality image takes about 6 hours of computing on a current-generation, dual-processor rack-mount server. That's to render one frame out of the 144,000 frames required for a feature-length, animated movie. How cool would it be if you could bring that quality of image rendering to your desktop in real-time? Imagine playing the Cars video game with imagery that's comparable to what you see in the theater. To create that user experience, we have to go from 6 hours per frame to 1/24th of a second per frame, but at least it’s a very well-characterized computational improvement. It will take a combination of teraFLOPS of computing power and huge advances in the algorithms that render the image. Note that synthesis is the “S” in RMS, and this is but one example.
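To put a number on that jump, here is a quick back-of-the-envelope calculation that uses only the figures quoted above. The variable names are purely illustrative, and the 24 frames-per-second rate is simply what a 1/24th-of-a-second frame time implies.

```python
# Back-of-the-envelope: how big is the jump from offline to real-time rendering?
SECONDS_PER_HOUR = 3600

offline_seconds_per_frame = 6 * SECONDS_PER_HOUR   # about 6 hours per frame today
realtime_seconds_per_frame = 1 / 24                # one frame every 1/24th of a second
frames_in_movie = 144_000                          # feature-length animated film

speedup = offline_seconds_per_frame / realtime_seconds_per_frame
server_hours_for_movie = frames_in_movie * 6
movie_minutes = frames_in_movie * realtime_seconds_per_frame / 60

print(f"Required speedup: {speedup:,.0f}x")
print(f"Server-hours to render the whole movie offline: {server_hours_for_movie:,}")
print(f"Running time at 24 frames per second: {movie_minutes:.0f} minutes")
```

The required speedup comes out at a little over half a million, which is why neither faster hardware nor better algorithms alone is likely to close the distance.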
By the way, synthesis is not just about making pictures. It's also about making sounds and making things move and interact with one another in physically accurate ways. When an animated character speaks in these future desktop animations, their facial muscles will move exactly as they do when a real person speaks. It does raise the question of whether we’ll actually need actors at some point, but that’s a topic for another blog.
Here’s another example: Today in our labs we can data mine the imagery found in a recorded multi-camera video of an individual moving within a defined 3D space. The goal of this video stream mining is to extract their full body motion. We can’t quite do it in real-time at this point, but we are pretty close and there’s no need for marks or lights on the clothing or a background blue screen to do it. By the way, mining is the M in RMS.
Once we have the body motion information, we use it to animate a skeletal model of a human. It’s the skeletal model that makes sure we have the kinematics right and the motion is consistent with how people move. At that point, we can put the “skin on the bones” to create a fully synthetic person moving identically to the real one. Adding lights, shadows, and reflections to our little virtual world gives us a synthetic figure moving naturally and accurately within it.
If you started to think how the above technology could replace the Wii handheld remote controllers, you’ve got the idea. Future video entertainment will use full-body motion capture to put your virtual self in the game, dance instruction, or Tai Chi lesson.
Take out the Noise, Take out the Shake
Most of us have cassettes full of VHS quality (or worse) home video. When we put it up on our new 50-inch HD displays, it simply looks awful. Adding video cameras to cell phones has further exacerbated the problem. Fortunately, there is a way to rescue these old videos. The technique is called super-resolution and it takes advantage of the tremendous amount of redundancy in a video stream. Using statistical techniques, we can dramatically reduce camera shake, improve resolution, and fix a variety of other visual problems by exploiting all the extra information provided by each frame. Imagine being able to bring all your cell phone videos up to standard definition quality and reprocess those “obsolete” DVDs into high-definition DVDs. It’s a Tera-scale problem for sure, and the reconnaissance satellite folks have been doing it for years. It’s time to make it safe for home use.
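The core statistical idea, that many noisy looks at the same scene carry more information than any single look, can be sketched in a few lines. This is not super-resolution proper: a real pipeline must also estimate the motion between frames and reconstruct sub-pixel detail. The sketch below assumes the frames are already aligned and shows only the noise-reduction part. The one-dimensional "scene", the noise level, and all function names are made up for illustration.

```python
import random

def noisy_frame(scene, noise_sigma, rng):
    """Return one captured frame: the true scene plus independent sensor noise."""
    return [pixel + rng.gauss(0.0, noise_sigma) for pixel in scene]

def average_frames(frames):
    """Combine already-aligned frames by averaging each pixel position."""
    count = len(frames)
    return [sum(pixels) / count for pixels in zip(*frames)]

def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true scene."""
    squared = [(e - t) ** 2 for e, t in zip(estimate, truth)]
    return (sum(squared) / len(squared)) ** 0.5

rng = random.Random(2007)                        # fixed seed so runs are repeatable
scene = [float(i % 16) for i in range(1024)]     # hypothetical 1-D "image"
frames = [noisy_frame(scene, noise_sigma=4.0, rng=rng) for _ in range(30)]

single_error = rms_error(frames[0], scene)
combined_error = rms_error(average_frames(frames), scene)
print(f"one frame: {single_error:.2f}   30 frames averaged: {combined_error:.2f}")
```

With independent noise, averaging N frames cuts the error by roughly the square root of N. The expensive part in practice is the alignment step this sketch skips, which has to be done for every pixel of every frame.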
How Is It Possible to Feed Such a Beast?
Silent E was right in pointing out that memory capacity and bandwidth have to match or the cores will “starve” and users will not see the performance benefits. It’s relatively easy to pack a lot of processing power on a single chip. It’s much, much harder to provision the memory and I/O bandwidth to keep those processors productive. Fortunately, there are several approaches which promise to meet the future needs. Let me briefly mention two of them.
First, we need to bring more memory closer to the processors, and three approaches do this with varying degrees of bandwidth and capacity. The first is to use system-in-package (SIP) technology to place memory chips in the same package as the processor. Microsoft uses this approach in the Xbox 360. The next approach is to stack a memory chip underneath the processor, which is what we have planned as a future experiment with the Tera-scale Research Processor. Finally, there is embedding DRAM on the processor, as IBM described last week at ISSCC. Much work is required to decide which approach is best in a given situation, but the point is there is more than one solution.
Getting data on and off the chip is also a challenge. While we continue to push electrical signaling to higher and higher speeds, optical signaling is an increasingly attractive option. Costs are coming down and may decline even further when we move to silicon-based photonic solutions. If we can approach electrical costs, but still provide the flexibility and interference advantages of optical, we might just go optical. Once you make that transition, things look good out to about 10 terabits per second per fiber, which should keep us going for a little while to say the least.
Tera-scale keeps sounding more and more fun. Stay tuned as I continue to paint the complete picture. The blog is long overdue for a discussion of the programming challenges ahead.
February 12th, 2007
80 isn't nearly enough
What an exciting week this has been. We unleashed the ‘Era of Tera’ by showcasing the world’s first programmable processor that can deliver Teraflops performance with remarkable energy efficiency.
It’s rather extraordinary that after decades of single core processors, the high volume processor industry has gone from single to dual to quad-core in just the last two years. Moore’s Law scaling should easily let us hit the 80-core mark in mainstream processors within the next ten years and quite possibly sooner. It is therefore reasonable to ask the question: what are we going to do with this sudden abundance of processors?
The answer is somewhat obvious on the server side of things. More cores and more threads mean more transactions per unit time, assuming that all those cores are given the necessary memory and I/O bandwidth. Other computationally intensive applications in scientific and engineering computing are also likely beneficiaries. I’m talking about seismic analysis, crash simulation, molecular modeling, genetic research, and fluid dynamics.
On the client end of the wire, things aren’t as obvious or straightforward, but they are no less interesting. The abundance of cores is likely to lead to a very different approach to resource allocation. For decades operating systems have been optimized for managing very scarce processor resources by cleverly multiplexing many tasks or threads across one or now two or four cores. As quality of service has become more important to users, we’ve all come to realize the limitations of this approach as frames get dropped from video streams or productivity applications pause while the video goes full tilt. A different approach, and one that probably hasn’t received enough attention from the research community, is to dedicate cores to providing particular functions. The allocations become more static than what we see today, but they can certainly be changed over longer periods of time ranging from seconds to hours or even days.
As an example, we could conceive of a multi-function computing appliance that contains a processor with perhaps three dozen cores: we might allocate four of those cores to running the core productivity and collaboration applications. Another cluster of cores, on the order of a dozen, might provide very high quality graphics and visualization. Media processing, beyond encode/decode which would best be handled by dedicated hardware, would be the responsibility of yet another cluster of, say, six cores. Still other clusters might do real-time data mining on various streams of data flowing in from the Internet. Various bots operating within this cluster might be assembling news, shopping, or investing. The key idea here is to let the abundant hardware resources replace a lot of very complex OS code. It’s replaced by cluster or partition management code, which doles out the resources, but stays out of the way until there’s a major shift in the workload.
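A minimal sketch of what such partition management code might look like, assuming the simplest possible policy of handing out contiguous blocks of cores. The `Partition` type and the `build_partitions` function are hypothetical names, and the eight-core figure for data mining is an assumption, since the example above gives no number for that cluster.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    name: str
    cores: range          # contiguous core IDs reserved for this function

def build_partitions(total_cores, requests):
    """Hand out contiguous blocks of cores in request order.

    `requests` maps a function name to the number of cores it wants.
    Raises ValueError if the requests do not fit on the chip.
    """
    if sum(requests.values()) > total_cores:
        raise ValueError("requested more cores than the processor has")
    partitions, next_core = [], 0
    for name, count in requests.items():
        partitions.append(Partition(name, range(next_core, next_core + count)))
        next_core += count
    return partitions, total_cores - next_core   # second value: unallocated cores

# Core counts taken from the example in the text; the data-mining figure is assumed.
requests = {
    "productivity": 4,
    "graphics": 12,
    "media": 6,
    "data_mining": 8,     # assumption: the text gives no number for this cluster
}
partitions, spare = build_partitions(36, requests)
for p in partitions:
    print(f"{p.name:<13} cores {p.cores.start:>2}-{p.cores.stop - 1:>2}")
print(f"spare cores: {spare}")
```

Nothing here runs on the fast path. The table is rebuilt only when the workload shifts, which is the sense in which the manager stays out of the way.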
TJGeezer suggested using Tera-Scale capability along with huge amounts of NAND in an iPod-size container for AI applications. He may be right. One can easily imagine clusters of cores supporting an advanced human interface with real-time speech and vision or language translation. A lot of algorithmic development would have to take place to make this feasible, but there is no doubt in my mind that we’ll have the hardware resources needed to host them. The statistical algorithms that will form the heart of these future recognition systems are highly parallel and thus a great fit for a high core count architecture.
An abundance of cores also enables new ways to deal with challenges associated with system operation in the face of device failures and cosmic radiation. Think of the collection of cores as a redundant array of computing engines (RACE). Two or more cores could be used in tandem to detect and correct faults. If a core becomes unreliable, it can simply be removed from service without significantly affecting overall system performance.
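One well-known way to use redundant cores is majority voting: run the same task on several cores and accept the answer most of them agree on. The sketch below assumes that scheme, which may or may not be what a RACE design would use. The "cores" are ordinary Python functions, and the fault is simulated by flipping one bit of the result; every name is hypothetical.

```python
from collections import Counter

def run_redundantly(task, argument, cores):
    """Run `task` on every core and majority-vote the results.

    Returns (agreed_result, suspect_core_ids). Raises RuntimeError when no
    strict majority exists, i.e. the fault was detected but cannot be corrected.
    """
    results = {core_id: core(task, argument) for core_id, core in cores.items()}
    winner, votes = Counter(results.values()).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority: fault detected but not correctable")
    suspects = [cid for cid, value in results.items() if value != winner]
    return winner, suspects

def healthy_core(task, argument):
    return task(argument)

def flaky_core(task, argument):
    return task(argument) ^ 0x4       # simulate a single flipped bit in the result

def square(x):
    return x * x

cores = {0: healthy_core, 1: flaky_core, 2: healthy_core}
result, suspects = run_redundantly(square, 12, cores)
print(result, suspects)

# Retire the unreliable core; the remaining cores carry on.
for core_id in suspects:
    del cores[core_id]
```

With only two cores a disagreement can be detected but not resolved, which is why the sketch raises an error in that case. A third core is what turns detection into correction.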
As we pack more and more computing resources into smaller areas, managing power and heat in a very fine-grained manner will be critical. If we have more cores than are needed to execute the desired set of workloads, we can swap threads between cores whenever one becomes too hot. It’s like the hot potato game – move the potato fast enough and you never get burned. We’ll need the ability to adjust supply voltages, operating frequencies, and sleep states of individual cores in a matter of microseconds.
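The hot potato idea is easy to play with in a toy simulation. Everything numeric below (heating and cooling per step, ambient temperature, thermal limit) is invented for illustration and does not model any real chip. The policy is an assumption too: move the busy thread to the coolest core whenever one more step would push the current core over the limit.

```python
def run_hot_potato(num_cores, steps, heat_per_step=12.0, cool_per_step=2.0,
                   ambient=40.0, limit=85.0):
    """Simulate one busy thread hopping between cores to stay under `limit`.

    Returns (migrations, peak_temperature). All thermal numbers are made up.
    """
    temps = [ambient] * num_cores
    active, migrations, peak = 0, 0, ambient
    for _ in range(steps):
        # Move the thread first if another step would overheat the current core.
        if temps[active] + heat_per_step > limit:
            coolest = min(range(num_cores), key=temps.__getitem__)
            if coolest != active:
                active, migrations = coolest, migrations + 1
        for core in range(num_cores):
            if core == active:
                temps[core] += heat_per_step
            else:
                temps[core] = max(ambient, temps[core] - cool_per_step)
        peak = max(peak, max(temps))
    return migrations, peak

many_migrations, many_peak = run_hot_potato(num_cores=8, steps=24)
few_migrations, few_peak = run_hot_potato(num_cores=2, steps=24)
print(f"8 cores: peak {many_peak:.0f} after {many_migrations} migrations")
print(f"2 cores: peak {few_peak:.0f} after {few_migrations} migrations")
```

With eight cores each one has time to cool back to ambient before the thread returns, so the peak stays under the limit. With only two, the potato comes back too soon and the limit is exceeded, which is the case for having more cores than the workload strictly needs.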
While the challenges are somewhat mind-boggling on both the hardware and software sides to develop and fully utilize these future Tera-Scale platforms, the benefits and opportunities from putting these computing capabilities into the hands of all users are equally incredible.
So how many cores could you use, and what would you use them for? ArsTechnica user dg65536 said it best in his post – “Now that I think about it…80 isn't nearly enough.”
December 18th, 2006
Polaris Points the Way to Terascale Computing
Two months ago at the Intel Developer Forum, Intel’s CEO, Paul Otellini, unveiled a 300mm wafer that contained hundreds of massively multi-core prototype processors, each consisting of 80 simple, but programmable floating-point cores. While it was an early wafer fresh from Intel’s Fab 24 in Ireland, it generated a lot of attention and discussion in the press. Numerous excellent points (both positive and negative) were raised – with most of them centered on what you would do with this many cores and how you would program them. I’ll share my thoughts on these points in a later blog, but today I wanted to give a status update on what we call the Polaris prototype.
Just two weeks ago we received the first packaged Polaris processors. Within two hours of power-up, the very first chip in the test fixture reached 1.02 TFLOPS at 3.2 GHz while consuming less than 100W. The fact that we broke the TFLOPS barrier on A0 silicon is just amazing. It’s very special for me because it comes almost exactly a decade to the day after ASCI Red was the first system in the world to break that barrier – but it consumed over 500 kW and took up 2,500 square feet of computing space to do it.
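Those figures support a couple of quick derived numbers. The calculation below uses only values quoted above and treats "less than 100W" and "over 500 kW" as bounds rather than measurements. ASCI Red is taken at exactly 1 TFLOPS, which is an assumption made for the sake of round arithmetic.

```python
# Figures quoted in the post; the power numbers are bounds, not measurements.
polaris_flops = 1.02e12        # 1.02 TFLOPS
polaris_watts = 100.0          # "less than 100W", so treat as an upper bound
polaris_cores = 80
polaris_hz = 3.2e9             # 3.2 GHz

asci_red_flops = 1.0e12        # assumption: exactly at the 1 TFLOPS barrier
asci_red_watts = 500e3         # "over 500 kW", so treat as a lower bound

flops_per_core_per_cycle = polaris_flops / (polaris_cores * polaris_hz)
polaris_gflops_per_watt = polaris_flops / polaris_watts / 1e9
asci_red_gflops_per_watt = asci_red_flops / asci_red_watts / 1e9
efficiency_gain = polaris_gflops_per_watt / asci_red_gflops_per_watt

print(f"FLOPs per core per cycle: {flops_per_core_per_cycle:.2f}")
print(f"Polaris:  at least {polaris_gflops_per_watt:.1f} GFLOPS per watt")
print(f"ASCI Red: at most {asci_red_gflops_per_watt:.3f} GFLOPS per watt")
print(f"Energy-efficiency gain: at least {efficiency_gain:,.0f}x")
```

Each core is sustaining close to four floating-point operations per clock, and the energy efficiency has improved by more than three orders of magnitude in ten years.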
While this 80-core system is still very much an experimental design (go to the International Solid State Circuits Conference, session 5, to get all the technical details), it does point the way to the near future when teraFLOPS capable designs will be commonplace. Just think – within the past two years the industry has gone from single to dual to quad-core – and by Moore’s Law extrapolation, we’ll hit the 80-core mark with production processors in less than ten years.
December 4th, 2006
Mind the Gap
The last few months have been hectic to say the least. After the Intel Developer Forum in late September, I’ve been flying around the planet more or less non-stop. When I was in Europe and Russia last month, before heading to Japan and China just before Thanksgiving, a familiar phrase from the London Underground reminded me of a topic that I’ve wanted to blog about for some time – namely, closing the main memory – bulk storage latency gap that has plagued computer architecture for the last four decades.
Mind the Gap
At the fast end of the memory hierarchy, excluding on-chip caches, we have low-latency DRAM, but at over $100 per gigabyte, a lot of PCs still ship with only half a gigabyte. While a gig of DRAM may seem like a lot to those of us who can remember when the PDP-8 had only 4KB of core memory and a paper tape reader, a gigabyte is not nearly enough to hold my Outlook archive folders, a high-def movie, or a desktop search index file. I’m sure you share that frustrating feeling when you see your hard-drive light turn on and stay on as an application launches or more data gets paged into memory from disk.
Moving out one level in the hierarchy, magnetic disk has been the bulk storage technology of choice for decades. While disk continues to grow in capacity with relatively fixed cost, those capacity improvements have not been matched with similar reductions in random access latency. Over the past 10 years alone, processor performance has increased by over 30X while measured hard-drive performance has increased by only 1.3X. And, the gap will continue to grow as processor performance scaling moves to the new multi-core trajectory.
To put a finer point on it, we’ve had to make do with a factor of 100,000 difference between DRAM and HDD performance (random read latency of 150 nanoseconds vs. 15 milliseconds) and about two orders of magnitude in cost per bit for equivalent capacity. The trade-off between main memory and hard disk performance and cost affects system design and software design in fundamental and profound ways.
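The numbers above are easy to check and to restate as yearly rates. The sketch below uses only the quoted figures (30X and 1.3X over ten years, 150 nanoseconds and 15 milliseconds) and assumes smooth compound growth, which is a simplification.

```python
YEARS = 10
cpu_gain, disk_gain = 30.0, 1.3            # ten-year improvement factors from the post

cpu_rate = cpu_gain ** (1 / YEARS) - 1     # compound annual improvement
disk_rate = disk_gain ** (1 / YEARS) - 1
relative_widening = cpu_gain / disk_gain   # how much further apart after ten years

dram_ns = 150                              # random DRAM read latency
disk_ns = 15 * 1_000_000                   # 15 ms expressed in nanoseconds
latency_ratio = disk_ns // dram_ns

print(f"Processor: about {cpu_rate:.1%} per year; disk: about {disk_rate:.1%} per year")
print(f"Processor/disk gap widened {relative_widening:.1f}x in ten years")
print(f"Disk is {latency_ratio:,}x slower than DRAM for a random read")
```

Seen as annual rates, processors have been improving by roughly 40 percent a year against less than 3 percent for disk, which is how a gap this large builds up.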
Coping with the Gap
Minding the gap means application developers must constantly manage the placement of data. They need to anticipate huge latency hits that can occur in a seemingly random fashion when a desired datum is not in memory. And, they need to anticipate the different target system configurations which will have a direct bearing on how the user perceives application performance.
OS developers have struggled with the gap for decades and have had some modest success hiding it. Virtual memory was invented to relieve application developers of the hassle of managing overlays, but it is very easy to push the notion of virtual memory too far. Push the ratio of virtual to physical memory too high and “a thrashing we will go” as paging rates turn exponential. The tendency of virtual memory systems to exhibit such poor performance when configured with too little DRAM has given rise to the belief that, “virtual memory is a great idea as long as you never use it.” Fortunately, Moore’s law has made doubling the DRAM in the system the usually affordable fix when the disk activity light never seems to go out.
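The reason virtual memory falls off a cliff is visible in the standard effective-access-time formula: the average cost of a memory access is a weighted mix of the DRAM latency and the disk latency. The sketch below plugs in the 150 nanosecond and 15 millisecond figures quoted earlier. The fault rates are arbitrary sample points, not measurements.

```python
DRAM_S = 150e-9      # random DRAM read, figure from the post
DISK_S = 15e-3       # random disk read, figure from the post

def effective_access_time(fault_rate):
    """Average memory access time when `fault_rate` of accesses page from disk."""
    return (1.0 - fault_rate) * DRAM_S + fault_rate * DISK_S

for fault_rate in (0.0, 1e-6, 1e-4, 1e-2):
    slowdown = effective_access_time(fault_rate) / DRAM_S
    print(f"fault rate {fault_rate:.0e}: {slowdown:,.1f}x slower than pure DRAM")
```

Because the two latencies differ by a factor of 100,000, even one page fault in ten thousand accesses makes the average access roughly eleven times slower. That is why adding DRAM so often looks like the only fix.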
It should come as no surprise that the search for a “gap filling” memory technology has been going on for decades. When I joined Intel three decades ago, we explored (at some considerable expense) magnetic bubble memory and later charge-coupled device (CCD) memory. Neither turned out to be dense enough or cheap enough to replace rotating magnetic storage. Many other technologies (e.g. holographic memory and, more recently, polymer memory) have been heralded over the years as being the long sought “gap filler” that will be a bit slower, but much cheaper than DRAM. Unfortunately, none of these widely-trumpeted devices panned out.
And the Winner Is?
What is a surprise is that a relatively unheralded technology, NAND flash memory, the same stuff you find in your digital music player or digital camera, looks like it may be the long-sought “gap filler” even though most people had given up looking. There are two approaches to bringing NAND into the memory hierarchy: so-called NAND disks and platform NAND, where the flash memory is integrated onto the motherboard. Let me leave the NAND disk approach for another blog while I focus on platform NAND for this posting. [Note: I have to slightly violate my promise not to tout future Intel products in this blog, but I’ll try to keep my enthusiasm, which is substantial, well in check.] Platform NAND currently goes by the code name Robson Technology at Intel and is slated for introduction with the next-generation mobile platform, codenamed Santa Rosa, in the first half of 2007.
In its initial configurations, Robson consists of up to 1 GB of NAND flash memory and an intelligent controller that fit either on a PCIe mini-card or directly down on the motherboard. In its Robson configuration, the NAND memory is used as a disk cache to temporarily store both applications and data. Since NAND has latency characteristics in the range of tens of microseconds and is non-volatile (maintains the memory image even when power is removed), it enables near “instant” resume from hibernation and applications launch 2X faster on average on Windows Vista. We also see lower overall platform energy consumption as the hard-drive spins up less frequently. The fact that NAND is typically 7X cheaper than DRAM doesn’t hurt either and makes Robson an excellent technology for filling the gap. Note that I say Robson and not NAND, because using plain NAND flash isn’t good enough to do the job on its own.
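To make the disk-cache idea concrete, here is a toy read cache sitting in front of a slow disk. It is a sketch under stated assumptions, not a description of how Robson works. The replacement policy (least recently used), the 50 microsecond NAND latency (one reading of "tens of microseconds"), and every name in the code are illustrative choices. Writes and non-volatility are left out entirely.

```python
from collections import OrderedDict

DISK_READ_S = 15e-3     # random disk read, figure from earlier in the post
NAND_READ_S = 50e-6     # assumption: "tens of microseconds" taken as 50 us

class NandReadCache:
    """Least-recently-used cache of disk blocks held in (simulated) NAND."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()      # block number -> cached data
        self.elapsed_s = 0.0             # total simulated read latency

    def read(self, block_number, read_from_disk):
        if block_number in self.blocks:
            self.blocks.move_to_end(block_number)     # mark as recently used
            self.elapsed_s += NAND_READ_S
            return self.blocks[block_number]
        data = read_from_disk(block_number)           # slow path: go to the disk
        self.elapsed_s += DISK_READ_S
        self.blocks[block_number] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)           # evict least recently used
        return data

def fake_disk(block_number):
    return f"block-{block_number}"

cache = NandReadCache(capacity_blocks=4)
launch_pattern = [1, 2, 3, 1, 2, 3, 1, 2, 3]          # same blocks read on each launch
for block in launch_pattern:
    cache.read(block, fake_disk)

uncached_s = len(launch_pattern) * DISK_READ_S
print(f"with cache: {cache.elapsed_s * 1e3:.1f} ms, disk only: {uncached_s * 1e3:.1f} ms")
```

Only the first touch of each block pays the disk penalty. Because NAND keeps its contents without power, a real cache of this kind would still be warm after a resume or reboot, which this in-memory toy cannot show.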
Overcoming the Weakness of NAND Flash
The one big issue with NAND as a gap filler is write endurance: NAND flash only supports a limited number of erase cycles before wearing out. That’s where Robson’s smart controller comes into play. Simply put, it uses clever wear-leveling algorithms to spread the block erasures evenly across the array, giving the NAND flash memory a service life consistent with the rest of the platform.
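The general idea of spreading erasures evenly can be sketched with a priority queue that always hands out the least-erased free block. This illustrates the concept only and is not the Robson controller's algorithm, which the post does not describe. Real controllers also have to relocate rarely changed data, track bad blocks, and much more. All names below are hypothetical.

```python
import heapq

class WearLeveler:
    """Pick the least-erased free block for each new write (simplified)."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        # Min-heap of (erase_count, block) so the least-worn block pops first.
        self.free = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self.free)

    def allocate(self):
        """Return the free block with the fewest erasures so far."""
        _, block = heapq.heappop(self.free)
        return block

    def erase(self, block):
        """Erase a block and return it to the free pool with its new count."""
        self.erase_counts[block] += 1
        heapq.heappush(self.free, (self.erase_counts[block], block))

leveler = WearLeveler(num_blocks=8)
for _ in range(8000):            # rewrite the same logical data over and over
    block = leveler.allocate()
    leveler.erase(block)         # old contents invalidated, block recycled

print(min(leveler.erase_counts), max(leveler.erase_counts))
```

Without leveling, all 8,000 erasures in this example would land on a single block. With it, each of the eight blocks absorbs an equal share, so the array as a whole lasts about eight times as long under this workload.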
The use of NAND as a disk cache is just the start of a major overhaul of the memory hierarchy. Samsung recently announced notebooks that use NAND to create a solid-state drive, completely eliminating the hard-drive. Further out in time, Intel and others are exploring technologies, such as phase-change memory (PCM), as a replacement for NAND flash. It’s too early to tell if PCM will go the way of magnetic bubble memory or if it will replace NAND flash, but the race is on for the future of non-volatile solid-state memory.
In the not too distant future, we can expect to see magnetic disk drives relegated to the role that tape drives play today, and even DIMMs may vanish from future motherboards. I’ll say more about that in another blog.
These changes will require us to rethink software architecture and implementation, including tuning of the operating system, drivers and applications. But the benefits are so tangible that the course is set and now the work must get done.
Going forward, let’s not just mind the memory/storage gap. It’s time to close the gap for good.
Justin Rattner is an Intel Senior Fellow and director of Intel's Corporate Technology Group. He also serves as the corporation's chief technology officer. The opinions expressed in this blog are his own and not those of his employer.