When every disk is a supercomputer, then what?

Jim Gray, Microsoft Research
NetStore '99 keynote address
October 14, 1999
Edited by Ben Chinowsky

I'm hard-pressed to think of something to say to you that you don't already know for sure — it's a very diverse group and many of you are experts in this area. The first thing I'd like to do is to start with acknowledgements for this talk. My thinking about the future of computer architectures was radically changed by discussions with Dave Patterson about four years ago, when he started talking about IRAM. His premise was that the RAM was going to migrate to the processors or the processors were going to migrate to the RAM, depending on which perspective you took. But he also argued that in fact disk controllers were going to become supercomputers. I'm going to walk you through the implications of that later in the talk. One of Dave's students, Kim Keeton, finished her dissertation just recently. Another student, Erik Riedel, just two days ago had his thesis defense at CMU. Both Kim and Erik investigated these issues, and their theses are well worth reading. I certainly learned a lot as a member of their thesis committees. There's another student at Berkeley, Remzi Arpaci-Dusseau, who is reinvestigating Amdahl's laws. Amdahl has two laws that he created for system balance. One of them was that if you had a million instructions per second, then you needed a million bits per second of I/O bandwidth. So you can imagine that when we're heading to tera-op machines, then we're going to need a terabit of I/O bandwidth, and that's a pretty interesting issue. The other issue was the ratio of processors to RAM. So each of these people has had a substantial influence on this talk and on my thinking about these issues.
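
As a quick illustration of that balance rule, taken at face value (a sketch, not anything from Amdahl or from this talk beyond the one-bit-per-instruction rule):

    # Amdahl's I/O balance rule, as stated above: one bit of I/O per second
    # for every instruction per second (1 Mbps of I/O per MIPS of CPU).

    def required_io_bandwidth_bps(instructions_per_second):
        """I/O bandwidth (bits/s) a balanced system needs, per Amdahl's rule."""
        return instructions_per_second  # one bit of I/O per instruction

    # A 1-MIPS machine needs about 1 Mbps of I/O...
    print(required_io_bandwidth_bps(1e6))    # 1e6 bits/s  = 1 Mbps
    # ...so a tera-op (1e12 instructions/s) machine needs about 1 Tbps of I/O.
    print(required_io_bandwidth_bps(1e12))   # 1e12 bits/s = 1 Tbps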

What I thought I'd do first is just to talk about the surprise-free future. What I mean by the surprise-free future is that we see things right now working in the laboratory, and these things will be in the market in five years. When these things come to market it's going to be a very different world. The first thing that's going to happen, I think, is we're going to see very very powerful and very very inexpensive processors. Another thing is that gigabit RAM is going to be upon us. That means the smallest memory you'll be able to buy is 128 MB. The magnetic areal density of disks is going to go up, and is nearing the paramagnetic limit. Networking is going to get very fast — gigabit-per-second WANs and 100 Gbps LANs. So that's the surprise-free future, and if you accept that, then you have some pretty obvious consequences. The way we do protected storage is going to change; the tradeoff between disk and tape, I think, is changing; and there are a few other things.

But the thing that I think is most surprising is that we're likely to have disks that are supercomputers. The simple way of thinking of this is that the processors migrate to where the sheet metal and power supplies are. So a disk has a power supply, a NIC has a power supply, a camera has a power supply, a display has a power supply, so you'll have processors wherever those are, and those processors are going to be enormously powerful. So when gigabit RAM arrives, the smallest memory you can buy will be 128 MB. What that means is that incidentally — and we'll have to talk about this — every disk controller will have at least 128 MB of RAM in it. A few years later it'll have four times that. The magnetic areal density has been improving year by year by year. Something that's really quite fascinating is that there was a ten-year lag between the magnetic areal density in the laboratory and the magnetic areal density in products. If you plot the laboratory density and you plot the product density, the curves intersect in about the year 2004. That is to say, the time lag between what happens in the laboratory and what happens in products has been shrinking. One of my colleagues pointed out that what that means is that you'll be able to go down to Radio Shack and buy your research results in a few years, because the products will be ahead of the research. [Laughter.]

But, no, that's not really what's going on. There are two things that are happening. One is that companies are being less open about what they've got hiding in the laboratory. Second, the lag between products and research is shortening. The field is much more competitive, and research is a key part of this competition. Looming on the horizon is the paramagnetic limit. As the density gets higher and higher and higher, the magnetic domains get smaller and smaller and smaller, and at a certain point they have so few atoms that they become unstable. When people started talking about this, about 10 years ago, the paramagnetic limit was at 10 Gb per square inch. We now have laboratory results at 20 Gb per square inch, and the paramagnetic limit has somehow risen from 10 to 50. From what I gather, that is due to much better materials. But the fact is that people are showing 20 publicly, and other people say they have ideas about how they can get very close to the paramagnetic limit. The net of this is that you can expect about a factor-of-ten increase in storage capacity per square inch (or per square something) in the next five years. And what that means is that the 50 GB drive that you can now get from Seagate is going to be a 500 GB drive.

Magnetic tape is something about which I have very controversial views. I think tape is a pain in the neck. It's slow, it's unreliable, and the software to manage it is simply terrible. I'm looking forward to the day when tape goes away, and I think we're close to that day. Right now you can buy 47 GB disks and you can buy 40 GB tapes. The disks run about three times faster in bandwidth. The tape folks manage to cover over some of their sins by using compression to increase bandwidth and capacity by twofold or even fourfold — well, you can do the same compression on disks. The access times for disks are much much better, and in fact you can get 4 TB of storage in a rack of disks, and you can get about 10 TB of storage in a rack of tapes — the same size rack, a size-for-size comparison. One of the things that's interesting is this little picture off here on the right. You can barely see it, I think, so let me try something else so that you can see it — let's go for 400% here. It's an interesting picture. This is a picture of the tapes at CERN. And these are the IBM 3480 tapes, I think — and so each of those tapes is 300 MB. If you take that rack of tapes there, they are equivalent to one 50 GB disk. So it is the case that disk storage actually has pretty good volumetric density, and in fact pretty good price/performance, because that disk is going to give you better bandwidth than you have to those tapes. I don't know if you can see that there's a guy standing there, and he's actually going in and picking the tapes off the shelf. He's a fairly expensive peripheral. So the point is that I think many of the tape libraries are going to be supplanted by disks, if disk technology keeps going the way we are predicting.
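
As a rough sanity check on that size-for-size comparison, here is a sketch of the scan times for the two racks. The rack capacities are the ones quoted above; the per-device transfer rates, drive counts, and number of tape read stations are my assumptions, not figures from the talk:

    # Rough scan-time comparison for the racks described above.
    # Assumed rates (not from the talk): ~15 MB/s per disk, ~5 MB/s native
    # per DLT 7000 read station, with only a handful of read stations.
    MB = 1e6
    disk_rack_capacity = 4e12          # 4 TB of disks in one rack
    tape_rack_capacity = 10e12         # 10 TB of tapes in the same size rack

    disks_per_rack, disk_rate = 80, 15 * MB      # assumed
    tape_drives,    tape_rate = 4,  5 * MB       # assumed read stations

    disk_scan_h = disk_rack_capacity / (disks_per_rack * disk_rate) / 3600
    tape_scan_h = tape_rack_capacity / (tape_drives * tape_rate) / 3600

    print(f"disk rack scan: ~{disk_scan_h:.0f} hour(s)")   # about an hour
    print(f"tape rack scan: ~{tape_scan_h:.0f} hours")     # ~140 hours, i.e. days

Under those assumptions the disk rack can be scanned in about an hour, while the tape rack takes days, which is the point about bandwidth and access being on the disks' side.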

Another thing that's coming our way is a system on a chip. About 75% of the surface area of a chip today is RAM. The caches on many of the chips are an order of magnitude larger than the memory that I started out with when I started programming. Looking around, there are a lot of old guys in the room, and I assume that you worked with even smaller machines than I did. But a 4 MB cache seems like a huge amount of space to somebody who grew up with, say, a 7094 or a 1401. So what I think is happening is that the processors have a lot of memory on them, and in addition now the processors are getting I/O interfaces to them. People are integrating standard I/O onto the processors, and it's quite possible that other I/O interfaces are going to appear on processors in the future. You can buy a 486 now for $7. I'm not sure that very many of you want to buy a 486, but Meridian, which has now been bought by Quantum, makes something called a SNAP server. A SNAP server is a disk with a processor integrated into it, and is sold as a file server. It has an Intel 486 processor in it, and runs FreeBSD. You can buy a 233 MHz ARM for about $10, and you can think of it as a system on a chip. The Celerons and AMD 266 MHz processors are about $30. You can expect these processors to come down to a unit price of $5 or even $1 in the next five years. So in five years today’s leading edge processors are going to be very inexpensive, and the high end is going to be something pretty astonishing — billions of instructions per second.

The whole area of I/O inside of a computer is a mess. I believe some of the mess in the open-systems area has been resolved recently. IBM and their friends, and Intel and their friends, had been at war until two or three months ago. They were each going to do a competing I/O standard, and it was going to be a nightmare for the rest of us. They declared peace, and they are now promising a converged standard called Standard I/O. I think that it's likely that within five years it will actually happen — they're promising a lot sooner than that. One of the things this is going to do is to replace PCI with something else, so there'll be a serial link, many serial links, coming out of the processor, and those serial links are going to be running on the order of 1 to 10 gigabits per second, perhaps in five years at 10 gigabytes per second. And they will be able to extend a few meters. These will give you the ability to build very impressive system area networks of small diameter, and then there'll be switches that can take these and extend them for longer distances. So the system area networks, and the VIA architecture that Intel promulgated along with Compaq and Microsoft, are likely to morph into this standard I/O model.

It's important to get a sense of scale, a sense of what this really means. Many of you old guys remember when I/O was less than 1 megabyte per second, and then it went to 3 and so on. But many people started with SCSI, at 5 megabytes per second. Then we got fast wide SCSI at 20, and we got Ultra SCSI and so on, but the fact is that right now with gigabit Ethernet we are at 120 megabytes per second — that huge pipe dominates all the others in this picture, which attempts to be drawn to scale. And 1 gigabyte per second is enormously more than 1 gigabit per second. I will come back to this in a while.

Bandwidth is improving at an extraordinary rate. In the laboratory, people are demonstrating 3 Tbps on a fiber for substantial distances. Right now the fastest network that Microsoft has is an OC-48 link, which is about 2.5 Gbps, that runs between the Microsoft campus and the Westin hotel here, which is a gigaPoP for the Seattle area. There's nothing on the Microsoft campus that runs at that speed, so this pipe arrives on the campus, and we have to multiplex it down to the slower ATMs and gigabit Ethernets and other slow devices that we have on campus. Where that link is supposed to go is to the University of Washington. The University of Washington has an OC-192 link, just sitting around with, again, nobody really ready to use that link. There's plenty of bandwidth lying around; much of it is protected by tariffs, but it's being deployed at an enormous rate, and it's going to dramatically change the way we use storage.

Does everybody know what WDM [wavelength division multiplexing] stands for? WDM makes it possible to have very high bandwidth, with many 40 Gbps links per fiber. So, “connected” bandwidth will be huge.

Mobile bandwidth is another matter. It seems to be asymmetric: I can easily get 100 Mbps to the mobile computer, but the power needed for the mobile computer to send high data rates is prohibitive with current battery and radio technologies.

This high bandwidth enables another trend that's going to make life interesting. We're going to a world of thin clients. What many people seem to think a thin client means is a stupid client. That's actually, in my opinion, not what a thin client is. What a thin client is, is a stateless client, which is to say, if you lose your PalmPilot or if you lose your cell phone, you buy a new cell phone, you plug it into the network and it refreshes its state. What that means is that all the bits that are in your client can be downloaded from somewhere else. All the bits that are in your digital camera very quickly trickle through the network to a server, and don't get lost — so that you don't have to carry around a huge store inside of your digital camera. In your laptop everything you're doing gradually, or fairly quickly, trickles to a server somewhere.

Hotmail, which is a basically free mail system, allows its 50 million users each a 5 MB mailbox. With time that will grow to a gigabyte, and then 10 GB and 100 GB. And with time, Hotmail hopes that it'll have more than 50 million users. If you do the math, that's lots of petabytes of storage. There are companies now appearing that are offering SOHOs — small offices and home offices — electronic vaulting of their file stores. In that world, you don't have to archive your data. The system either is continuously connected, or dials up at off hours and takes all the changed files and makes copies of them on the server. If you lose a file you go to these folks and they give you a copy of the file.
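
The arithmetic behind "lots of petabytes", using just the numbers quoted above:

    # The arithmetic behind "lots of petabytes" for a Hotmail-class service.
    users = 50e6

    for mailbox_bytes, label in [(5e6, "5 MB"), (1e9, "1 GB"),
                                 (10e9, "10 GB"), (100e9, "100 GB")]:
        total_pb = users * mailbox_bytes / 1e15
        print(f"{label:>6} per user x 50M users = {total_pb:,.2f} PB")
    # 5 MB -> 0.25 PB;  1 GB -> 50 PB;  10 GB -> 500 PB;  100 GB -> 5,000 PB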

If you again think about how many bytes there are out in the SOHOs, and then you take the server and say, OK, you have to have that many bytes there, you end up with a huge storage server. Microsoft Windows 2000 is coming very soon, and Windows 2000 has something called IntelliMirror built into it. IntelliMirror is the commercialization of the ideas in the Coda file system. The Coda file system had the idea that the desktop and the laptop are just cache for the server, and so the files that you have here are all replicated back on a server, and when you reconnect the files get resynchronized. Fundamentally the system of record is the server. So however many bytes you have on your laptops, you have approximately that many bytes on your servers.

And then there is, for many people, an interest in actually going back to application hosting, the so-called ASPs — a recentralization of computing. Most of these folks never experienced the horrors of being at the mercy of the “computer center”, but they are about to learn. These ASPs are going to be both huge servers and huge stores. So, for example, Sun has started offering StarOffice. The first instantiation of StarOffice is something that you download to your laptop, but Sun's longer term vision for StarOffice is that it's going to be much more like an X-Windows system, where all of the office work is actually done by the server, and the only thing that your desktop is doing is tracking the keyboard and the mouse.

So, there are many more instances of this, but the point is that there are going to be huge servers in the future.

To argue some of the points I'm about to make, I need to introduce some terminology. First, there are some standard storage metrics that are universally agreed to, and if you go to the data sheets on most devices they will give you these storage metrics. They are: how much capacity the device has, what's the access time, and what's its transfer rate. The storage metrics that people don't talk very much about are how many kilobyte accesses per second (Kaps) and megabyte accesses per second (Maps) the system offers, which actually can be derived from some of the earlier numbers, and how long it takes to scan through the data set (SCAN). If somebody gives you a disk, how long does it take to read all the data on the disk? If somebody gives you a tape archive, how long does it take to read all of the data in the tape archive? Increasingly there are applications that run as a data pump, and I think in fact for this community that must be a very common application. When you can't afford to access your petabyte at random, what you have to do is scan through the petabyte systematically over the space of a day, and you tell people "I'm going to be going through the petabyte of EOS/DIS today, and I'll be going through it tomorrow, and I'll be going through it the day after, and if you have any questions, tell me your question, and I'll give you the answer at the end of the day, because the data will be coming by, sometime today."

So, in addition to the standard storage metrics of capacity, access time, and transfer rate, we have kilobyte accesses per second, megabyte accesses per second, and scan time.
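
Here is a minimal sketch of how those derived metrics fall out of the data-sheet numbers, for a hypothetical late-1999 drive; the capacity, access time, and transfer rate below are illustrative assumptions, not figures from the talk:

    # Derived storage metrics for a hypothetical late-1999 disk drive.
    # Data-sheet numbers (assumed for illustration):
    capacity_bytes = 47e9        # 47 GB drive
    avg_access_s   = 0.010       # ~10 ms average seek + rotation
    transfer_rate  = 15e6        # ~15 MB/s media transfer rate

    def accesses_per_second(request_bytes):
        """Random accesses/second of a given size: 1 / (positioning + transfer)."""
        return 1.0 / (avg_access_s + request_bytes / transfer_rate)

    kaps = accesses_per_second(1e3)        # kilobyte accesses per second (Kaps)
    maps = accesses_per_second(1e6)        # megabyte accesses per second (Maps)
    scan_hours = capacity_bytes / transfer_rate / 3600   # SCAN: read it all

    print(f"Kaps ~ {kaps:.0f}/s, Maps ~ {maps:.1f}/s, scan ~ {scan_hours:.1f} h")
    # Kaps ~ 99/s, Maps ~ 13.0/s, scan ~ 0.9 h

Notice that Kaps is dominated by the positioning time and Maps by the transfer time, which is why the two numbers behave so differently as drives grow.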

This diagram tries to show that in fact RAM has the best performance numbers and the worst price numbers. Disks have great performance numbers and not such great price numbers. (These slides will be on my web site and I'm happy to mail them to you in whatever form you like, PDF or PowerPoint, if you're interested.) The scan time for tapes is abysmal. Oftentimes it takes several days to scan through a tape archive, depending on how many read stations there are. And the prices are not very attractive either. In particular, it costs about $129 per terabyte to scan through a tape, when you take the amortized cost of the tape. These numbers were all done with DLT 7000 for the tapes.

Another important thing to realize is that there is the access-time myth. The graph on the left is the official story of how much time it takes to access a disk. There's supposed to be a fair amount of time that goes into seeking, there's a fair amount of time that goes into rotating, and then there's a little bit of time that goes into transferring. The measured result is that most seeks are short, in fact the most common seek is of length zero, there's some time spent rotating, in fact the expected time, and then there's a great deal of time spent transferring, and that's because people are moving a lot more data than you would think they're moving. But most of the time goes into waiting for the arm, because there's somebody ahead of you in the queue, and typically half the time is going into waiting, and when we get into the discussion of where we're headed, it's going to turn out that arms are the scarce resource, and that we're going to have much much more data under the same number of arms, and so the queuing for the arms is going to be much much greater.

This particular slide is a blowup of the third graph on the previous slide, and it shows that the ratio between RAM and disk storage prices has been somewhere between 30:1 and 100:1 over the last couple of decades, and that if you wait six years, what you can afford to keep on disk you can afford to keep in main memory. So just as a rule of thumb, whatever you're keeping on disk today, if you can afford to store it on disk, in about six years you can afford to keep that in RAM. And if you project this out, with budgets staying the same, you'll have huge stores in a few years, because the capacities of both are going to be very large.

So now we get to the consequences side of this. I think if you just project out these numbers, you come up with absolutely absurd computer architectures. You come up with 256-way NUMAs, with terabyte main memories and 500 GB disk drives, and petabyte storage farms. You end up with a very very nice interconnect between them. And in particular the picture you get is of this extreme segregation between processors, RAM, and storage, and the need for huge bandwidth among them. So what we talked about earlier, about the metrics, about how long it takes to access things, and how long it takes to get to data and so on — all the action is happening over there on the left in the purple area, and somehow we have to get the data from the disks over to the processors and then get the data processed and back again. A way of thinking about this problem, and I think it helps a lot, is to think about how many clocks it takes to get to the various levels of storage. The number of clocks it takes to get to registers on a processor is 1, on-chip cache is 2 clocks, on-board cache is 10 clocks, getting down to main memory is 100 clocks, getting to disk is a million clocks, getting to tape is a billion clocks.

In human terms, a register is like asking the person next to you a question: they say "just a minute" and they give you the answer. If they don't know the answer they ask somebody else in the room, and that's the cache, about 2 minutes. If they have to ask somebody in this hotel or in this city, it's about 10 minutes; main memory is an hour away, down in Olympia — and disk is like going to Pluto, it's two years away. [Laughter.] And tape is like going to Andromeda, it's 2000 years away. And I'm not sure how we're going to get to Andromeda in 2000 years. The point is that if you're going to be needing something from tape, you'd better know about it 2000 years in advance, because otherwise your processor is going to click once every 2000 years, if you're trying to do list processing on tape. [Andromeda is about 2 million light years away, so even being able to get there and back in 2000 years would represent a great breakthrough. — ed.]
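
The analogy works out if you let one clock tick stand for roughly one human minute; that one-minute-per-clock scale is my reading of the analogy, not something stated explicitly in the talk, but the numbers check out:

    # Sanity-checking the "human scale" analogy: one clock tick ~ one minute.
    MIN, HOUR, YEAR = 1, 60, 60 * 24 * 365

    levels = {           # latency in clocks, from the previous paragraph
        "register":       1,
        "on-chip cache":  2,
        "on-board cache": 10,
        "main memory":    100,
        "disk":           1_000_000,
        "tape":           1_000_000_000,
    }

    for name, clocks in levels.items():
        minutes = clocks * MIN
        if minutes < 2 * HOUR:
            print(f"{name:>14}: ~{minutes} minutes")
        else:
            print(f"{name:>14}: ~{minutes / YEAR:.0f} years")
    # disk -> ~2 years ("Pluto"), tape -> ~1,900 years ("Andromeda")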

So now let me go back to this picture here and say that it's very very important for the processors to be close to the data — not 2,000 processor years away from it. It's very very important for the processors to have local memory and to have extremely good locality, and in fact in my opinion the best locality would be to have the processors right next to the disks.

So those are some of the, in my opinion, absurd consequences. These are some of the not-so-absurd consequences. There was a rule of thumb about twenty years ago that said that if you had 10 GB of storage, you needed a system administrator to keep track of that 10 GB. That rule of thumb today is that if you have about 5 TB you need an extra administrator. I've asked around to various people and said, "so how many people do you have managing your disks; doing backups, keeping track of storage, and when a disk breaks fixing it, and so on." People who have 5 TB and 50 TB stores have between one and ten administrators for those stores. This is just a rule of thumb — people in the audience may have different experiences. The thing that's interesting is that 5 TB can actually be bought, with the processors and so on, for about $60,000 today, if you shop at the right place. I know you can pay a lot more than that for 5 TB, but it's going to be $10,000 in a few years. Having a person manage a $10,000 resource is not a good deal. So if you look at the cost of storage, it is increasingly dominated by the cost of managing the storage. We do not want 200 people managing a petabyte archive.

If you look at the picture at the bottom there, you see that rack of machines, that's all going to shrink down to one rack pretty soon. But that person is not going to shrink, and the cost of that person is going to grow. The burden cost for that person is between $150,000 and $250,000, depending on which organization you work for. The cost of that person is way more than the cost of that storage, so you can burn a lot of storage by automating management and still come out ahead. One of the things that I think is a challenge for us all is to have absolutely automatic storage management. In this new world you're going to be in a situation where you're asked to store a lot of data, and you're going to also be asked to manage it with almost no cost.
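
The economics in round numbers, using just the figures quoted above:

    # Management cost vs. hardware cost, in the round numbers quoted above.
    storage_cost_now    = 60_000      # ~5 TB of disk plus processors, today
    storage_cost_soon   = 10_000      # the same 5 TB in a few years
    admin_burden_yearly = 200_000     # burdened cost of one administrator (mid-range)

    print(f"{admin_burden_yearly / storage_cost_now:.0f}x the hardware, today")   # ~3x
    print(f"{admin_burden_yearly / storage_cost_soon:.0f}x the hardware, soon")   # ~20x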

If you project out the disk sizes, you come up with half a terabyte to a terabyte of storage, with 100 megabytes per second of bandwidth to it and 200 accesses per second. It takes two and a half hours to go through that disk. You get one access per second per 5 GB. So there's a notion of how hot the data is, and again, there used to be a rule of thumb that one disk arm could really only handle about 250 MB to 500 MB of storage. And here we are putting one disk arm under a terabyte of storage.
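
Working through the projection for the one-terabyte case, with no numbers beyond those just quoted:

    # The projected ~1 TB drive, worked through.
    capacity   = 1e12      # 1 TB
    bandwidth  = 100e6     # 100 MB/s
    accesses_s = 200       # random accesses per second

    scan_hours    = capacity / bandwidth / 3600
    gb_per_access = capacity / accesses_s / 1e9

    print(f"scan time: ~{scan_hours:.1f} hours")          # roughly 2.5-3 hours
    print(f"one access/sec per ~{gb_per_access:.0f} GB")  # ~5 GB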

The net of this, I think, is that people will not buy these huge disks. They will end up buying smaller capacity disks. So one possibility is to notice that the disk is ten platters, and so you can treat the arms as ten independent things — at least you can get the bandwidth out of all ten of them. And alternatively you can in fact put ten independent single-platter disks all inside of one box, and then you get one platter per arm.

Still, having only one arm for 200 GB is not a very attractive deal. In the database area there's a benchmark called the TPC-C benchmark, which is an online transaction processing benchmark; there are others called TPC-H and TPC-R, which are decision support benchmarks. In those benchmarks, people routinely use 4 GB and 9 GB drives, even though 18 GB and 37 GB drives are fairly common these days. The reason they use the smaller drives is that they need the disk arms. The extra capacity doesn't help them at all. So, there's another study that will appear at SIGOPS measuring the occupancy of disks inside of Microsoft. They looked at several terabytes worth of disk storage, many different file systems. They found that many disks are empty, and that most disks are between 40% and 50% full. They also found that the large disks are mostly empty. So it's quite likely that when the disk industry comes to us with 100 GB drives and 500 GB drives, that we will say "please give us more arms with those drives."

So one solution — and this has been a solution all along — is to shrink the drives, to make them smaller, because of course when you make them smaller and you have only one platter in them, they shrink back down to something that is more manageable. So for example right now, a single platter at 3-1/2" is about 3 GB. If you go a few years forward that's going to be 30 GB, but if you shrink it back down to a 1" form factor, then it would shrink back down to something like 5 GB or 10 GB. So I guess one of the predictions I'd make is that in a few years we're going to be buying disks in six-packs in some format — and I'll talk about one of the formats in a moment — but that fundamentally the individual disks will share the packaging, the power, and the interfaces of a disk package, but inside that package will be multiple disks.

What kind of RAID will we use to organize these disks? Another consequence of arms becoming precious is that we are going to go from RAID 5 back to RAID 1 plus RAID 0, called RAID 10. RAID 0 stripes your data across multiple disks. It gives you capacity and additional bandwidth, because now you can read from all the disks in parallel. But it uses multiple disk arms to do the read. RAID 1, variously called "mirrors" and "shadows", gives you fault-tolerant disks. Reads are slightly cheaper; writes are twice as expensive, and writes are in fact slower than writes to a single disk. And RAID 5 works more or less as follows. When you read, you read from whatever disk you want to read from. When you write, you first read the original value of the block and the parity value for that block, then you compute the new parity from the old data, the old parity, and the new data, and you rewrite the data block and the parity block. So in essence, a write is four I/Os. Now if disk arms are a scarce resource, RAID 5 is a bad idea. Let me say that again. If disk arms are a scarce resource, RAID 5's a bad idea, because it uses four I/Os to do a safe write, rather than two I/Os. There are various approaches to try and take that four and bring it down closer to two, but all of those approaches have their disadvantages.
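
Here is a minimal sketch of the two small-write paths being compared, just to make the I/O counts concrete; the read_block and write_block hooks are hypothetical placeholders, not any particular controller's interface:

    # Small-write I/O counts: RAID 5 read-modify-write vs. a RAID 1/10 mirror pair.

    def raid5_small_write(read_block, write_block, data_disk, parity_disk, new_data):
        """Classic RAID 5 read-modify-write: four I/Os per logical write."""
        old_data   = read_block(data_disk)               # I/O 1: read old data
        old_parity = read_block(parity_disk)             # I/O 2: read old parity
        # new parity = old parity XOR old data XOR new data
        new_parity = bytes(p ^ d ^ n
                           for p, d, n in zip(old_parity, old_data, new_data))
        write_block(data_disk, new_data)                 # I/O 3: write new data
        write_block(parity_disk, new_parity)             # I/O 4: write new parity

    def raid10_small_write(write_block, primary, mirror, new_data):
        """Mirrored write: two I/Os per logical write, one to each copy."""
        write_block(primary, new_data)                   # I/O 1
        write_block(mirror,  new_data)                   # I/O 2

Four arm movements per write versus two is the whole argument: when arms are the scarce resource, the mirrored organization wins even though it costs more capacity.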

This slide just repeats what I just said — RAID 5 has certain performance disadvantages, but it saves space. In the future space is not going to be the scarce resource; arms are going to be the scarce resource. So what's generally called RAID 10 will probably become the norm. RAID 10 is stripes of mirrors: you pair disks for reliability, and then you stripe them for capacity and bandwidth.

There's another thing that is kind of interesting, which is that with today's storage racks, you get an 18" rack and you stuff it with 50 GB drives, and you get about 4 TB of storage inside the box. You get a lot of disk controllers; you get about 24 storage processors in there. Those storage processors right now aren't doing anything very useful except RAID. My premise is that those storage processors in the future will be doing your applications, and in fact that's where your computational processors will live.

I guess the last point is that it's hard to archive a petabyte. And it's even harder to restore a petabyte. And appreciate that when your petabyte goes south, people are going to want it back right away. The users are not going to want to wait for a while. So my premise is, if you're building a petabyte store you have to geo-replicate it. My favorite example is EOS/DIS with its 15 PB store, but if you're building Hotmail or you're building something else like that, you have to store the data in two places. The interesting thing is that if you store the data in two places you can store it in two different ways in the two places, so for EOS/DIS for example, you can store it by time in one place and by space in another place. You can store all the data for Seattle clustered together in one data center, and you can store all the data for today clustered together in another data center. What you have to do for both data centers is to scrub the data continuously, which is to say look at it and make sure that it's readable and it's correct, and if you find any errors then go to the other data center and get a copy of it.
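
A sketch of that continuous scrub loop, with hypothetical local_store and remote_store objects standing in for whatever interfaces a real geo-replicated store would expose:

    # Continuous scrubbing of a geo-replicated store: read every block, verify it,
    # and repair from the remote replica on any error. All hooks are hypothetical.
    import hashlib
    import time

    def scrub_forever(local_store, remote_store, expected_checksums):
        while True:                                   # the scrub never finishes
            for block_id in local_store.block_ids():
                try:
                    data = local_store.read(block_id)
                    ok = hashlib.sha1(data).hexdigest() == expected_checksums[block_id]
                except IOError:
                    ok = False                        # unreadable counts as bad
                if not ok:                            # unreadable or corrupt:
                    good = remote_store.read(block_id)    # fetch the other copy
                    local_store.write(block_id, good)     # and repair in place
            time.sleep(60)                            # pace the next full pass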

The things I've said so far, I think, are fairly predictable, they are things that I think are very likely, and now I'd like to talk about some crazy ideas, or things that I think are quite speculative. One thing is, what happens if the disks shrink down to the small form factors? I believe that will happen, which is to say, shrink down to an inch or less than an inch. When the IBM micro-drives become very inexpensive and ubiquitous, one can think about surface-mounting them the way we surface-mount memory chips. So rather than buying a memory card for your system, you buy a disk card for your system. This is the disk farm on a card, it's fundamentally 100 drives, each drive is 5 GB, so it's 1/2 TB in a standard card that you put into a rack. You can use the disks in lots of different ways, you configure them in lots of different ways, and it gives you lots of accesses per second.

An idea that is, I think, even more speculative — I've been hearing about it for five years, it might happen — is that we might use microelectromechanical systems (MEMS). There's a group of people who are using tunneling electron microscopes to get the kind of densities that you get for disks, with very very low access times — a millisecond or less. So it's possible that we could have a storage device which has no power consumption when it's turned off — one of the problems about disks is that you have to keep powering them as they're spinning around — so these things would have very low power consumption, be very reliable, and would have the kinds of spatial densities that disks have, and wouldn't have all the heat problems that disks have.

The main crazy idea I want to talk about is what happens when disks become supercomputers — which was the title of the talk. One way of thinking about this is to look at a disk that you buy. One side of the disk is fiberglass and full of integrated circuits. Those integrated circuits are going to shrink to a single circuit someday. That single circuit is going to be a system on a chip, and it's going to have alongside it a 128 MB RAM, or RAM of some sort. So the controller that comes with the drive, that adapts it, right now, to SCSI or IDE — that thing will be a supercomputer.

A similar thing is happening with the NIC. The network interface cards are increasingly learning about TCP/IP. So, what is happening with these networks that we have been talking about — standard I/O and storage area networks and system area networks — is that the network interface card understands about the protocol you're talking, be it TCP/IP, or VIA (virtual interface architecture), or standard I/O, and it will DMA from your memory to somebody else's memory. You already see specialized cards that are coprocessors that do displays.

This idea is not actually very radical, if you think about printers. When you buy a printer you're actually buying a fairly powerful RISC processor with a very simple operating system in it that runs PostScript, and PostScript in fact is a programming language. If you're into mobile code and you think Java is really cool, you should hear about Forth and PostScript. You can send Forth to almost any printer, and compute primes and do other things with it. Most people don't do that — most people print with it. But the fact is that when you buy a printer you're buying a device that prints but also is a freestanding processor whose interface to the outside world is typically Ethernet. And so what I'm proposing is that disks are going to have a gigabit Ethernet interface built into them, so there'll be just two things coming out of the disk drive: not a SCSI port, but an Ethernet port, and a power port.

So, if you look at the TPC benchmarks again, you notice that there's ten times more processing power in the disk controllers than in the CPUs. This was a shock to me — it was pointed out by Erik Riedel — you go out and you count the controllers and you notice that each of those RAID controllers is an i960 or some other RISC processor — an ARM or something like that. Then you count the number of instructions per second or the megahertz of each of those processors — and there are many many more disk controllers than there are CPUs. Those processors are actually fairly powerful. If you add up the processing power of the controllers, it's an order of magnitude more than the processing power of the central server. And each of those controllers has some RAM. Now the RAM they have is much less, in aggregate, than the RAM of the server. But it is possible to look at that system and imagine that you throw away the central processor, and you just program the disk controllers — especially if you think about those processors in a few years being substantially more powerful than the processors that we have today. So then the picture that you end up with is a picture more like the one at the bottom, where there's a terabyte-per-second backplane — that's its bisection bandwidth — and there are lots and lots of disks, each with its own processor associated with it. And again, some of those boxes will be NICs talking to the outside world.
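
To make the counting exercise concrete, here is a purely illustrative tally in the spirit of Riedel's and Keeton's analyses; every number below is a hypothetical stand-in rather than a figure from any actual TPC disclosure, and megahertz is of course only a crude proxy for processing power:

    # Purely illustrative tally: aggregate controller "MHz" vs. host "MHz".
    # All numbers are hypothetical stand-ins, not from any real TPC report.
    host_cpus, host_mhz       = 8, 500      # assumed SMP server: 8 x 500 MHz
    embedded_ctrls, ctrl_mhz  = 400, 100    # assumed RAID/drive processors: 400 x 100 MHz

    host_total = host_cpus * host_mhz              #  4,000 aggregate MHz in the host
    ctrl_total = embedded_ctrls * ctrl_mhz         # 40,000 aggregate MHz in the storage

    print(f"storage : host processing ~ {ctrl_total / host_total:.0f} : 1")   # ~10 : 1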

So, let's go back to the picture. There is lots of processing in the storage array. Right now it is doing formatting, arm scheduling and RAID control. EMC and others are adding functions like snapshotting, archiving, remote mirroring, and load balancing. They are looking for ways to add value to the disk store. Erik Riedel's work at CMU looked at how much processing was needed in the disks to do tasks like database applications, data mining, or image processing. Kim Keeton of UC Berkeley did a similar analysis as part of her thesis. Both of them concluded that fairly modest processors — 200 MHz — were enough to keep a disk arm utilized. Erik showed that, for example, database systems can run very comfortably in these processors. Image processing tasks such as edge detection or color histogram matching can be done very comfortably in these processors. Some kinds of data mining can be done very comfortably in these processors. One of the things that the database people learned early on is that the best thing to do with data is to not move it. If N is the number of times you move the data, then your performance is proportional to 1/N. So putting the processors next to the disks means that you need less bandwidth and less data movement, because you can filter the data very close to the source.

Today, the conventional strategy is to offload the host — moving RAID outboard, moving networking and crypto outboard. What I'm proposing is to really offload the host, to offload it so much that it's completely offloaded and you don't have to buy it in the first place.

So if you do that, just exactly how do you program all those little yellow boxes off on the right? My proposal is that you program them as a distributed or federated computer system. I have drunk the CORBA kool-aid or the COM kool-aid or the EJB kool-aid, or whatever object church you attend. I believe that all this stuff that we've been hearing about for all these years — object-oriented programming — is here. Your car’s rearview mirror says, "Objects are farther away than they appear." And I think those days are almost past, and objects are actually closer than you think. The premise is that we have system-area networks coming that have huge bandwidth; people are now showing RPC through COM+, or RPC through CORBA/IIOP/RMI, with acceptable response times. For example, the COM+ guys have demonstrated something like 50-microsecond RPC times over a SAN, and they've demonstrated data rates in excess of 100 megabytes per second in moving stuff from one place to another. So Microsoft's working away trying to make its object stuff work; Sun and IBM are working away making their object stuff work, and I think they are going to be successful. There's just a huge amount of momentum behind this. This is going to give us a situation where we can program and manage a distributed system.

A system in which every disk is a supercomputer is simply a distributed system much like the ASCI clusters that you see at Los Alamos or at Livermore. It is true that those systems are difficult to program, but that's because we've invested very little in the tools for programming them so far. I think that in the next five years those tools are going to mature, and in fact it's going to happen. I’m pinning my hopes on those object technologies (COM+ and EJB) and their associated tools.

So, again, the surprise-free future is that everything is going to get about ten times bigger, faster, and better, except for the bandwidth in and out of the processors and disks, and the accesses per second in and out of the disks. So — and I actually haven't emphasized this very much — the main memory latency problem will be a bottleneck along with the disk arm problem. If you just project those numbers out, I think you come up with absurd computer architectures, and I think the absurdity will drive us to a much more decentralized system. In fact I think that it may not be five years, but I'm fairly comfortable that in ten years we'll have a much more distributed system, in the style that I talked about, where the processors are close to the disks, or, as in Patterson's original conception, the processors are close to the RAM. I think some things that are not so speculative are that storage has to be much more auto-managing, that we will go from RAID 5 back to RAID 1, and that disk packs will be a standard package for things. So with that I'm going to stop, and I think we have about ten minutes for questions and comments and criticism.

Questions and Answers

Q: Aren’t database systems too expensive to put into a $100 disk?

A: So the question is, so what happens to the pricing model for database products, or for software in general? I was kind of vague about exactly what software would come packaged on the disk. Something pretty wonderful has happened in the last few years: software has become free. It's not especially wonderful for Microsoft, or Netscape actually. It started with Netscape's browser, I think — or was it FreeBSD? History has a way of forgetting the originators. I think that the software will be nearly free. So that is what I think is going to happen to the pricing. Something interesting has happened with databases — you asked specifically about database products. Microsoft's database system is free, which is to say, there's a version of the Microsoft database engine — not the tools and all of the development products, but the engine itself — which is part of MSDN, and you can include it in any product you want, and do whatever you want with it, and it's free. Pretty amazing. It's been castrated to some extent, it doesn't support thousands of users, I think it's limited to support five concurrent users, etc. IBM has done a similar thing with the DB2 database product. There's a version of DB2 which is free, and you can do whatever you want with it, you can make derived works from it and so on. Inside of FreeBSD there's a free database system. You with me so far? OK. Then there's this operating system that's this darling of the computer industry right now, Linux, which is free. But FreeBSD has been free for a long time, and free in a much freer way than Linux, because if you take Linux and you improve it, then you have to give the improvement back to the Linux community, and if you take FreeBSD and you improve it, you can own your improvement, and there are lots of companies that want to own the improvements that they make.

So the interesting thing about Linux is that many people are going to Red Hat to buy a tested version and an integrated version and so on, of Linux. I guess the hope for the likes of Microsoft and Red Hat lies in the fact that 60% of Microsoft's development expenses are testing and integration. We actually increasingly spend less and less and less of our money on actually developing stuff, and more and more and more of our money on supporting it. Our R&D budget is 15% of the company; 60% of that R&D is integration and test. I expect that Red Hat will have a similar kind of cost structure associated with them. So if you're a customer and you want some software that's supported, then you're probably going to have to go to some support organization and pay for that. One can think of this free software as being an unbundling of the software from the integration, test and support. So if you don't need any support and you don't need any integration and test, and so on, then perhaps you can take one of these free things and just plug it into your disk. I'm not sure how exactly the software companies are going to be able to fund the development of the next generation of software, but it may be that the other 80% or 90% of the service that a customer gets from the Microsofts and the Red Hats is what will fund the development.

That was a long answer to your question — does it answer? — Yeah.

Q: Won't MEMS have the same density and access problems that disks have?

A: Yeah, that's the intuitive thing. The question is — so I mentioned briefly that there were these microelectromechanical storage devices that people were experimenting with. And intuitively there's a limit to how much you can do with mechanical systems. The density is not very attractive and the reliability is not very attractive. So that's my intuition too, and I'm wrong. When you get down to MEMS and microstructures, things get truly bizarre — mechanical intuition fails me. People are proposing to use tunneling electron microscopes, where they use individual atoms as their storage devices. I know this sounds crazy, it sounds crazy to me — I'm with you, right? But magnetic disks sound crazy to me. If somebody had described magnetic disks to me years ago, I would have said "Oh, that'll never work." And as recently as 1968 the people at IBM decided that magnetic disks were dead and they had to go to holographic storage, that there was just no way magnetics were going to keep going. What a crazy idea, heads and bearings and friction and wear and particles getting in the way and scratching and gouging — but in fact the magnetic disk guys have done the impossible. Now they're staring at the paramagnetic limit, and there's one camp that says, "Oh my God, it's the end of life as we know it" [Laughter] — and another group of people are saying "This is an opportunity to innovate, and in fact I've got dozens of ideas of how to get around it." Sir?

Q: Aren’t power density and the speed of light the real scaling limits?

A: So the question is, there are two scaling things that I didn't really mention. One was power, and the other was the speed of light. Well, the speed of light was hiding in there in the clocks and the fact that Olympia is an hour away — you don't go to the speed of light there, but fundamentally getting to main memory is as much a speed-of-light issue as it is a capacitance issue, and a bunch of other things. So moving the processors closer to the storage is in fact a good thing, and there's a lot of discussion here about the grid, and the fact that you can have the processors in California and the disks in New York, and in fact if you have terrible seek times and rotation times on the disks, then the fact that it's 60 milliseconds round-trip coast-to-coast for the light, isn't that big a deal. The fact is that at Los Alamos they have their CPUs and memories in one building and their disks in another building, and they're half a kilometer apart. The speed of light there doesn't really kill them because they have so many other problems in between. Moving the processors close to the disks definitely helps you a little bit with the speed of light, but it's mostly moving the processors close to the memory. That's Patterson's fundamental argument for IRAMs. A system on a chip has much smaller speed-of-light problems. It's not just speed of light, though, it's capacitance. So the polygons-per-second that you get in a game machine today is as much due to the faster processors as it is due to the fact that they can get the entire frame on-chip, and they can operate on it on-chip. It's much more like an IRAM, so they get the bandwidth.

Now, let's talk about the power, which is an interesting thing. One of the things that's interesting about disks is that in the old days they used to have these giant disks that were this big [indicating about 4 feet] and used vast quantities of power and were made out of stainless steel and were machined to incredible tolerances. When they failed, it was spectacular — lots of particles everywhere. In modern disks some of the parts are made out of plastic. The stresses go down as the cube of the size of the device. The distances go down, and so in fact as you make things smaller the power needs of the system go down. And the power requirements of a disk appear to go up as the fourth power of the linear dimension. So it's not that you just shrink it by the volume, but you actually get huge benefits in power consumption by going to smaller drives. It's possible that when we go to 1" drives we'll have much less power consumption. It is the case that for mobile things batteries are the absolute limiting issue, but for the storage racks that I was talking about, if you fill those racks up with disks, cooling is a challenge. People put fans on the things, and it works. It is the case that when we go to smaller and smaller form factors, in fact, the heat load goes down. So it's possible that, for example, the surface-mounted disks would be coolable without having heat sinks on top of each one of them. Did you want to elaborate on that?

Q: Fast processors use lots of heat and power. One-chip processors are either slow or they need high power and lots of cooling. Disk vendors cannot afford that as part of their budget.

A: I don't have any comments on that. I do look at the high-speed processors and notice that they use a huge amount of power. On the other hand I notice that the ARMs, which are running at 233 MHz, are actually fairly low-power. I was using Celeron, which is the Intel processor, which is again a fairly low-power device. Trying to cram more and more and more into one chip and do it faster and faster and faster, at some point you're going to have heat-sink problems.

Q: If the data is not stored locally, how can I be sure I can get to it? How can you guarantee the availability?

A: I think I mentioned availability obliquely by saying that you have to store data in two places to guarantee availability. Quality of service is really complicated, so let me try and address two parts of it. One is that if a bunch of people come to the same disk and want to read different parts of the disk, then you get a long queue of requests piled up behind that arm. The only way I know of dealing with that is to overprovision the arms so that you can deal with peak load. I just don't see any other way — you can cache the stuff in main memory, but that's also expensive. There are various algorithms, various ways of arguing and saying, OK, if the data is really hot then keep it in main memory — but what about the Starr report? It comes out and all of a sudden everybody wants to read it, and you didn't actually know that everybody was going to want to read it that much.

So that's the overprovisioning for disk arms. There's a similar problem in the network. How do you get the bandwidth? And there are all sorts of tricks you can play. You can play caching tricks, you can play multicasting tricks, you can do all sorts of things, and indeed all those things helped in distributing the Starr report. But in the end — subtract all of that out — you have to have enough bandwidth to carry the demand. What the phone companies have done historically is overprovision. They've just put in more wires than were needed. What the Internet did was to show that overprovisioning is incredibly wasteful — the wires are empty most of the time — and that if you do packet switching you can use 100 times less capacity and still carry the load, and everybody's very happy. We're now in that statistical multiplexing space — there's a large community in the supercomputing area and in the networking area who are really excited about quality of service and giving people quality of service and so on. But when you talk guarantees, you are talking reservation of bandwidth, and statistical multiplexing goes out the window. If bandwidth is growing 300% per year, and if bandwidth is as inexpensive as I say, then presumably demand is growing 300% per year too, because if you deploy this bandwidth and the demand doesn't keep up with it, it's kind of crazy. So my premise is that we will simply overprovision. That's how we will deal with quality of service. We will in fact keep a year ahead of demand and the network will have 30% utilization. Now that is a very unpopular view, that is to say there are a lot of people who say no no no no no — that won't work. But the fact is that I don't think that quality of service guarantees are going to work either, because they go back to the circuit switching model that we had in the old days, they give you extremely low utilization. So I personally have a T1 line coming out of our lab. Its utilization is 1%. I would be delighted to have a T3 line and pay the T1 price, and sometimes be denied service. I know there are emergency-care units who don't feel that way, and there are people who want an absolute guarantee, and for those people they will want to have a circuit-switched guarantee, bandwidth reserved. But fundamentally, if you reserve the bandwidth, you've reserved the bandwidth, and you can't let other people sneak in there. This is a very very controversial area, the disk arms thing is comparable to it, and I think the solution is just overprovisioning.