Watts and Faults explained by Google’s Luiz Barroso

I was in NY today and went to a Google speaker series talk by Luiz Barroso, Google Distinguished Engineer. The talk was “Watts, Faults, and Other Fascinating Dirty Words Computer Architects Can No Longer Afford to Ignore”.

The best part of the talk was that Luiz did what the Heath brothers so recommend in Made to Stick: he told a story. He told a story about the little guy overcoming the big guy. He told us at the beginning that this would be just like David and Goliath, like Seabiscuit. :) So who are these little guys who came over and said,

“Hey! Hey! Look at us! We’re important. Not only for computer design, but because we hear that these days you’re concerned about cost and reliability!!!! Look! Look! Look at us!!”

Two Little Guys

The two little guys are two newer research areas in computer design, and the two that are leaving the picture were popular in the 90s. In short, out with the MHz race (the race for ever-higher clock speeds) and out with the DSM race (the race for better distributed shared-memory machines). In with Mr. WATTS and Mr. FAULTS.

Meet Watts and Faults

Luiz gave us the big picture first, and showed how energy-inefficient computers are becoming. Specifically, he said, suppose that you’re getting a server and the cost to power it over its lifetime is much higher than the cost of the server hardware itself. Isn’t that a little strange? Shouldn’t you be a little worried? (Luiz mentioned that in an unlibertarian move the U.S. government is starting to be worried for you! On Dec 20, 2006, there was an act, in Congress or the House (who knows?), to research energy inefficiency in servers!!! Hallo! Since when is that the government’s business?)

Watts

In any case, suppose actually YOU are worried instead of big brother being worried for you. If you’re worried about energy inefficiency, you should know a couple of things that may make you even more worried! A computer that isn’t processing anything still uses HALF the power that it uses at full capacity: it goes from about 80W when idling to about 160W at full load. Luiz Barroso suggested that a good problem to solve may be how to get a server to use less power when it’s idling. This is the WATTS problem. A person, on the other hand, goes from about 60W when idling and not doing anything to about 1200W at full effort if that person is a serious athlete. Luiz says, “We are the energy equivalent of a three-year-old PC… or of a light bulb.” So a person’s power draw varies by a factor of 20, while a computer’s varies by only a factor of 2. Can we get computers to do the same? Can we get computers to use 1/10 of the power at idle that they use at full load?
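To make the arithmetic concrete, here is a minimal Python sketch using the figures quoted in the talk. The straight-line model between idle and peak power in `power_at` is my own simplifying assumption, not something Luiz presented, and the 20% utilization value is purely illustrative.

```python
# Figures quoted in the talk, in watts.
SERVER_IDLE_W, SERVER_PEAK_W = 80.0, 160.0
HUMAN_IDLE_W, HUMAN_PEAK_W = 60.0, 1200.0


def idle_fraction(idle_w, peak_w):
    """Share of peak power that is still drawn while doing nothing."""
    return idle_w / peak_w


def power_at(utilization, idle_w, peak_w):
    """ASSUMPTION: power rises linearly from idle to peak as utilization goes 0..1."""
    return idle_w + (peak_w - idle_w) * utilization


server_idle = idle_fraction(SERVER_IDLE_W, SERVER_PEAK_W)  # 0.5
human_idle = idle_fraction(HUMAN_IDLE_W, HUMAN_PEAK_W)     # 0.05

# Hypothetical server that idles at 1/10 of its peak, as the talk wishes for.
target_idle_w = SERVER_PEAK_W / 10

print(f"server idles at {server_idle:.0%} of peak, a person at {human_idle:.0%}")
for label, idle_w in (("today", SERVER_IDLE_W), ("wished-for", target_idle_w)):
    watts = power_at(0.2, idle_w, SERVER_PEAK_W)  # illustrative 20% utilization
    print(f"{label} server at 20% load: {watts:.1f} W")
```

Under that linear assumption, a lightly loaded server with today's idle floor draws well over half its peak power while doing a fifth of the work, which is why the idle number matters so much.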

Disk Drives

And then in come FAULTS. Luiz described the big, big problems you have if hard drives just fail on you. So Google now does a lot of monitoring through its System Health infrastructure. But even though “system health” for, say, all the hard drives and all the servers is being measured, is it possible to predict which disk drive may fail? Luiz and two colleagues researched this and presented the results as two papers in 2007.

Their conclusion? Faults are not individually predictable: you can’t predict well whether a particular disk drive will fail. But faults are somewhat predictable for a population of disk drives, because as the number of machines increases, it’s extremely unlikely that they’ll all fail at the same time. And, interestingly, temperature doesn’t have much to do with drive failure. The assumption had always been that the cooler the temperature, the better. Well, that’s not necessarily so important, Luiz and his colleagues found.
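One way to see why a population is easier to predict than a single drive is a quick back-of-the-envelope calculation. The sketch below is my own illustration, not the method from the papers: it assumes every drive fails independently with the same probability, and the 5% failure probability is a hypothetical number, not a figure from the talk.

```python
import math

P_FAIL = 0.05  # HYPOTHETICAL yearly failure probability, not from the talk


def expected_failures(n_drives, p_fail):
    """Mean number of failed drives under the independence assumption."""
    return n_drives * p_fail


def relative_spread(n_drives, p_fail):
    """Standard deviation of the failure count divided by its mean (binomial model)."""
    std = math.sqrt(n_drives * p_fail * (1 - p_fail))
    return std / expected_failures(n_drives, p_fail)


for n in (1, 100, 10_000, 1_000_000):
    mean = expected_failures(n, P_FAIL)
    spread = relative_spread(n, P_FAIL)
    print(f"{n:>9} drives: expect {mean:,.2f} failures, relative spread {spread:.3f}")
```

For one drive the uncertainty is several times larger than the expected value, so you really can't say whether it will fail. For a large fleet the relative spread shrinks with the square root of the fleet size, so the total number of failures becomes fairly predictable even though no individual failure is.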

So, in summary, WATTS are useful to think about because you can significantly cut your company’s costs if you can cut how much energy you use, and FAULTS are important to think about because even though you can’t predict them for individual drives right now, maybe there will be new methods in the future to predict disk drive faults.