Dynamic Power Management: A Quantitative Approach
by Johan De Gelas on January 18, 2010 2:00 AM EST - Posted in IT Computing
Limitations
First of all, let's discuss the limitations of this review. The benchmark we used allowed us to control the number of threads very accurately, but it is not a real-world benchmark for most IT professionals. The fact that it is an integer-dominated benchmark gives it some relevance, but it is still not ideal. In our next article we will use MS SQL Server 2008 R2, which will let us measure power efficiency at a given performance level, a metric that is much more relevant than pure performance/watt. Also missing is the low power six-core Opteron 2419 EE; this CPU arrived in the labs just as we finished this article, so expect an update soon.
"Academic" Conclusion
The days when dynamic frequency scaling offered significant power savings are over. The reason is that voltage can only be lowered if the complete package is scaled down to a lower clock. In that case the power savings are considerable (P ~ V²), but we did not encounter that situation very often. Instead, both AMD and Intel favor the strategy of placing idle cores in higher C-states. The most important power savings come from fine-grained clock gating, from placing cores in a completely clock-gated C-state (AMD's Smart Fetch + C1), or better still, placing them in a power-gated state (Intel's power gating into deep C6 sleep).
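To put rough numbers on this: dynamic power scales approximately with C*V^2*f. The short Python sketch below uses purely illustrative voltage and frequency values (not measured ones) to show why lowering the clock alone saves far less than lowering clock and voltage together, and why a power-gated core beats both:

# Rough estimate of relative dynamic power: P ~ C * V^2 * f.
# The voltage/frequency numbers are illustrative assumptions, not measurements.

def relative_power(voltage_v, freq_ghz, v_nominal=1.2, f_nominal=2.6):
    """Dynamic power relative to the nominal operating point."""
    return (voltage_v / v_nominal) ** 2 * (freq_ghz / f_nominal)

# Frequency scaling alone: clock halved, voltage unchanged.
print(relative_power(1.2, 1.3))   # ~0.50 of nominal dynamic power

# Frequency plus voltage scaling: the whole package drops to a lower P-state.
print(relative_power(0.9, 1.3))   # ~0.28 of nominal dynamic power

# Power gating (deep C6): the idle core is cut off from the supply entirely,
# so its dynamic and leakage power drop to essentially zero.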
Practical Conclusions
Windows 2008 makes you choose between the Balanced and Performance power plans. If your application idles most of the time and you are heavily power constrained, Balanced is always the right choice. In all other cases, we would advise using the "Performance" plan for the Opterons. For some reason, the CPU driver does not deliver the performance that is demanded: with Balanced, when you ask for 25% of the total CPU performance, you get something like 15% to 20%. The result is that you get up to 25% less performance than the CPU delivers in "Performance" mode, without significant power savings. That's not good. We can already reveal that we saw response time increases in MS SQL Server 2008 due to this phenomenon. It is also worth noting that our new measurements confirm that the performance/watt ratio of the six-core Opterons is significantly better than that of the quad-core Opterons.
The Xeons are a different story. For the normal 95W Xeons it makes sense to run in Balanced mode: the "base" performance is excellent, and Turbo Boost adds a bit of performance but also quite a bit of power. Ideally, it should be possible to run in Balanced mode and still use Turbo Boost when your application is performing a single-threaded batch operation, but unfortunately this is not possible with the default Windows 2008 settings.
For the low power Xeons, it is different. Those CPUs run closer to their specified TDP limit and will rarely engage Turbo Boost once they are loaded at 25% or more. If your application is limited by regular single-threaded batch operations, it makes a lot of sense to choose the Performance plan. Turbo Boost pays off in that case: the clock speed is raised from a meager 1.86GHz to an impressive 3.2GHz. As Xeons based on the "Nehalem" architecture place idle cores in C6 very quickly, Performance mode hardly consumes more power than Balanced mode. As we have shown, frequency scaling does not save much power, since most of the idle cores are power gated automatically. This aggressive "go to C6 sleep" policy allows the architecture with the highest IPC in the industry to morph into a high performance server CPU with modest power consumption. There is a huge difference between this CPU in a machine where it is pushed towards 100% load and in a server where it hovers between 20% and 70% load most of the time. The latter situation allows the CPU to put cores in C6 mode for a significant amount of time, and as a result the power savings in a server environment are nothing short of impressive.
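If you want to switch plans from a script rather than the control panel, the sketch below is one way to do it; it assumes powercfg's built-in scheme aliases (SCHEME_BALANCED for Balanced, SCHEME_MIN for the high-performance plan) and administrative rights:

import subprocess

def set_power_plan(alias):
    # Activate a power plan by its powercfg alias or GUID.
    subprocess.run(["powercfg", "-setactive", alias], check=True)

def show_active_plan():
    # Print the currently active power plan.
    subprocess.run(["powercfg", "-getactivescheme"], check=True)

# Performance plan for the Opterons, or for low power Xeons that depend on
# Turbo Boost for single-threaded batch work; Balanced for mostly idle,
# power-constrained machines.
set_power_plan("SCHEME_MIN")        # high performance
# set_power_plan("SCHEME_BALANCED")
show_active_plan()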
Now that we understand the nuts and bolts, we are able to move on to our next question: How can we get the best power efficiency at a certain performance point? We will follow up with a power efficiency case study based on SQL Server 2008.
Comments
JohanAnandtech - Monday, January 18, 2010
In which utility do you set/manage the frequency of a separate core?
n0nsense - Monday, January 18, 2010
Gnome panel applets. The CPU frequency monitor, I guess it uses cpufreq. Each instance monitors one core, so I have 4 of them visible all the time. If you have enabled CPU frequency scaling (kernel), then you can select the governor (performance, ondemand, conservative, etc.) or a static frequency. I can do it for each core, and it displays what I have set. Of course the processor should support frequency scaling (PowerNow! and SpeedStep).
Most mainstream distributions (Ubuntu, Sabayon, Fedora) will use the ondemand governor by default when a processor with frequency scaling is available. No user intervention required.
jordanclock - Monday, January 18, 2010
I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad cores, CPU2 and CPU3) because the variables that control the clock speed are chip-wide.
VJ - Tuesday, January 19, 2010
These people seem to be convinced of per-core SpeedStep: https://bugs.launchpad.net/ubuntu/+source/linux-so...
Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
n0nsense - Monday, January 18, 2010
I heard that in the past, but I still tend to believe my eyes :) While writing this reply, I saw every possible combination. My Q9300 has 2 states, 2.0GHz and 2.5GHz. It's not a server CPU. I have no reason to mislead you.
VJ - Tuesday, January 19, 2010
If there are only two states, then it's possible that one core is in the C2 state while the other is in its C0 state. The core in state C2 may be shown to be operating at 2GHz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.
So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).
It would be very strange to see both cores operating at different frequencies that are each above their lowest and below their highest frequency.
From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.
n0nsense - Tuesday, January 19, 2010
Nice theory :) But in this case, I see that each core is doing something. htop shows each core at around 15% usage. So the only options left are:
1. Each core's frequency can be controlled independently on C2D and C2Q (maybe i3/i5/i7 too)
2. The OS is completely unaware of what's going on :) (which is less likely)
mino - Thursday, January 21, 2010
"The OS is completely unaware of what's going on" is the right answer. :)
BTW, the only x86 CPUs able to change frequency per core are >=K10 for AMD and >=Nehalem for Intel.
VJ - Tuesday, January 19, 2010
Not to defeat your argument/observations, rather for completeness' sake: it's also possible that the differences are due to how the attributes are read. If the attributes are read in succession, the differences may simply reflect the moment each attribute was read, while at any given instant, notwithstanding the allowable subtle frequency differences described in this article, all cores are operating at the same frequency.
There's a lot of time at the bottom.
JanR - Tuesday, January 19, 2010
Hi, I completely agree with this:
"It's also possible that the differences are due to the reading of the attributes."
The point is that desktop usage together with the ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario for deciding on "per core" vs "per CPU". We did a lot of testing the following way:
Put load on all cores using "taskset" (this avoids C-states). Switch to the "userspace" governor and then set the frequencies of individual cores differently. You have one control per core, but the actual hardware decides what really happens; you can check this in /proc/cpuinfo or with a tool such as "mhz" from lmbench as a load generator (it calculates the actual frequency based on CPI and time, and it also allows measurement of turbo frequencies).
Trying this out, the results are:
AMD K8: One clock domain, maximum of the requested frequencies is taken
Intel Core2 Duo: Same as K8
AMD K10: Individual clock domains, you can clock each core individually
Intel Core 2 Quad: TWO clock domains! These CPUs are two dual-core dies glued together, so each die has its own multiplier. Therefore, the cores of each die get the maximum of the requested frequencies, but you can clock the two dies independently.
Intel Nehalem: One clock domain, maximum of the requests of all cores that are not in a C-state! If you set one core to, e.g., 2.66 GHz and all others to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used; they all switch to 2.66 if you put load on that core.
So much for our findings. "cat /proc/cpuinfo" or similar tools are useless if you do not control the environment (userspace governor, manual settings). If you then enable ondemand, the system switches quickly between different states, and what you see is just a snapshot, maybe taken in the middle of a transition.
Greetings,
Jan
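For anyone who wants to repeat that test, here is a minimal sketch along the lines JanR describes, assuming a Linux machine whose cpufreq driver exposes the "userspace" governor through sysfs, run as root, with illustrative frequencies:

import glob

CPUFREQ = "/sys/devices/system/cpu/cpu{n}/cpufreq/{attr}"

def set_attr(cpu, attr, value):
    # Write a cpufreq attribute for one core (requires root).
    with open(CPUFREQ.format(n=cpu, attr=attr), "w") as f:
        f.write(str(value))

def report_frequencies():
    # /proc/cpuinfo prints one "cpu MHz" line per logical CPU.
    with open("/proc/cpuinfo") as f:
        mhz = [line.split(":")[1].strip() for line in f if line.startswith("cpu MHz")]
    for cpu, freq in enumerate(mhz):
        print("cpu{}: {} MHz".format(cpu, freq))

if __name__ == "__main__":
    cpus = sorted(int(p.rsplit("cpu", 1)[1])
                  for p in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"))
    for cpu in cpus:
        set_attr(cpu, "scaling_governor", "userspace")
    # Ask for different frequencies per core; the hardware decides what it honors.
    set_attr(cpus[0], "scaling_setspeed", 2660000)    # 2.66 GHz, in kHz
    for cpu in cpus[1:]:
        set_attr(cpu, "scaling_setspeed", 1600000)    # 1.60 GHz, in kHz
    report_frequencies()

Pinning a load generator on each core with taskset while this runs, and comparing the requested frequencies with what /proc/cpuinfo (or lmbench's mhz) reports, exposes the clock-domain behavior listed above.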