* Dynamic configure max_cstate @ 2009-07-27 5:30 Zhang, Yanmin 2009-07-27 7:33 ` Andreas Mohr 2009-07-28 19:47 ` Len Brown 0 siblings, 2 replies; 21+ messages in thread From: Zhang, Yanmin @ 2009-07-27 5:30 UTC (permalink / raw) To: LKML, linux-acpi; +Cc: yakui_zhao

When running a fio workload, I found that the cpu C state sometimes has a big impact on the result. fio is mostly a disk I/O workload which doesn't spend much time on the cpu, so the cpu switches to C2/C3 frequently and the exit latency is large. If I boot the kernel with idle=poll or processor.max_cstate=1, the result is quite good.

Consider a scenario where a machine is busy in the daytime and free at night. Could we add a dynamic configuration interface for processor.max_cstate, or something similar, via sysfs, so user applications could change the max_cstate dynamically? For example, we could add a new parameter to cpuidle_governor->select to mark the highest allowed C state.

Any idea?

Yanmin

^ permalink raw reply [flat|nested] 21+ messages in thread
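For reference, kernels with the PM QoS framework (merged around 2.6.25) already expose a runtime knob close to what is asked for here: a process can cap the acceptable C-state exit latency by holding /dev/cpu_dma_latency open with a binary 32-bit microsecond value written to it; the request is dropped when the fd is closed. A minimal userspace sketch (the helper names are mine, for illustration):

```python
import os
import struct

def encode_latency_us(us):
    """Encode a microsecond latency bound as the 32-bit
    little-endian value /dev/cpu_dma_latency expects."""
    return struct.pack("<i", us)

def hold_latency_request(us, dev="/dev/cpu_dma_latency"):
    """Open the PM QoS device and write a latency request.
    The kernel honors the bound only while the fd stays open,
    so the caller must keep the returned fd for the duration
    of the latency-sensitive workload (e.g. a nightly I/O run)."""
    fd = os.open(dev, os.O_WRONLY)
    os.write(fd, encode_latency_us(us))
    return fd  # os.close(fd) restores deep C states

# e.g. fd = hold_latency_request(10)  # permit only shallow C states
```

This gives the "busy at daytime, free at night" scenario a userspace policy hook without any per-driver kernel changes.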
* Re: Dynamic configure max_cstate 2009-07-27 5:30 Dynamic configure max_cstate Zhang, Yanmin @ 2009-07-27 7:33 ` Andreas Mohr 2009-07-28 2:42 ` Zhang, Yanmin 2009-07-29 0:17 ` Len Brown 2009-07-28 19:47 ` Len Brown 1 sibling, 2 replies; 21+ messages in thread From: Andreas Mohr @ 2009-07-27 7:33 UTC (permalink / raw) To: Zhang, Yanmin; +Cc: LKML, linux-acpi

Hi,

> When running a fio workload, I found that the cpu C state sometimes
> has a big impact on the result. fio is mostly a disk I/O workload
> which doesn't spend much time on the cpu, so the cpu switches to
> C2/C3 frequently and the exit latency is large.

Rather than inventing ways to limit ACPI Cx state usefulness, we should perhaps be thinking about what's wrong here.

And your complaint might just fit into a thought I had recently: are we actually taking ACPI Cx exit latency into account for timers?

If we program a timer to fire at some point, it is quite imaginable that any ACPI Cx exit latency, due to the CPU being idle at that moment, could add significantly to the actual timer trigger time.

To combat this, one would need to tweak the timer expiration time to include the exit latency. But of course, once the CPU is running again, one would need to re-add the latency amount (read: reprogram the timer hardware, ugh...) to prevent the timer from firing too early.

Given that one would need to reprogram timer hardware quite often, I don't know whether taking Cx exit latency into account is feasible. OTOH, analysis of the single next timer value and the actual hardware reprogramming would each have to be done only once (in the ACPI sleep and wake paths), so it might just turn out to be very beneficial after all (minus prolonging ACPI Cx path activity and thus hurting CPU power savings, of course).

Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an article.

OTOH, even 185us is only 0.185ms, which, compared to disk seek latency (still around 7ms, except for SSDs), doesn't seem to be all that much. Or what kind of ballpark figure do you have for the percentage of I/O deterioration? I'm wondering whether we might have an even bigger problem with disk I/O related to this than just the raw ACPI exit latency value itself.

Andreas Mohr
* Re: Dynamic configure max_cstate 2009-07-27 7:33 ` Andreas Mohr @ 2009-07-28 2:42 ` Zhang, Yanmin 2009-07-28 7:20 ` Corrado Zoccolo 2009-07-29 0:17 ` Len Brown 1 sibling, 1 reply; 21+ messages in thread From: Zhang, Yanmin @ 2009-07-28 2:42 UTC (permalink / raw) To: Andreas Mohr; +Cc: LKML, linux-acpi

On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> perhaps be thinking about what's wrong here.

Andreas,

Thanks for your kind comments.

> And your complaint might just fit into a thought I had recently:
> are we actually taking ACPI Cx exit latency into account for timers?

I tried both tickless and non-tickless kernels. The results are similar.

Originally, I also thought it was related to timers. As you know, the block I/O layer has many timers, but such timers don't normally expire. For example, an I/O request is submitted to the driver, the driver delivers it to the disk, and the hardware raises an interrupt after finishing the I/O. Mostly the I/O submission and the interrupt, not the timer, drive the I/O.

> OTOH, even 185us is only 0.185ms, which, compared to disk seek
> latency (still around 7ms, except for SSDs), doesn't seem to be all
> that much. Or what kind of ballpark figure do you have for the
> percentage of I/O deterioration?

I have lots of FIO sub test cases which test I/O on a single disk and on JBOD (a disk box which mostly has 12~13 disks) on Nehalem machines. Your analysis of disk seeks is reasonable. I found that sequential buffered read has the worst regression, while random read fares far better. For example, I start 12 processes per disk, every disk has 24 1-GB files, and there are 12 disks. The sequential read fio result is about 593MB/s with idle=poll and about 375MB/s without idle=poll. The read block size is 4KB.

Another example is a single fio direct sequential read (block size 4KB) on a single SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB/s with idle=poll.

How did I find that C states impact disk I/O results? Frankly, I found a regression between kernels 2.6.27 and 2.6.28. Bisection located a nonstop-tsc patch, but the patch itself is quite good; it changes the default clocksource from hpet to tsc. Then I tried all clocksources and got the best result with the acpi_pm clocksource, but oprofile data shows acpi_pm has higher cpu utilization. The jiffies clocksource has the worst result but the least cpu utilization. As you know, fio calls gettimeofday frequently. Then I tried the boot parameters processor.max_cstate and idle=poll; I get a similar result with processor.max_cstate=1 as with idle=poll.

I also ran the testing on 2 Stoakley machines and didn't find such issues. /proc/acpi/processor/CPUXXX/power shows the Stoakley cpus only have C1.

> I'm wondering whether we might have an even bigger problem with disk I/O
> related to this than just the raw ACPI exit latency value itself.

We might have. I'm still doing more testing. With Venki's tool (which reads/writes MSR registers), I collected some C state switch statistics.

Current cpuidle gives good consideration to cpu utilization, but none to devices. So with an I/O-delivery-and-interrupt-driven model with little cpu utilization, performance might be hurt if C state exit has a long latency.

Yanmin

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
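The single-SATA-disk numbers above put a rough bound on the per-request cost. A back-of-the-envelope check (my own arithmetic, not from the thread): at 4KB per request, the 32.5MB/s vs 28MB/s difference corresponds to roughly 20us of extra time per request, which is comparable in magnitude to a partial C-state exit on each I/O completion:

```python
BLOCK = 4 * 1024            # bytes per request (4KB reads)
MB = 1000 * 1000            # fio reports decimal MB/s

def us_per_request(mb_per_s):
    """Average time per 4KB request at a given throughput, in us."""
    return BLOCK / (mb_per_s * MB) * 1e6

t_poll = us_per_request(32.5)    # with idle=poll
t_cstate = us_per_request(28.0)  # with default C states
extra = t_cstate - t_poll        # extra time per request

print(round(t_poll, 1), round(t_cstate, 1), round(extra, 1))
# → 126.0 146.3 20.3
```

Whether those ~20us are all wakeup latency or partly governor/ACPI code-path overhead is exactly the question raised later in the thread.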
* Re: Dynamic configure max_cstate 2009-07-28 2:42 ` Zhang, Yanmin @ 2009-07-28 7:20 ` Corrado Zoccolo 2009-07-28 9:00 ` Zhang, Yanmin 2009-07-28 19:25 ` Len Brown 0 siblings, 2 replies; 21+ messages in thread From: Corrado Zoccolo @ 2009-07-28 7:20 UTC (permalink / raw) To: Zhang, Yanmin; +Cc: Andreas Mohr, LKML, linux-acpi

Hi,

On Tue, Jul 28, 2009 at 4:42 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

[earlier quoted discussion snipped]

> How did I find that C states impact disk I/O results? Frankly, I found
> a regression between kernels 2.6.27 and 2.6.28. Bisection located a
> nonstop-tsc patch, but the patch itself is quite good; it changes the
> default clocksource from hpet to tsc. Then I tried all clocksources
> and got the best result with the acpi_pm clocksource, but oprofile
> data shows acpi_pm has higher cpu utilization. The jiffies clocksource
> has the worst result but the least cpu utilization. As you know, fio
> calls gettimeofday frequently.
> Then, I tried the boot parameters processor.max_cstate and idle=poll.
> I get a similar result with processor.max_cstate=1 as with idle=poll.

Is it possible that the different bandwidth figures are due to incorrect timing, rather than C-state latencies? Entering a deep C state can do strange things to timers: some of them, especially tsc, become unreliable. Maybe the patch you found, which re-enables tsc, is actually wrong for your machine, i.e. tsc is unreliable there in deep C states.

> I also ran the testing on 2 Stoakley machines and didn't find such
> issues. /proc/acpi/processor/CPUXXX/power shows the Stoakley cpus only
> have C1.
>
> We might have. I'm still doing more testing. With Venki's tool (which
> reads/writes MSR registers), I collected some C state switch statistics.

You can see the latencies (expressed in us) on your machine with:

[root@localhost corrado]# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
0
0
1
133

Can you post your numbers, to see if they are unusually high?

> Current cpuidle gives good consideration to cpu utilization, but none
> to devices. So with an I/O-delivery-and-interrupt-driven model with
> little cpu utilization, performance might be hurt if C state exit has
> a long latency.
>
> Yanmin

--
__________________________________________________________________________
dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
* Re: Dynamic configure max_cstate 2009-07-28 7:20 ` Corrado Zoccolo @ 2009-07-28 9:00 ` Zhang, Yanmin 2009-07-28 10:11 ` Andreas Mohr 2009-07-30 6:28 ` Zhang, Yanmin 2009-07-28 19:25 ` Len Brown 1 sibling, 2 replies; 21+ messages in thread From: Zhang, Yanmin @ 2009-07-28 9:00 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Andreas Mohr, LKML, linux-acpi

On Tue, 2009-07-28 at 09:20 +0200, Corrado Zoccolo wrote:

[earlier quoted discussion snipped]
> > Then, I tried the boot parameters processor.max_cstate and idle=poll.
> > I get a similar result with processor.max_cstate=1 as with idle=poll.
>
> Is it possible that the different bandwidth figures are due to
> incorrect timing, rather than C-state latencies?

I'm not sure.

> Entering a deep C state can do strange things to timers: some of
> them, especially tsc, become unreliable. Maybe the patch you found,
> which re-enables tsc, is actually wrong for your machine, i.e. tsc is
> unreliable there in deep C states.

I'm using an SDV machine, not an official product, but it would be rare for cpuid to report the non-stop tsc feature without actually supporting it.

I tried different clocksources. For example, I could get a better (30%) result with hpet. With hpet, cpu utilization is about 5~8%; the function hpet_read uses too much cpu time. With tsc, cpu utilization is about 2~3%. I think higher cpu utilization causes fewer C state transitions.

With idle=poll, the result is about 10% better than the hpet one. With idle=poll, I didn't find any result difference among the different clocksources.

> > We might have. I'm still doing more testing. With Venki's tool
> > (which reads/writes MSR registers), I collected some C state switch
> > statistics.
> You can see the latencies (expressed in us) on your machine with:
> [root@localhost corrado]# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> 0
> 0
> 1
> 133
>
> Can you post your numbers, to see if they are unusually high?

[ymzhang@lkp-ne02 ~]$ cat /proc/acpi/processor/CPU0/power
active state:            C0
max_cstate:              C8
maximum allowed latency: 2000000000 usec
states:
    C1:  type[C1] promotion[--] demotion[--] latency[003] usage[00001661] duration[00000000000000000000]
    C2:  type[C3] promotion[--] demotion[--] latency[205] usage[00000687] duration[00000000000000732028]
    C3:  type[C3] promotion[--] demotion[--] latency[245] usage[00011509] duration[00000000000115186065]

[ymzhang@lkp-ne02 ~]$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
0
3
205
245
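The two views agree: the latency[...] fields in the /proc/acpi output match the sysfs cpuidle values of 3/205/245 us. A small sketch that cross-checks this mechanically (parsing the /proc text pasted above; function name is mine):

```python
import re

def cstate_latencies(power_text):
    """Extract {state: latency_us} pairs from the per-state lines of
    /proc/acpi/processor/CPU*/power output."""
    pairs = re.findall(r"(C\d):\s+type\[C\d\].*?latency\[(\d+)\]", power_text)
    return {state: int(lat) for state, lat in pairs}

sample = """
    C1:  type[C1] promotion[--] demotion[--] latency[003] usage[00001661]
    C2:  type[C3] promotion[--] demotion[--] latency[205] usage[00000687]
    C3:  type[C3] promotion[--] demotion[--] latency[245] usage[00011509]
"""

print(cstate_latencies(sample))  # → {'C1': 3, 'C2': 205, 'C3': 245}
```

So on this Nehalem box the deeper states cost ~205-245us to exit, noticeably above the 185us ballpark quoted earlier in the thread.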
* Re: Dynamic configure max_cstate 2009-07-28 9:00 ` Zhang, Yanmin @ 2009-07-28 10:11 ` Andreas Mohr 2009-07-28 14:03 ` Andreas Mohr ` (2 more replies) 2009-07-30 6:28 ` Zhang, Yanmin 1 sibling, 3 replies; 21+ messages in thread From: Andreas Mohr @ 2009-07-28 10:11 UTC (permalink / raw) To: Zhang, Yanmin; +Cc: Corrado Zoccolo, Andreas Mohr, LKML, linux-acpi

Hi,

On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote:
> I tried different clocksources. For example, I could get a better (30%)
> result with hpet. With hpet, cpu utilization is about 5~8%; the
> function hpet_read uses too much cpu time. With tsc, cpu utilization is
> about 2~3%. I think higher cpu utilization causes fewer C state
> transitions.
>
> With idle=poll, the result is about 10% better than the hpet one. With
> idle=poll, I didn't find any result difference among the different
> clocksources.

IOW, this seems to clearly point to ACPI Cx as the cause.

Both Corrado and I have been thinking that one should try skipping all bigger-latency ACPI Cx states whenever there's an ongoing I/O request for which an immediate reply interrupt is expected.

I've been investigating this a bit, and the interesting parts would perhaps include:
. kernel/pm_qos_params.c
. drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state structs as configured by drivers/acpi/processor_idle.c)
. and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c (or other sources in the case of other disk I/O mechanisms)

One way to do some quick (and dirty!!) testing would be to set a flag before calling wait_for_completion_timeout(), test for this flag in drivers/cpuidle/governors/menu.c, and then skip deeper Cx states conditionally.

As a very quick test, I tried a
  while :; do :; done
loop in a shell and reniced the shell to 19 (to keep my CPU out of ACPI idle), but bonnie -s 100 results initially looked promising yet turned out to be inconsistent. The real way to test this would be idle=poll. My test system was an Athlon XP with /proc/acpi/processor/CPU0/power latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.

If the wait_for_completion_timeout() flag testing turns out to help, then one might intend to use the pm_qos infrastructure to indicate these conditions; however, it might be too bloated for such a purpose, and a relatively simple (read: fast) boolean flag mechanism could be better.

Plus one could then create a helper function which figures out a "pretty fast" Cx state (independent of specific latency times!). But when introducing this mechanism, take care not to ignore the requirements defined by pm_qos settings!

Oh, and about the places which submit I/O requests where one would have to flag this: are they in any way correlated with the scheduler I/O wait value? Would the I/O wait mechanism be a place to more easily and centrally indicate that we're waiting for a request to come back "very soon"? OTOH, I/O requests may have vastly differing delay expectations, so only short-term expected I/O replies should be flagged; otherwise we're wasting lots of ACPI deep idle opportunities.

Andreas Mohr
* Re: Dynamic configure max_cstate 2009-07-28 10:11 ` Andreas Mohr @ 2009-07-28 14:03 ` Andreas Mohr 2009-07-28 17:35 ` ok, now would this be useful? (Re: Dynamic configure max_cstate) Andreas Mohr 2009-07-29 8:20 ` Dynamic configure max_cstate Zhang, Yanmin 2009-07-31 3:43 ` Robert Hancock 2 siblings, 1 reply; 21+ messages in thread From: Andreas Mohr @ 2009-07-28 14:03 UTC (permalink / raw) To: Andreas Mohr; +Cc: Zhang, Yanmin, Corrado Zoccolo, LKML, linux-acpi

On Tue, Jul 28, 2009 at 12:11:35PM +0200, Andreas Mohr wrote:
> As a very quick test, I tried a
>   while :; do :; done
> loop in a shell and reniced the shell to 19 (to keep my CPU out of ACPI
> idle), but bonnie -s 100 results initially looked promising yet turned
> out to be inconsistent. The real way to test this would be idle=poll.
> My test system was an Athlon XP with /proc/acpi/processor/CPU0/power
> latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.

OK, I just tested it properly. Rebooted, did 5 bonnie -s 100 runs with ACPI idle, rebooted, and did another 5 bonnie -s 100 runs with idle=poll. Results (ACPI idle runs above the ======== separator, idle=poll runs below):

$ cat bonnie_ACPI_* /tmp/line bonnie_poll_*
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine    MB K/sec %CPU K/sec  %CPU K/sec %CPU K/sec %CPU K/sec  %CPU  /sec    %CPU
1*        100 20084 95.3 19037   9.5 12286  4.7 18074 99.6 581752  96.6 28792.3  93.6
1*        100 19235 93.5 24591  11.8 13916  4.3 17934 99.8 604429 100.3 27993.8  98.0
1*        100 17221 86.3 30591  16.1 15404  5.4 18689 99.3 593296  92.7 28146.0  98.5
1*        100 20254 99.3 110095 55.9 15722  6.1 17901 99.5 601185  99.8 28675.5 100.4
1*        100 18274 88.5 106909 53.2 10614  4.1 18759 99.7 598833  99.4 28461.6  92.5
========
1*        100 15274 98.2 20206   9.7 17286  7.3 18055 99.4 608112 101.0 28424.0  99.5
1*        100 20545 99.1 25332  12.6 16392  6.1 17957 99.4 606706 100.7 27906.8  90.7
1*        100 20482 99.2 30907  13.6 17585  6.2 17867 99.1 608090 101.0 27919.1  97.7
1*        100 20863 99.4 138383 66.2 18945  7.6 17938 99.5 581421  96.5 27094.6  94.8
1*        100 20821 98.8 156821 70.4 11536  4.4 18747 99.0 603556 100.2 27677.8  96.9

And these values
(cumulative) result in:

             ACPI       poll
Per Char     95068      97985      +3.06%
Block        291223     371649     +27.62%
Rewrite      67942      81744      +20.31%
Per Char     91357      90564      -0.87%
Block        2979495    3007885    +0.95%
RndSeek      142069.2   139022.3   -2.1%

average: +8.16%

Now the question is how much is due to idle state entry/exit latency and how much is due to ACPI idle/wakeup code path execution. Still, an average of +8.16% over 5 test runs each should be quite some incentive, and once there's proper "idle latency skipping during expected I/O replies", even with the idle/wakeup code path reinstated, we should hopefully be able to keep some 5% improvement in disk access.

Andreas Mohr
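The percentage column follows directly from the two cumulative totals. A quick recomputation of the figures quoted above (my own arithmetic; the mail rounds slightly differently in one or two places):

```python
totals = {                     # (ACPI, idle=poll) cumulative bonnie totals
    "Per Char (out)": (95068, 97985),
    "Block (out)":    (291223, 371649),
    "Rewrite":        (67942, 81744),
    "Per Char (in)":  (91357, 90564),
    "Block (in)":     (2979495, 3007885),
    "RndSeek":        (142069.2, 139022.3),
}

def pct_change(acpi, poll):
    """Relative change of idle=poll vs ACPI idle, in percent."""
    return (poll - acpi) / acpi * 100.0

changes = {k: pct_change(*v) for k, v in totals.items()}
average = sum(changes.values()) / len(changes)

for name, pct in changes.items():
    print(f"{name:14s} {pct:+6.2f}%")
print(f"average        {average:+6.2f}%")   # ≈ +8.16%, matching the mail
```

Note the average is an unweighted mean over six metric categories, so the large sequential-block gain dominates it.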
* ok, now would this be useful? (Re: Dynamic configure max_cstate) 2009-07-28 14:03 ` Andreas Mohr @ 2009-07-28 17:35 ` Andreas Mohr 0 siblings, 0 replies; 21+ messages in thread From: Andreas Mohr @ 2009-07-28 17:35 UTC (permalink / raw) To: Andreas Mohr; +Cc: Zhang, Yanmin, Corrado Zoccolo, LKML, linux-acpi

On Tue, Jul 28, 2009 at 04:03:08PM +0200, Andreas Mohr wrote:
> Still, an average of +8.16% over 5 test runs each should be quite some
> incentive, and once there's proper "idle latency skipping during
> expected I/O replies", even with the idle/wakeup code path reinstated,
> we should hopefully be able to keep some 5% improvement in disk access.

I went ahead and created a small and VERY dirty test for this.

In kernel/pm_qos_params.c I added:

static bool io_reply_is_expected;

bool io_reply_expected(void)
{
	return io_reply_is_expected;
}
EXPORT_SYMBOL_GPL(io_reply_expected);

void set_io_reply_expected(bool expected)
{
	io_reply_is_expected = expected;
}
EXPORT_SYMBOL_GPL(set_io_reply_expected);

Then in drivers/ata/libata-core.c I added:

extern void set_io_reply_expected(bool expected);

and updated it to:

	set_io_reply_expected(1);
	rc = wait_for_completion_timeout(&wait, msecs_to_jiffies(timeout));
	set_io_reply_expected(0);

	ata_port_flush_task(ap);

Then I changed drivers/cpuidle/governors/menu.c (make sure you're using the menu governor!) to use:

extern bool io_reply_expected(void);

and updated:

	if (io_reply_expected())
		data->expected_us = 10;
	else {
		/* determine the expected residency time */
		data->expected_us =
			(u32) ktime_to_ns(tick_nohz_get_sleep_length()) / 1000;
	}

Rebuilt, rebootloadered ;), rebooted, and then booting and disk operation _seemed_ to be snappier (I'm damn sure the hdd seek noise is a bit higher-pitched ;). And it's exactly seeks which should be shorter-intervalled now, since the system triggers a hdd operation and then is forced to wait (idle) until the seeking is done.

bonnie test results (of the patched kernel vs.
a kernel with set_io_reply_expected() muted) seem to support this, but a "time make bzImage" (on a freshly rebooted box each time) showed inconsistent results again, and a much higher sample count (with reboots each time) would be needed to really confirm this. I'd expect improvements to be in the 3% to 4% range at most, but still, compared to the yield of other kernel patches, that ain't nothing.

Now the question becomes whether one should implement such an improvement, and especially how. Perhaps the io reply decision making should be folded into the tick_nohz_get_sleep_length() function (or rather, one should create a higher-level "expected sleep length" function which consults both tick_nohz_get_sleep_length() and the io reply mechanism).

Another important detail is that my current hack completely ignores per-cpu operation and thus causes suboptimal power savings on _all_ cpus, not just the one waiting for the I/O reply (i.e., we should properly take into account the cpu affinity settings of the reply interrupt).

And of course it would probably be best to create a mechanism which stores a record of the average responsiveness delays of various block devices and then derives a maximum idle wakeup latency value to request from this.

Does anyone else have thoughts on this, or benchmark numbers which would support it?

Andreas Mohr
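That last idea lends itself to a simple sketch: keep a per-device running average of observed completion delays and tolerate an idle wakeup latency no larger than a fraction of it. This is illustrative userspace-style logic in Python, not kernel code; all names and the 10% budget are assumptions of mine:

```python
class DeviceLatencyTracker:
    """Tracks an exponential moving average of I/O completion
    delays for one block device, and derives the deepest idle
    wakeup latency worth tolerating while waiting on it."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha        # EMA smoothing factor
        self.avg_us = None        # average completion delay, us

    def record_completion(self, delay_us):
        if self.avg_us is None:
            self.avg_us = float(delay_us)
        else:
            self.avg_us += self.alpha * (delay_us - self.avg_us)

    def max_wakeup_latency_us(self, budget=0.1):
        """Allow C-state exit latency to eat at most `budget`
        (10% by default) of the average completion delay."""
        if self.avg_us is None:
            return 0              # no data yet: stay shallow
        return int(self.avg_us * budget)

t = DeviceLatencyTracker()
for d in (120, 140, 130, 125):    # observed 4KB-read delays, us
    t.record_completion(d)
print(t.max_wakeup_latency_us())  # → 12
```

With the ~125us per-request times seen on the SATA disk earlier in the thread, such a policy would permit C1 (3us) but veto C2/C3 (205/245us) while an I/O reply is pending, which is exactly the behavior the flag hack approximates.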
* Re: Dynamic configure max_cstate 2009-07-28 10:11 ` Andreas Mohr 2009-07-28 14:03 ` Andreas Mohr @ 2009-07-29 8:20 ` Zhang, Yanmin 2009-07-31 3:43 ` Robert Hancock 2 siblings, 0 replies; 21+ messages in thread From: Zhang, Yanmin @ 2009-07-29 8:20 UTC (permalink / raw) To: Andreas Mohr; +Cc: Corrado Zoccolo, LKML, linux-acpi On Tue, 2009-07-28 at 12:11 +0200, Andreas Mohr wrote: > Hi, > > On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote: > > I tried different clocksources. For exmaple, I could get a better (30%) result with > > hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu > > time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer > > C state transitions. > > > > With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll, > > I didn't find result difference among different clocksources. > > IOW, this seems to clearly point to ACPI Cx causing it. > > Both Corrado and me have been thinking that one should try skipping all > bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an > immediate reply interrupt is expected. That's a good idea. > > I've been investigating this a bit, and interesting parts would perhaps include > . kernel/pm_qos_params.c > . drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state > structs as configured by drivers/acpi/processor_idle.c) > . and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c > (or other sources in case of other disk I/O mechanisms) > > One way to do some quick (and dirty!!) testing would be to set a flag > before calling wait_for_completion_timeout() and testing for this flag in > drivers/cpuidle/governors/menu.c and then skip deeper Cx states > conditionally. 
>
> As a very quick test, I tried a
> 	while :; do :; done
> loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle),
> but bonnie -s 100 results initially looked promising yet turned out to
> be inconsistent. The real way to test this would be idle=poll.
> My test system was Athlon XP with /proc/acpi/processor/CPU0/power
> latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.
>
> If the wait_for_completion_timeout() flag testing turns out to help,
> then one might intend to use the pm_qos infrastructure to indicate
> these conditions, however it might be too bloated for such a
> purpose, a relatively simple (read: fast) boolean flag mechanism
> could be better.
>
> Plus one could then create a helper function which figures out a
> "pretty fast" Cx state (independent of specific latency times!).
> But when introducing this mechanism, take care to not ignore the
> requirements defined by pm_qos settings!
>
> Oh, and about the places which submit I/O requests where one would have to
> flag this: are they in any way correlated with the scheduler I/O wait
> value? Would the I/O wait mechanism be a place to more easily and centrally
> indicate that we're waiting for a request to come back in "very soon"?
> OTOH I/O requests may have vastly differing delay expectations,
> thus specifically only short-term expected I/O replies should be flagged,
> otherwise we're wasting lots of ACPI deep idle opportunities.

Another issue is that we might submit an I/O request on cpu A while the corresponding interrupt is sent to cpu B, which is common. The softirq on cpu B would then send an IPI to cpu A to schedule the waiting process there to finish the I/O.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-28 10:11 ` Andreas Mohr 2009-07-28 14:03 ` Andreas Mohr 2009-07-29 8:20 ` Dynamic configure max_cstate Zhang, Yanmin @ 2009-07-31 3:43 ` Robert Hancock 2009-07-31 7:06 ` Zhang, Yanmin 2 siblings, 1 reply; 21+ messages in thread From: Robert Hancock @ 2009-07-31 3:43 UTC (permalink / raw) To: Andreas Mohr; +Cc: Zhang, Yanmin, Corrado Zoccolo, LKML, linux-acpi On 07/28/2009 04:11 AM, Andreas Mohr wrote: > Hi, > > On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote: >> I tried different clocksources. For exmaple, I could get a better (30%) result with >> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu >> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer >> C state transitions. >> >> With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll, >> I didn't find result difference among different clocksources. > > IOW, this seems to clearly point to ACPI Cx causing it. > > Both Corrado and me have been thinking that one should try skipping all > bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an > immediate reply interrupt is expected. > > I've been investigating this a bit, and interesting parts would perhaps include > . kernel/pm_qos_params.c > . drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state > structs as configured by drivers/acpi/processor_idle.c) > . and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c > (or other sources in case of other disk I/O mechanisms) > > One way to do some quick (and dirty!!) testing would be to set a flag > before calling wait_for_completion_timeout() and testing for this flag in > drivers/cpuidle/governors/menu.c and then skip deeper Cx states > conditionally. 
> > As a very quick test, I tried a > while :; do :; done > loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle), > but bonnie -s 100 results initially looked promising yet turned out to > be inconsistent. The real way to test this would be idle=poll. > My test system was Athlon XP with /proc/acpi/processor/CPU0/power > latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2. > > If the wait_for_completion_timeout() flag testing turns out to help, > then one might intend to use the pm_qos infrastructure to indicate > these conditions, however it might be too bloated for such a > purpose, a relatively simple (read: fast) boolean flag mechanism > could be better. > > Plus one could then create a helper function which figures out a > "pretty fast" Cx state (independent of specific latency times!). > But when introducing this mechanism, take care to not ignore the > requirements defined by pm_qos settings! > > Oh, and about the places which submit I/O requests where one would have to > flag this: are they in any way correlated with the scheduler I/O wait > value? Would the I/O wait mechanism be a place to more easily and centrally > indicate that we're waiting for a request to come back in "very soon"? > OTOH I/O requests may have vastly differing delay expectations, > thus specifically only short-term expected I/O replies should be flagged, > otherwise we're wasting lots of ACPI deep idle opportunities. Did the results show a big difference in performance between maximum C2 and maximum C3? Thing with C3 is that it likely will have some interference with bus-master DMA activity as the CPU has to wake up at least partially before the SATA controller can complete DMA operations, which will likely stall the controller for some period of time. There would be an argument for avoiding going into deep C-states which can't handle snooping while IO is in progress and DMA will shortly be occurring.. 
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-31 3:43 ` Robert Hancock @ 2009-07-31 7:06 ` Zhang, Yanmin 2009-07-31 8:07 ` Andreas Mohr 0 siblings, 1 reply; 21+ messages in thread From: Zhang, Yanmin @ 2009-07-31 7:06 UTC (permalink / raw) To: Robert Hancock; +Cc: Andreas Mohr, Corrado Zoccolo, LKML, linux-acpi On Thu, 2009-07-30 at 21:43 -0600, Robert Hancock wrote: > On 07/28/2009 04:11 AM, Andreas Mohr wrote: > > Hi, > > > > On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote: > >> I tried different clocksources. For exmaple, I could get a better (30%) result with > >> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu > >> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer > >> C state transitions. > >> > >> With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll, > >> I didn't find result difference among different clocksources. > > > > IOW, this seems to clearly point to ACPI Cx causing it. > > > > Both Corrado and me have been thinking that one should try skipping all > > bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an > > immediate reply interrupt is expected. > > > > I've been investigating this a bit, and interesting parts would perhaps include > > . kernel/pm_qos_params.c > > . drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state > > structs as configured by drivers/acpi/processor_idle.c) > > . and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c > > (or other sources in case of other disk I/O mechanisms) > > > > One way to do some quick (and dirty!!) testing would be to set a flag > > before calling wait_for_completion_timeout() and testing for this flag in > > drivers/cpuidle/governors/menu.c and then skip deeper Cx states > > conditionally. 
> > > > As a very quick test, I tried a > > while :; do :; done > > loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle), > > but bonnie -s 100 results initially looked promising yet turned out to > > be inconsistent. The real way to test this would be idle=poll. > > My test system was Athlon XP with /proc/acpi/processor/CPU0/power > > latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2. > > > > If the wait_for_completion_timeout() flag testing turns out to help, > > then one might intend to use the pm_qos infrastructure to indicate > > these conditions, however it might be too bloated for such a > > purpose, a relatively simple (read: fast) boolean flag mechanism > > could be better. > > > > Plus one could then create a helper function which figures out a > > "pretty fast" Cx state (independent of specific latency times!). > > But when introducing this mechanism, take care to not ignore the > > requirements defined by pm_qos settings! > > > > Oh, and about the places which submit I/O requests where one would have to > > flag this: are they in any way correlated with the scheduler I/O wait > > value? Would the I/O wait mechanism be a place to more easily and centrally > > indicate that we're waiting for a request to come back in "very soon"? > > OTOH I/O requests may have vastly differing delay expectations, > > thus specifically only short-term expected I/O replies should be flagged, > > otherwise we're wasting lots of ACPI deep idle opportunities. > > Did the results show a big difference in performance between maximum C2 > and maximum C3? No big difference. I tried different max cstate by processor.max_cstate. Mostly, processor.max_cstate=1 could get the similiar result like idle=poll. 
> Thing with C3 is that it likely will have some > interference with bus-master DMA activity as the CPU has to wake up at > least partially before the SATA controller can complete DMA operations, > which will likely stall the controller for some period of time. There > would be an argument for avoiding going into deep C-states which can't > handle snooping while IO is in progress and DMA will shortly be occurring.. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-31 7:06 ` Zhang, Yanmin @ 2009-07-31 8:07 ` Andreas Mohr 2009-07-31 14:40 ` Andi Kleen 2009-07-31 15:14 ` Len Brown 0 siblings, 2 replies; 21+ messages in thread From: Andreas Mohr @ 2009-07-31 8:07 UTC (permalink / raw) To: Zhang, Yanmin Cc: Robert Hancock, Andreas Mohr, Corrado Zoccolo, LKML, linux-acpi Hi, On Fri, Jul 31, 2009 at 03:06:46PM +0800, Zhang, Yanmin wrote: > On Thu, 2009-07-30 at 21:43 -0600, Robert Hancock wrote: > > On 07/28/2009 04:11 AM, Andreas Mohr wrote: > > > Oh, and about the places which submit I/O requests where one would have to > > > flag this: are they in any way correlated with the scheduler I/O wait > > > value? Would the I/O wait mechanism be a place to more easily and centrally > > > indicate that we're waiting for a request to come back in "very soon"? > > > OTOH I/O requests may have vastly differing delay expectations, > > > thus specifically only short-term expected I/O replies should be flagged, > > > otherwise we're wasting lots of ACPI deep idle opportunities. > > > > Did the results show a big difference in performance between maximum C2 > > and maximum C3? > No big difference. I tried different max cstate by processor.max_cstate. > Mostly, processor.max_cstate=1 could get the similiar result like idle=poll. OK, but I'd say that this doesn't mean that we should implement a hard-coded mechanism which simply says "in such cases, don't do anything > C1". Instead we should strive for a far-reaching _generic_ mechanism which gathers average latencies of various I/O activities/devices and then uses some formula to determine the maximum (not necessarily ACPI) idle latency that we're willing to endure (e.g. average device I/O reply latency divided by 10 or so). And in addition to this, we should also take into account (read: skip) any idle states which kill busmaster DMA completely (in case of busmaster DMA I/O activities, that is). 
_Lots_ of very nice opportunities for improvement here, I'd say... (in the 5, 10 or even 40% range in the case of certain network I/O) Andreas ^ permalink raw reply [flat|nested] 21+ messages in thread
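A minimal sketch of the generic mechanism proposed above (all names are hypothetical, and the 1/8 EMA weight is an assumption in the spirit of the kernel's RTT-style estimators, not existing API): keep a running average of a device's I/O reply latency and derive the tolerable idle exit latency from it.

```c
#include <stdint.h>

struct io_latency_stats {
	uint64_t avg_reply_us;	/* running average of device reply latency */
};

/*
 * Record one observed I/O reply latency; exponential moving average
 * with weight 1/8 (avg = avg - avg/8 + sample/8).
 */
void io_latency_record(struct io_latency_stats *s, uint64_t reply_us)
{
	if (s->avg_reply_us == 0)
		s->avg_reply_us = reply_us;
	else
		s->avg_reply_us = s->avg_reply_us - (s->avg_reply_us >> 3)
				  + (reply_us >> 3);
}

/* "average device I/O reply latency divided by 10 or so" */
uint64_t max_idle_exit_latency_us(const struct io_latency_stats *s)
{
	return s->avg_reply_us / 10;
}
```

A governor could then refuse any C-state whose exit latency exceeds `max_idle_exit_latency_us()` while a request to that device is outstanding, subject to the pm_qos constraints already in place.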
* Re: Dynamic configure max_cstate 2009-07-31 8:07 ` Andreas Mohr @ 2009-07-31 14:40 ` Andi Kleen 2009-07-31 14:56 ` Michael S. Zick 2009-07-31 17:37 ` Pallipadi, Venkatesh 1 sibling, 2 replies; 21+ messages in thread From: Andi Kleen @ 2009-07-31 14:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Robert Hancock, Corrado Zoccolo, LKML, linux-acpi, venkatesh.pallipadi

Andreas Mohr <andi@lisas.de> writes:

> Instead we should strive for a far-reaching _generic_ mechanism
> which gathers average latencies of various I/O activities/devices
> and then uses some formula to determine the maximum (not necessarily ACPI)
> idle latency that we're willing to endure (e.g. average device I/O reply latency
> divided by 10 or so).

The interrupt heuristics in the menu cpuidle governor are already attempting this, based on interrupt rates (or rather wakeup rates), which are supposed to roughly correspond with IO rates and scheduling events together.

Apparently that doesn't work in this case. The challenge would be to find out why and improve the menu algorithm to deal with it. I doubt a completely new mechanism is needed or makes sense.

> And in addition to this, we should also take into account (read: skip)
> any idle states which kill busmaster DMA completely
> (in case of busmaster DMA I/O activities, that is)

This is already done.

-Andi

-- ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-31 14:40 ` Andi Kleen @ 2009-07-31 14:56 ` Michael S. Zick 2009-07-31 17:37 ` Pallipadi, Venkatesh 1 sibling, 0 replies; 21+ messages in thread From: Michael S. Zick @ 2009-07-31 14:56 UTC (permalink / raw) To: Andi Kleen Cc: Zhang, Yanmin, Robert Hancock, Corrado Zoccolo, LKML, linux-acpi, venkatesh.pallipadi On Fri July 31 2009, Andi Kleen wrote: > Andreas Mohr <andi@lisas.de> writes: > > > Instead we should strive for a far-reaching _generic_ mechanism > > which gathers average latencies of various I/O activities/devices > > and then uses some formula to determine the maximum (not necessarily ACPI) > > idle latency that we're willing to endure (e.g. average device I/O reply latency > > divided by 10 or so). > > The interrupt heuristics in the menu cpuidle governour are already > attempting this, based on interrupt rates (or rather > wakeup rates) which are supposed to roughly correspond with IO rates > and scheduling events together. > > Apparently that doesn't work in this case. The challenge would > be to find out why and improve the menu algorithm to deal with it. > I doubt a completely new mechanism is needed or makes sense. > > > And in addition to this, we should also take into account (read: skip) > > any idle states which kill busmaster DMA completely > > (in case of busmaster DMA I/O activities, that is) > > This is already done. > Almost - the VIA C7-M needs a bit of kernel command line help - - But should be easily fixable when I or one of the VIA support people GATI. (Bus snoops are only fully supported in C0 and C1 but idle=halt takes care of that.) Mike > -Andi ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Dynamic configure max_cstate 2009-07-31 14:40 ` Andi Kleen 2009-07-31 14:56 ` Michael S. Zick @ 2009-07-31 17:37 ` Pallipadi, Venkatesh 1 sibling, 0 replies; 21+ messages in thread From: Pallipadi, Venkatesh @ 2009-07-31 17:37 UTC (permalink / raw) To: Andi Kleen, Zhang, Yanmin Cc: Robert Hancock, Corrado Zoccolo, LKML, linux-acpi@vger.kernel.org, Len Brown >-----Original Message----- >From: Andi Kleen [mailto:andi@firstfloor.org] >Sent: Friday, July 31, 2009 7:40 AM >To: Zhang, Yanmin >Cc: Robert Hancock; Corrado Zoccolo; LKML; >linux-acpi@vger.kernel.org; Pallipadi, Venkatesh >Subject: Re: Dynamic configure max_cstate > >Andreas Mohr <andi@lisas.de> writes: > >> Instead we should strive for a far-reaching _generic_ mechanism >> which gathers average latencies of various I/O activities/devices >> and then uses some formula to determine the maximum (not >necessarily ACPI) >> idle latency that we're willing to endure (e.g. average >device I/O reply latency >> divided by 10 or so). > >The interrupt heuristics in the menu cpuidle governour are already >attempting this, based on interrupt rates (or rather >wakeup rates) which are supposed to roughly correspond with IO rates >and scheduling events together. > >Apparently that doesn't work in this case. The challenge would >be to find out why and improve the menu algorithm to deal with it. >I doubt a completely new mechanism is needed or makes sense. > Yes. cpuidle's attempt at guessing the interrupt rate is not working here. I got this test running on a test system here and looks like its not just the cpuidle that causes problems here. I am still collecting more data, but from what I have right now, this is what I see: - cpuidle and deep C-state usage reduces the performance here, as has been discussed in this thread. 
- cpufreq ondemand governor also has a problem with the workload, as it runs the CPU mostly at lower freq (as CPU utilization is hardly over 20%) and switching the cpus to high frequency increases the performance. - It also depends on where fio and the ahci interrupt handler runs. Looks like, for maximum performance, they have to run on different CPUs sharing the caches. So, getting this workload to give best performance by default will be a major challenge :-). Another thing that will be interesting to look at is performance/power for this workload and I haven't ventured into that territory yet. Thanks, Venki ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-31 8:07 ` Andreas Mohr 2009-07-31 14:40 ` Andi Kleen @ 2009-07-31 15:14 ` Len Brown 1 sibling, 0 replies; 21+ messages in thread From: Len Brown @ 2009-07-31 15:14 UTC (permalink / raw) To: Andreas Mohr Cc: Zhang, Yanmin, Robert Hancock, Corrado Zoccolo, LKML, linux-acpi, Arjan van de Ven

> And in addition to this, we should also take into account (read: skip)
> any idle states which kill busmaster DMA completely
> (in case of busmaster DMA I/O activities, that is).

It isn't so simple. This is system specific.

In the old days, a C3-type C-state would lock down the bus in order to assure no DMA could pass by to memory before the processor could wake up to snoop.

Then a few years ago the hardware would allow us to enter C3-type C-states, but transparently "pop up" into C2 to retire the snoop activity without ever waking the processor to C0. This was good b/c it was more efficient than waking to C0, but bad b/c the OS could not easily tell if it actually got the C3 time it requested, or if it was actually spending a bunch of time demoted to C2...

In the most recent hardware, the core's cache is flushed in deep C-states so that the core need not be woken at all to snoop DMA activity. Indeed, Yanmin's Nehalem box advertises two C3-type C-states, but in reality, Nehalem doesn't have _any_ C3-type C-states, only C2-type. The BIOS advertises C3-type C-states to not break the installed base, which uses the presence of C3-type C-states to work around the broken LAPIC timer.

I think the issue at hand on the system at hand is waking up the processor in response to IO interrupt break events. ie. Linux does a good job with timer interrupts, but isn't so smart about using IO interrupts for demoting C-states. Arjan is looking at fixing this.

cheers, Len Brown, Intel Open Source Technology Center

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-28 9:00 ` Zhang, Yanmin 2009-07-28 10:11 ` Andreas Mohr @ 2009-07-30 6:28 ` Zhang, Yanmin 1 sibling, 0 replies; 21+ messages in thread From: Zhang, Yanmin @ 2009-07-30 6:28 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Andreas Mohr, LKML, linux-acpi, Len Brown On Tue, 2009-07-28 at 17:00 +0800, Zhang, Yanmin wrote: > On Tue, 2009-07-28 at 09:20 +0200, Corrado Zoccolo wrote: > > Hi, > > On Tue, Jul 28, 2009 at 4:42 AM, Zhang, > > Yanmin<yanmin_zhang@linux.intel.com> wrote: > > > On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote: > > >> Hi, > > >> > > >> > When running a fio workload, I found sometimes cpu C state has > > >> > big impact on the result. Mostly, fio is a disk I/O workload > > >> > which doesn't spend much time with cpu, so cpu switch to C2/C3 > > >> > freqently and the latency is big. > > >> > > >> Rather than inventing ways to limit ACPI Cx state usefulness, we should > > >> perhaps be thinking of what's wrong here. > > > Andreas, > > > > > > Thanks for your kind comments. > > > > > >> > > >> And your complaint might just fit into a thought I had recently: > > >> are we actually taking ACPI Cx exit latency into account, for timers??? > > > I tried both tickless kernel and non-tickless kernels. The result is similiar. > > > > > > Originally, I also thought it's related to timer. As you know, I/O block layer > > > has many timers. Such timers don't expire normally. For example, an I/O request > > > is submitted to driver and driver delievers it to disk and hardware triggers > > > an interrupt after finishing I/O. Mostly, the I/O submit and interrupt, not > > > the timer, drive the I/O. > > > > > >> > > >> If we program a timer to fire at some point, then it is quite imaginable > > >> that any ACPI Cx exit latency due to the CPU being idle at that moment > > >> could add to actual timer trigger time significantly. 
> > >> > > >> To combat this, one would need to tweak the timer expiration time > > >> to include the exit latency. But of course once the CPU is running > > >> again, one would need to re-add the latency amount (read: reprogram the > > >> timer hardware, ugh...) to prevent the timer from firing too early. > > >> > > >> Given that one would need to reprogram timer hardware quite often, > > >> I don't know whether taking Cx exit latency into account is feasible. > > >> OTOH analysis of the single next timer value and actual hardware reprogramming > > >> would have to be done only once (in ACPI sleep and wake paths each), > > >> thus it might just turn out to be very beneficial after all > > >> (minus prolonging ACPI Cx path activity and thus aggravating CPU power > > >> savings, of course). > > >> > > >> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an > > >> article. > > >> > > >> OTOH even 185us is only 0.185ms, which, when compared to disk seek > > >> latency (around 7ms still, except for SSD), doesn't seem to be all that much. > > >> Or what kind of ballpark figure do you have for percentage of I/O > > >> deterioration? > > > I have lots of FIO sub test cases which test I/O on single disk and JBOD (a disk > > > bos which mostly has 12~13 disks) on nahelam machines. Your analysis on disk seek > > > is reasonable. I found sequential buffered read has the worst regression while rand > > > read is far better. For example, I start 12 processes per disk and every disk has 24 > > > 1-G files. There are 12 disks. The sequential read fio result is about 593MB/second > > > with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB. > > > > > > Another exmaple is single fio direct seqential read (block size is 4K) on a single > > > SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB with > > > idle=poll. > > > > > > How did I find C state has impact on disk I/O result? 
Frankly, I found a regression > > > between kernel 2.6.27 and 2.6.28. Bisect located a nonstop tsc patch, but the patch > > > is quite good. I found the patch changes the default clocksource from hpet to > > > tsc. Then, I tried all clocksources and got the best result with acpi_pm clocksource. > > > But oprofile data shows acpi_pm has more cpu utilization. clocksource jiffies has > > > worst result but least cpu utilization. As you know, fio calls gettimeofday frequently. > > > Then, I tried boot parameter processor.max_cstate and idle=poll. > > > I get the similar result with processor.max_cstate=1 like the one with idle=poll. > > > > > > > Is it possible that the different bandwidths figures are due to > > incorrect timing, instead of C-state latencies? > I'm not sure. > > > Entering a deep C state can cause strange things to timers: some of > > them, especially tsc, become unreliable. > > Maybe the patch you found that re-enables tsc is actually wrong for > > your machine, for which tsc is unreliable in deep C states. > I'm using a SDV machine, not an official product. But it's rare that cpuid > reports non-stop tsc feature while it doesn't support it. > > I tried different clocksources. For exmaple, I could get a better (30%) result with > hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu > time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer > C state transitions. > > With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll, > I didn't find result difference among different clocksources. > > > > > > I also run the testing on 2 stoakley machines and don't find such issues. > > > /proc/acpi/processor/CPUXXX/power shows stoakley cpu only has C1. > > > > > >> I'm wondering whether we might have an even bigger problem with disk I/O > > >> related to this than just the raw ACPI exit latency value itself. > > > We might have. I'm still doing more testing. 
With Venki's tool (write/read MSR registers), > > > I collected some C state switch stat. > > > > > You can see the latencies (expressed in us) on your machine with: > > [root@localhost corrado]# cat > > /sys/devices/system/cpu/cpu0/cpuidle/state*/latency > > 0 > > 0 > > 1 > > 133 > > > > Can you post your numbers, to see if they are unusually high? > [ymzhang@lkp-ne02 ~]$ cat /proc/acpi/processor/CPU0/power > active state: C0 > max_cstate: C8 > maximum allowed latency: 2000000000 usec > states: > C1: type[C1] promotion[--] demotion[--] latency[003] usage[00001661] duration[00000000000000000000] > C2: type[C3] promotion[--] demotion[--] latency[205] usage[00000687] duration[00000000000000732028] > C3: type[C3] promotion[--] demotion[--] latency[245] usage[00011509] duration[00000000000115186065] > > [ymzhang@lkp-ne02 ~]$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency > 0 > 3 > 205 > 245 > > > > > > Current cpuidle has a good consideration on cpu utilization, but doesn't have > > > consideration on devices. So with I/O delivery and interrupt drive model > > > with little cpu utilization, performance might be hurt if C state exit has a long > > > latency. Another interesting testing with netperf has the similiar behavior. I start 1 netperf client and bind client and server to different physical cpus to run a UDP-RR-1 loopback testing. The result is about 54000 without idle=poll while the one is 88000 with idle=poll. If I start CPU_NUM netperf clients, there is no such issue, because all cpu are busy. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-28 7:20 ` Corrado Zoccolo 2009-07-28 9:00 ` Zhang, Yanmin @ 2009-07-28 19:25 ` Len Brown 1 sibling, 0 replies; 21+ messages in thread From: Len Brown @ 2009-07-28 19:25 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Zhang, Yanmin, Andreas Mohr, LKML, linux-acpi > Entering a deep C state can cause strange things to timers: some of > them, especially tsc, become unreliable. The Nehalem family CPU has a non-stop constant-frequency TSC. The measurements that Yanmin quotes show that the TSC is the lowest overhead timesource in the system. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Dynamic configure max_cstate 2009-07-27 7:33 ` Andreas Mohr 2009-07-28 2:42 ` Zhang, Yanmin @ 2009-07-29 0:17 ` Len Brown 2009-07-29 8:00 ` Andreas Mohr 1 sibling, 1 reply; 21+ messages in thread From: Len Brown @ 2009-07-29 0:17 UTC (permalink / raw) To: Andreas Mohr; +Cc: Zhang, Yanmin, LKML, linux-acpi thanks, Len Brown, Intel Open Source Technology Center On Mon, 27 Jul 2009, Andreas Mohr wrote: > Hi, > > > When running a fio workload, I found sometimes cpu C state has > > big impact on the result. Mostly, fio is a disk I/O workload > > which doesn't spend much time with cpu, so cpu switch to C2/C3 > > freqently and the latency is big. > > Rather than inventing ways to limit ACPI Cx state usefulness, we should > perhaps be thinking of what's wrong here. > > And your complaint might just fit into a thought I had recently: > are we actually taking ACPI Cx exit latency into account, for timers??? Yes. menu_select() calls tick_nohz_get_sleep_length() specifically to compare the expiration of the next timer vs. the expected sleep length. The problem here is likely that the expected sleep length is shorter than expected, for IO interrupts are not timers... Thus we add long deep C-state wakeup time to the IO interrupt latency... -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 21+ messages in thread
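The comparison Len describes can be modeled in a few lines (a sketch, not the actual menu_select() code): pick the deepest C-state whose exit latency fits within the expected sleep length. The failure mode is then visible directly: when the sleep is cut short by an I/O interrupt instead of the predicted timer, the chosen state's full exit latency lands on top of the interrupt latency. Yanmin's advertised latencies of 3/205/245 us are used in the example.

```c
#include <stdint.h>

struct cstate {
	uint32_t exit_latency_us;
};

/*
 * Pick the deepest C-state whose exit latency is below the expected
 * sleep length. States are assumed ordered shallowest to deepest,
 * with state 0 (C1-like) always usable.
 */
int select_cstate(const struct cstate *states, int nr,
		  uint32_t expected_sleep_us)
{
	int best = 0;
	int i;

	for (i = 1; i < nr; i++)
		if (states[i].exit_latency_us < expected_sleep_us)
			best = i;
	return best;
}
```

With a 4 ms timer horizon this happily picks the 245 us state; if an I/O completion then arrives after, say, 100 us, the I/O handling pays nearly the full 245 us on top, which is exactly the latency the fio workload observes.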
* Re: Dynamic configure max_cstate 2009-07-29 0:17 ` Len Brown @ 2009-07-29 8:00 ` Andreas Mohr 0 siblings, 0 replies; 21+ messages in thread From: Andreas Mohr @ 2009-07-29 8:00 UTC (permalink / raw) To: Len Brown Cc: Andreas Mohr, Zhang, Yanmin, Thomas Gleixner, mingo, LKML, linux-acpi Hi, On Tue, Jul 28, 2009 at 08:17:09PM -0400, Len Brown wrote: > > And your complaint might just fit into a thought I had recently: > > are we actually taking ACPI Cx exit latency into account, for timers??? > > Yes. > menu_select() calls tick_nohz_get_sleep_length() specifically > to compare the expiration of the next timer vs. the expected sleep length. > > The problem here is likely that the expected sleep length > is shorter than expected, for IO interrupts are not timers... > Thus we add long deep C-state wakeup time to the IO interrupt latency... Well, but... the code does not work according to my idea about this. The code currently checks against the expected sleep length and throws away any large exit latencies that don't fit. What I was thinking how to handle this is entirely different (and, frankly, I'm not sure whether it would have any advantage, but still): actively _subtract_ the idle exit latency from the timer expiration time (i.e., reprogram the timer on idle entry and again on idle exit if not expired yet) to make sure that the timer fires correctly despite having to handle the idle exit, too. OTOH while this might allow deeper Cx states, it's most likely a weaker solution than the current implementation, since it requires up to two times additional timer reprogramming. And additionally taking into account I/O-inflicted idle exit can be implemented pretty easily alongside the existing tick_nohz_get_sleep_length() mechanism. 
The code still causes some additional uneasiness, such as:
tick_nohz_get_sleep_length() returns dev->next_event - now, but (pushed
through all the ACPI latency hardware-wise) the actual timer appearance
after cpu wakeup might be entirely random. There should be a feedback
mechanism which measures when a timer was expected and when it then
_actually_ turned up, to cancel out the delay effects of ACPI idle
entry/exit.

== i.e. we seem to be calculating these things based on what we _think_
the machine is doing, not on what we _know_ about its previous
behaviour == - since we don't have a feedback loop... IMHO this is an
important missing element here; if such a feedback loop was implemented,
then timer wakeups would be much more precise, which incidentally would
result in improved machine performance. (CC Thomas)

And spinning this a bit further - let me guess (I didn't check it) that
hard realtime users are always quick to disable ACPI Cx completely? With
such a mechanism they shouldn't need to, since the timer is programmed
according to _actual_ CPU wakeup time, not when we _think_ it might wake
up. (CC Ingo)

I just realized that such a feedback loop (resulting in possibly
early-programmed timers) would then need my timer reprogramming
mechanism again (after ACPI idle exit), to avoid early timer triggering.
However, ultimately I think it might turn out to be a much better
solution to precisely _determine_ timer firing than to simply
statically, mechanically (blindly!) pre-set the time around which a
timer "might be expected to be fired".

An annoyingly simple sentence to phrase the current situation:
"With ACPI idle configured, high-res timers aren't."

Or am I wrong and the current implementation is doing all this already?
I didn't see that, though...

Andreas Mohr

^ permalink raw reply [flat|nested] 21+ messages in thread
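[Editor's note: in its simplest form, the feedback loop proposed above would track the signed error between when a timer was programmed and when the CPU was actually observed to wake, then fold that into the next programming. A minimal sketch — all names are invented, and the 1/8 smoothing weight is arbitrary.]

```c
#include <assert.h>

struct wake_feedback {
    long avg_error_us;  /* smoothed (actual - expected) wakeup error */
};

/* Record one observation of when a timer _actually_ turned up vs. when
 * it was expected, smoothed with an exponential moving average
 * (alpha = 1/8), so one noisy wakeup doesn't dominate. */
static void feedback_record(struct wake_feedback *f,
                            long expected_us, long actual_us)
{
    long err = actual_us - expected_us;
    f->avg_error_us += (err - f->avg_error_us) / 8;
}

/* Program the next timer early by the learned average error, so the
 * observed firing time converges on the requested one. */
static long feedback_compensated_expiry(const struct wake_feedback *f,
                                        long nominal_expires_us)
{
    return nominal_expires_us - f->avg_error_us;
}
```

Note this reproduces the catch Andreas points out himself: once timers are deliberately programmed early, an early non-timer wakeup needs the reprogramming step from his earlier proposal to avoid a premature fire.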
* Re: Dynamic configure max_cstate
  2009-07-27  5:30 Dynamic configure max_cstate Zhang, Yanmin
  2009-07-27  7:33 ` Andreas Mohr
@ 2009-07-28 19:47 ` Len Brown
  1 sibling, 0 replies; 21+ messages in thread

From: Len Brown @ 2009-07-28 19:47 UTC (permalink / raw)
To: Zhang, Yanmin; +Cc: LKML, linux-acpi, yakui_zhao, Arjan van de Ven

> When running a fio workload, I found sometimes cpu C state has
> big impact on the result. Mostly, fio is a disk I/O workload
> which doesn't spend much time with cpu, so cpu switch to C2/C3
> freqently and the latency is big.
>
> If I start kernel with idle=poll or processor.max_cstate=1,
> the result is quite good. Consider a scenario that machine is
> busy at daytime and free at night. Could we add a dynamic
> configuration interface for processor.max_cstate or something
> similiar with sysfs? So user applications could change the
> max_cstate dynamically? For example, we could add a new
> parameter to function cpuidle_governor->select to mark the
> highest c state.

max_cstate is a debug param. It isn't a run-time API and never will be.
User-space shouldn't need to know or care about C-states, and if it
appears it needs to, then we have a bug we need to fix.

The interface in Documentation/power/pm_qos_interface.txt is supposed
to handle this. Though if the underlying code is not noticing IO
interrupts, then it can't help.

Another thing to look at is processor.latency_factor, which you can
change at run-time in /sys/module/processor/parameters/latency_factor

We multiply the advertised exit latency by this before deciding to
enter a C-state. The concept is that ACPI reports a performance number,
but what we really want is a power break-even. Anyway, we know the
default multiple is too low, and will be raising it shortly.

Of course if the current code is not predicting any IO interrupts on
your IO-only workload, this, like pm_qos, will not help.
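[Editor's note: the latency_factor knob Len mentions amounts to scaling the advertised exit latency before the break-even comparison. A sketch of that check — the function name and the factor value in the test are illustrative, not the kernel's actual code; only the sysfs path comes from the message above.]

```c
#include <assert.h>

/* Return nonzero if a C-state is still worth entering once its ACPI
 * advertised exit latency is scaled by latency_factor (the run-time
 * tunable exposed at /sys/module/processor/parameters/latency_factor).
 * Scaling up the advertised number biases the governor away from deep
 * states whose power break-even point is worse than their raw
 * performance latency suggests. */
static int cstate_allowed(int advertised_latency_us,
                          int latency_factor,
                          int allowed_latency_us)
{
    return advertised_latency_us * latency_factor <= allowed_latency_us;
}
```

Raising the factor at run time therefore has an effect loosely similar to lowering max_cstate, but through a tunable that is meant to be changed, which is presumably why Len points to it here.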
cheers, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2009-07-31 17:37 UTC | newest]

Thread overview: 21+ messages
2009-07-27  5:30 Dynamic configure max_cstate Zhang, Yanmin
2009-07-27  7:33 ` Andreas Mohr
2009-07-28  2:42 ` Zhang, Yanmin
2009-07-28  7:20 ` Corrado Zoccolo
2009-07-28  9:00 ` Zhang, Yanmin
2009-07-28 10:11 ` Andreas Mohr
2009-07-28 14:03 ` Andreas Mohr
2009-07-28 17:35 ` ok, now would this be useful? (Re: Dynamic configure max_cstate) Andreas Mohr
2009-07-29  8:20 ` Dynamic configure max_cstate Zhang, Yanmin
2009-07-31  3:43 ` Robert Hancock
2009-07-31  7:06 ` Zhang, Yanmin
2009-07-31  8:07 ` Andreas Mohr
2009-07-31 14:40 ` Andi Kleen
2009-07-31 14:56 ` Michael S. Zick
2009-07-31 17:37 ` Pallipadi, Venkatesh
2009-07-31 15:14 ` Len Brown
2009-07-30  6:28 ` Zhang, Yanmin
2009-07-28 19:25 ` Len Brown
2009-07-29  0:17 ` Len Brown
2009-07-29  8:00 ` Andreas Mohr
2009-07-28 19:47 ` Len Brown