Re: Dynamic configure max_cstate

From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
To: Andreas Mohr <andi@lisas.de>
Cc: LKML <linux-kernel@vger.kernel.org>, linux-acpi@vger.kernel.org
Subject: Re: Dynamic configure max_cstate
Date: Tue, 28 Jul 2009 10:42:15 +0800
Message-ID: <1248748935.2560.669.camel@ymzhang>
In-Reply-To: <20090727073338.GA12669@rhlx01.hs-esslingen.de>

On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
> Hi,
> 
> > When running a fio workload, I found sometimes cpu C state has
> > big impact on the result. Mostly, fio is a disk I/O workload
> > which doesn't spend much time with cpu, so cpu switch to C2/C3
> > frequently and the latency is big.
> 
> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> perhaps be thinking of what's wrong here.
Andreas,

Thanks for your kind comments.

> 
> And your complaint might just fit into a thought I had recently:
> are we actually taking ACPI Cx exit latency into account, for timers???
I tried both tickless and non-tickless kernels. The results are similar.

Originally, I also thought it was related to timers. As you know, the I/O block layer
has many timers, but such timers normally don't expire. For example, an I/O request
is submitted to the driver, the driver delivers it to the disk, and the hardware triggers
an interrupt after finishing the I/O. Mostly, the I/O submission and the interrupt, not
the timer, drive the I/O.

> 
> If we program a timer to fire at some point, then it is quite imaginable
> that any ACPI Cx exit latency due to the CPU being idle at that moment
> could add to actual timer trigger time significantly.
> 
> To combat this, one would need to tweak the timer expiration time
> to include the exit latency. But of course once the CPU is running
> again, one would need to re-add the latency amount (read: reprogram the
> timer hardware, ugh...) to prevent the timer from firing too early.
> 
> Given that one would need to reprogram timer hardware quite often,
> I don't know whether taking Cx exit latency into account is feasible.
> OTOH analysis of the single next timer value and actual hardware reprogramming
> would have to be done only once (in ACPI sleep and wake paths each),
> thus it might just turn out to be very beneficial after all
> (minus prolonging ACPI Cx path activity and thus aggravating CPU power
> savings, of course).
> 
> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
> article.
> 
> OTOH even 185us is only 0.185ms, which, when compared to disk seek
> latency (around 7ms still, except for SSD), doesn't seem to be all that much.
> Or what kind of ballpark figure do you have for percentage of I/O
> deterioration?
I have lots of fio sub test cases which test I/O on a single disk and on a JBOD (a disk
box which mostly has 12~13 disks) on Nehalem machines. Your analysis of disk seek
is reasonable. I found sequential buffered read has the worst regression while random
read is far better. For example, I start 12 processes per disk and every disk has 24
1GB files. There are 12 disks. The sequential read fio result is about 593MB/s
with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB.

Another example is a single fio direct sequential read (block size is 4KB) on a single
SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB/s with
idle=poll.
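
To put the two results side by side, here is a small back-of-envelope sketch.
It uses only the throughput figures quoted in this message. The helper names
are made up for illustration, and the per-request figure assumes 1 MB = 10^6
bytes and a single stream issuing one 4KB request at a time, so treat it as
rough arithmetic rather than a measurement.

```python
# Throughput figures (MB/s) are the ones quoted in this message.
# drop_percent and usec_per_request are hypothetical helper names.

def drop_percent(with_poll, without_poll):
    """Throughput lost without idle=poll, relative to the idle=poll result."""
    return (with_poll - without_poll) / with_poll * 100.0


def usec_per_request(mb_per_s, block_bytes=4096, mb=1_000_000):
    """Average time per block for one stream (assumes 1 MB = 10^6 bytes)."""
    return block_bytes / (mb_per_s * mb) * 1e6


jbod = drop_percent(593.0, 375.0)   # 12-disk JBOD, buffered sequential read
sata = drop_percent(32.5, 28.0)     # single SATA disk, direct sequential read
extra = usec_per_request(28.0) - usec_per_request(32.5)

print(f"JBOD: {jbod:.1f}% lower without idle=poll")
print(f"single SATA disk: {sata:.1f}% lower without idle=poll")
print(f"single SATA disk: about {extra:.1f} us more per 4KB request")
```

That works out to roughly 37% on the JBOD and 14% on the single disk. Under
the assumptions above, the single-disk case costs about 20us extra per 4KB
request, which lies between the C2 and C3/C4 exit latencies quoted above.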

How did I find that C state has an impact on the disk I/O result? Frankly, I found a
regression between kernel 2.6.27 and 2.6.28. Bisect located a nonstop tsc patch, but the
patch itself is quite good. I found the patch changes the default clocksource from hpet to
tsc. Then, I tried all clocksources and got the best result with the acpi_pm clocksource.
But oprofile data shows acpi_pm has more cpu utilization. The jiffies clocksource has the
worst result but the least cpu utilization. As you know, fio calls gettimeofday frequently.
Then, I tried the boot parameters processor.max_cstate and idle=poll.
I get a similar result with processor.max_cstate=1 as with idle=poll.
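
For anyone repeating the clocksource comparison, a minimal sketch of how to
check which clocksource is active. It assumes the usual sysfs clocksource
files are present on the test machine. The function name is made up for
illustration, and this is not the procedure used for the results above.

```python
from pathlib import Path

# Standard sysfs location of the clocksource files; assumed to exist here.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current, available) clocksources as reported by sysfs."""
    base = Path(sysfs_dir)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    if SYSFS_DIR.is_dir():
        current, available = read_clocksources()
        print("current:", current)
        print("available:", " ".join(available))
    else:
        print("clocksource sysfs directory not found")
```

Switching is done by writing one of the available names to
current_clocksource as root, or with the clocksource= boot parameter.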

I also ran the tests on 2 Stoakley machines and didn't find such issues.
/proc/acpi/processor/CPUXXX/power shows the Stoakley cpus only have C1.

> I'm wondering whether we might have an even bigger problem with disk I/O
> related to this than just the raw ACPI exit latency value itself.
We might have. I'm still doing more testing. With Venki's tool (which reads/writes MSR
registers), I collected some C state switch statistics.

The current cpuidle takes cpu utilization into account well, but doesn't take devices
into account. So with an I/O delivery and interrupt driven model that has little cpu
utilization, performance might be hurt if C state exit has a long latency.
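
One existing way for userspace to tell the kernel that exit latency matters
is the PM QoS file /dev/cpu_dma_latency. This message does not use it; the
sketch below only illustrates the idea of bounding C state exit latency while
an I/O job runs. It assumes the kernel provides that file and that the caller
has permission to open it. The function name is made up for illustration.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_dma_latency(max_usec, path="/dev/cpu_dma_latency"):
    """Hold a PM QoS latency request for as long as the with-block runs."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The kernel expects a binary 32-bit value in microseconds.
        os.write(fd, struct.pack("=i", max_usec))
        yield
    finally:
        # Closing the file descriptor drops the request again.
        os.close(fd)
```

Usage would look like "with cpu_dma_latency(10): ..." wrapped around the
benchmark run, which asks the idle code to avoid states whose exit latency
exceeds 10us for that period only.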

Yanmin
