Re: Dynamic configure max_cstate

From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
To: Corrado Zoccolo <czoccolo@gmail.com>
Cc: Andreas Mohr <andi@lisas.de>, LKML <linux-kernel@vger.kernel.org>,
	linux-acpi@vger.kernel.org, Len Brown <lenb@kernel.org>
Subject: Re: Dynamic configure max_cstate
Date: Thu, 30 Jul 2009 14:28:00 +0800
Message-ID: <1248935280.2560.733.camel@ymzhang>
In-Reply-To: <1248771635.2560.682.camel@ymzhang>

On Tue, 2009-07-28 at 17:00 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-07-28 at 09:20 +0200, Corrado Zoccolo wrote:
> > Hi,
> > On Tue, Jul 28, 2009 at 4:42 AM, Zhang,
> > Yanmin<yanmin_zhang@linux.intel.com> wrote:
> > > On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
> > >> Hi,
> > >>
> > >> > When running a fio workload, I found sometimes cpu C state has
> > >> > big impact on the result. Mostly, fio is a disk I/O workload
> > >> > which doesn't spend much time on the cpu, so the cpu switches to C2/C3
> > >> > frequently and the latency is big.
> > >>
> > >> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> > >> perhaps be thinking of what's wrong here.
> > > Andreas,
> > >
> > > Thanks for your kind comments.
> > >
> > >>
> > >> And your complaint might just fit into a thought I had recently:
> > >> are we actually taking ACPI Cx exit latency into account, for timers???
> > > I tried both tickless and non-tickless kernels. The results are similar.
> > >
> > > Originally, I also thought it's related to timer. As you know, I/O block layer
> > > has many timers. Such timers don't expire normally. For example, an I/O request
> > > is submitted to driver and driver delivers it to disk and hardware triggers
> > > an interrupt after finishing I/O. Mostly, the I/O submit and interrupt, not
> > > the timer, drive the I/O.
> > >
> > >>
> > >> If we program a timer to fire at some point, then it is quite imaginable
> > >> that any ACPI Cx exit latency due to the CPU being idle at that moment
> > >> could add to actual timer trigger time significantly.
> > >>
> > >> To combat this, one would need to tweak the timer expiration time
> > >> to include the exit latency. But of course once the CPU is running
> > >> again, one would need to re-add the latency amount (read: reprogram the
> > >> timer hardware, ugh...) to prevent the timer from firing too early.
> > >>
> > >> Given that one would need to reprogram timer hardware quite often,
> > >> I don't know whether taking Cx exit latency into account is feasible.
> > >> OTOH analysis of the single next timer value and actual hardware reprogramming
> > >> would have to be done only once (in ACPI sleep and wake paths each),
> > >> thus it might just turn out to be very beneficial after all
> > >> (minus prolonging ACPI Cx path activity and thus aggravating CPU power
> > >> savings, of course).
> > >>
> > >> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
> > >> article.
> > >>
> > >> OTOH even 185us is only 0.185ms, which, when compared to disk seek
> > >> latency (around 7ms still, except for SSD), doesn't seem to be all that much.
> > >> Or what kind of ballpark figure do you have for percentage of I/O
> > >> deterioration?
> > > I have lots of FIO sub test cases which test I/O on a single disk and on JBOD (a disk
> > > box which mostly has 12~13 disks) on Nehalem machines. Your analysis on disk seek
> > > is reasonable. I found sequential buffered read has the worst regression while random
> > > read is far better. For example, I start 12 processes per disk and every disk has 24
> > > 1-G files. There are 12 disks. The sequential read fio result is about 593MB/s
> > > with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB.
> > >
> > > Another example is a single fio direct sequential read (block size is 4K) on a single
> > > SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB/s with
> > > idle=poll.
> > >
> > > How did I find C state has impact on disk I/O result? Frankly, I found a regression
> > > between kernel 2.6.27 and 2.6.28. Bisect located a nonstop tsc patch, but the patch
> > > is quite good. I found the patch changes the default clocksource from hpet to
> > > tsc. Then, I tried all clocksources and got the best result with acpi_pm clocksource.
> > > But oprofile data shows acpi_pm has more cpu utilization. clocksource jiffies has
> > > worst result but least cpu utilization. As you know, fio calls gettimeofday frequently.
> > > Then, I tried boot parameter processor.max_cstate and idle=poll.
> > > With processor.max_cstate=1 I get a result similar to the one with idle=poll.
> > >
> > 
> > Is it possible that the different bandwidths figures are due to
> > incorrect timing, instead of C-state latencies?
> I'm not sure.
> 
> > Entering a deep C state can cause strange things to timers: some of
> > them, especially tsc, become unreliable.
> > Maybe the patch you found that re-enables tsc is actually wrong for
> > your machine, for which tsc is unreliable in deep C states.
> I'm using a SDV machine, not an official product. But it's rare that cpuid
> reports non-stop tsc feature while it doesn't support it.
> 
> I tried different clocksources. For example, I could get a better (30%) result with
> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu
> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer
> C state transitions.
> 
> With idle=poll, the result is about 10% better than the one with hpet. With idle=poll,
> I didn't find any difference in results among the clocksources.
> 
> > 
> > > I also ran the tests on 2 Stoakley machines and didn't find such issues.
> > > /proc/acpi/processor/CPUXXX/power shows stoakley cpu only has C1.
> > >
> > >> I'm wondering whether we might have an even bigger problem with disk I/O
> > >> related to this than just the raw ACPI exit latency value itself.
> > > We might have. I'm still doing more testing. With Venki's tool (write/read MSR registers),
> > > I collected some C state switch stat.
> > >
> > You can see the latencies (expressed in us) on your machine with:
> > [root@localhost corrado]# cat
> > /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> > 0
> > 0
> > 1
> > 133
> > 
> > Can you post your numbers, to see if they are unusually high?
> [ymzhang@lkp-ne02 ~]$ cat /proc/acpi/processor/CPU0/power 
> active state:            C0
> max_cstate:              C8
> maximum allowed latency: 2000000000 usec
> states:
>     C1:                  type[C1] promotion[--] demotion[--] latency[003] usage[00001661] duration[00000000000000000000]
>     C2:                  type[C3] promotion[--] demotion[--] latency[205] usage[00000687] duration[00000000000000732028]
>     C3:                  type[C3] promotion[--] demotion[--] latency[245] usage[00011509] duration[00000000000115186065]
> 
> [ymzhang@lkp-ne02 ~]$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> 0
> 3
> 205
> 245
> 
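
For collecting the same per-state numbers on several machines, a small
Python sketch along these lines might help. It reads the name and latency
files that cpuidle exposes for each state. The function name
read_cpuidle_states and its parameters are made up for illustration, and the
sysfs layout is assumed to match the path used in the cat command above.

```python
from pathlib import Path


def read_cpuidle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return a list of (state_dir, name, exit_latency_us) for one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort by the numeric suffix so that state10 does not come before state2.
    state_dirs = sorted(
        base.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for d in state_dirs:
        name = (d / "name").read_text().strip()
        latency_us = int((d / "latency").read_text().strip())
        states.append((d.name, name, latency_us))
    return states


if __name__ == "__main__":
    found = read_cpuidle_states()
    if not found:
        print("no cpuidle states found for cpu0")
    for state, name, latency_us in found:
        print(f"{state}: {name:<10} exit latency {latency_us} us")
```

Unlike a shell glob over state*/latency, this prints the state name next to
each latency, so the values cannot be matched to the wrong state.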
> > 
> > > Current cpuidle takes cpu utilization into account well, but doesn't take
> > > devices into account. So with an I/O model driven by request delivery and
> > > interrupts, with little cpu utilization, performance might be hurt if C state
> > > exit has a long latency.
Another interesting test with netperf shows similar behavior. I start 1 netperf client
and bind the client and server to different physical cpus to run a UDP-RR-1 loopback test.
The result is about 54000 without idle=poll and about 88000 with idle=poll.

If I start CPU_NUM netperf clients, there is no such issue, because all cpus are busy.
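
To put the figures quoted in this mail side by side, here is a small Python
sketch of the arithmetic. It uses only the numbers given above. The helper
names gain_percent and rr_time_us are illustrative, and treating the netperf
result as transactions per second is an assumption, since the unit is not
stated.

```python
def gain_percent(baseline, with_poll):
    """Relative improvement of the idle=poll result over the default, in %."""
    return (with_poll - baseline) / baseline * 100.0


def rr_time_us(transactions_per_sec):
    """Average time per request/response pair in microseconds.

    Assumes the netperf figure is transactions per second.
    """
    return 1e6 / transactions_per_sec


# (description, result without idle=poll, result with idle=poll)
RESULTS = [
    ("fio sequential buffered read, 12 disks (MB/s)", 375.0, 593.0),
    ("fio direct sequential read, 1 SATA disk (MB/s)", 28.0, 32.5),
    ("netperf UDP-RR-1 loopback (assumed trans/s)", 54000.0, 88000.0),
]

if __name__ == "__main__":
    for label, default, poll in RESULTS:
        print(f"{label}: +{gain_percent(default, poll):.1f}% with idle=poll")
    # Extra time each netperf round trip takes without idle=poll.
    extra_us = rr_time_us(54000.0) - rr_time_us(88000.0)
    print(f"netperf: about {extra_us:.1f} us extra per transaction")
```

That works out to roughly 58%, 16% and 63% respectively, and about 7 us of
extra time per netperf transaction without idle=poll.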
