linux-kernel.vger.kernel.org archive mirror
* RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
@ 2013-07-26 17:33 Jeremy Eder
  2013-07-26 18:13 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jeremy Eder @ 2013-07-26 17:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: rafael.j.wysocki, riel, youquan.song, paulmck, daniel.lezcano,
	arjan, len.brown

Hello,

We believe we've identified a particular commit to the cpuidle code that
seems to be impacting the performance of a variety of workloads.  The
simplest way to reproduce it is the netperf TCP_RR test, so we're using
that on a pair of Sandy Bridge based servers.  We also have data from a
large database setup where performance is also measurably and positively
impacted, though that test data isn't easily shareable.

Included below are test results from 3 test kernels:

kernel       reverts
-----------------------------------------------------------
1) vanilla   upstream (no reverts)

2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c

3) test      reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
                     e11538d1f03914eb92af5a1a378375c05ae8520c

In summary, netperf TCP_RR numbers improve by approximately 4% after
reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
seems to get above 40%.  Taking that patch out gets C0 near 100% quite
often, and performance increases.

The below data are histograms representing the %c0 residency @ 1-second
sample rates (using turbostat), while under netperf test.

- If you look at the first four histograms, %c0 residency falls almost
  entirely in the 30-40% bin.
- The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
  shows %c0 in the 80-100% bins.
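For reference, the binning used here can be reproduced from raw turbostat
%c0 samples with a short post-processing sketch (the 10% bin width and the
one-star-per-sample rendering are assumptions chosen to match the output
below; `c0_histogram` is our name, not a turbostat feature):

```python
def c0_histogram(samples, bin_width=10.0, bins=10):
    """Bin per-second %c0 residency samples (0..100) into fixed-width
    buckets and render them as a star histogram, one star per sample."""
    counts = [0] * bins
    for s in samples:
        # Clamp 100.0 into the top bin instead of indexing past the end.
        idx = min(int(s // bin_width), bins - 1)
        counts[idx] += 1
    rows = []
    for i, n in enumerate(counts):
        lo, hi = i * bin_width, (i + 1) * bin_width
        rows.append("%10.4f - %10.4f [%6d]: %s" % (lo, hi, n, "*" * n))
    return "\n".join(rows)

# 61 one-second samples: mostly stuck in the 30-40% bin, as in the
# vanilla-kernel receiver histogram below.
print(c0_histogram([5.0] + [35.2] * 59 + [42.1]))
```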

Below each kernel name are the netperf TCP_RR trans/s numbers for that
kernel that can be disclosed publicly, comparing the 3 test kernels.  We
ran a 4th test with the vanilla kernel where we also set
/dev/cpu_dma_latency=0, to show the overall impact: it boosts
single-threaded TCP_RR performance more than 11% above baseline.
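The /dev/cpu_dma_latency trick works by holding a PM QoS request open:
writing a 32-bit latency value (in microseconds) registers the request, and
the kernel drops it when the file descriptor is closed.  A minimal sketch
(the helper name is ours; on a real system this needs root, so the path is
parameterized purely for illustration):

```python
import struct

def hold_cpu_dma_latency(max_latency_us=0, path="/dev/cpu_dma_latency"):
    """Register a PM QoS CPU latency request by writing a 32-bit
    little-endian value (microseconds) to /dev/cpu_dma_latency.  The
    kernel honors the constraint only while the file stays open, so the
    caller must hold on to the returned file object for the duration of
    the benchmark."""
    f = open(path, "wb", buffering=0)
    f.write(struct.pack("<I", max_latency_us))
    return f  # closing this releases the latency constraint
```

For example: `qos = hold_cpu_dma_latency(0)` before starting netperf, then
`qos.close()` once the run is done.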

3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):  
TCP_RR trans/s 54323.78

-----------------------------------------------------------
3.10-rc2 vanilla RX (no reverts)
TCP_RR trans/s 48192.47

Receiver %c0 
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    59]: ***********************************************************
   40.0000 -    50.0000 [     1]: *
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

Sender %c0
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    11]: ***********
   40.0000 -    50.0000 [    49]: *************************************************
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

-----------------------------------------------------------
3.10-rc2 perfteam2 RX (reverts commit
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 49698.69

Receiver %c0 
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     1]: *
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    59]: ***********************************************************
   40.0000 -    50.0000 [     0]: 
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

Sender %c0 
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [     2]: **
   40.0000 -    50.0000 [    58]: **********************************************************
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

-----------------------------------------------------------
3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 47766.95

Receiver %c0
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     1]: *
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    27]: ***************************
   40.0000 -    50.0000 [     2]: **
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     2]: **
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [    28]: ****************************

Sender:
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    11]: ***********
   40.0000 -    50.0000 [     0]: 
   50.0000 -    60.0000 [     1]: *
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     3]: ***
   80.0000 -    90.0000 [     7]: *******
   90.0000 -   100.0000 [    38]: **************************************

These results show that reverting commit
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 restores the CPU's tendency to
stay in the more responsive, performant C-states, and thus yields
measurably better performance.

While taking into account the changing landscape of CPU governors and
both P- and C-states, we think that a single thread should still be able
to achieve maximum performance.  With the current upstream code base,
workloads with a low number of "hot" threads cannot achieve maximum
performance "out of the box".

Also, Intel's LAD has recently posted upstream performance results that
include an interesting column in their table of results.  See upstream
commit 0a4db187a999, column #3 within the "Performance numbers" table.
It seems known, even within Intel, that the deeper C-states incur a cost
too high to bear; they explicitly tested with the CPU restricted to the
shallower states C0/C1.
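The trade-off in question shows up in the governor's basic selection rule:
pick the deepest C-state whose exit latency fits the current PM QoS bound
and whose target residency fits the predicted idle time.  A simplified
sketch (the state table values are illustrative, not the actual intel_idle
tables, and `select_idle_state` is our name):

```python
# (state name, exit_latency_us, target_residency_us) -- illustrative values
C_STATES = [
    ("C0-poll", 0,   0),
    ("C1",      2,   2),
    ("C3",      80,  211),
    ("C6",      104, 345),
]

def select_idle_state(predicted_idle_us, latency_req_us):
    """Pick the deepest state whose exit latency respects the PM QoS
    latency bound and whose target residency fits the predicted idle
    period.  With /dev/cpu_dma_latency=0 only the polling state ever
    qualifies, which is why that setting pins the CPU in C0."""
    best = C_STATES[0]
    for state in C_STATES[1:]:
        name, exit_lat, residency = state
        if exit_lat <= latency_req_us and residency <= predicted_idle_us:
            best = state
    return best[0]
```

So a short predicted idle (or a tight latency bound) keeps the CPU shallow,
while a long predicted idle with no constraint lands in the deepest state.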

-- Jeremy Eder

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 17:33 RFC: revert request for cpuidle patches e11538d1 and 69a37bea Jeremy Eder
@ 2013-07-26 18:13 ` Rik van Riel
  2013-07-26 18:27   ` Arjan van de Ven
  2013-07-27  6:43 ` Daniel Lezcano
  2013-07-30  3:57 ` Youquan Song
  2 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2013-07-26 18:13 UTC (permalink / raw)
  To: Jeremy Eder
  Cc: linux-kernel, rafael.j.wysocki, youquan.song, paulmck,
	daniel.lezcano, arjan, len.brown

On 07/26/2013 01:33 PM, Jeremy Eder wrote:
> Hello,
>
> We believe we've identified a particular commit to the cpuidle code that
> seems to be impacting the performance of a variety of workloads.  The
> simplest way to reproduce it is the netperf TCP_RR test, so we're using
> that on a pair of Sandy Bridge based servers.  We also have data from a
> large database setup where performance is also measurably and positively
> impacted, though that test data isn't easily shareable.
>
> Included below are test results from 3 test kernels:
>
> kernel       reverts
> -----------------------------------------------------------
> 1) vanilla   upstream (no reverts)
>
> 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c
>
> 3) test      reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
>                       e11538d1f03914eb92af5a1a378375c05ae8520c
>
> In summary, netperf TCP_RR numbers improve by approximately 4% after
> reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
> 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
> seems to get above 40%.  Taking that patch out gets C0 near 100% quite
> often, and performance increases.

Could you try running the tests with just the repeat mode
stuff from commit 69a37bea excluded, but leaving the common
infrastructure and commit e11538?

-- 
All rights reversed


* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 18:13 ` Rik van Riel
@ 2013-07-26 18:27   ` Arjan van de Ven
  2013-07-26 18:29     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Arjan van de Ven @ 2013-07-26 18:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jeremy Eder, linux-kernel, rafael.j.wysocki, youquan.song,
	paulmck, daniel.lezcano, len.brown

On 7/26/2013 11:13 AM, Rik van Riel wrote:

>
> Could you try running the tests with just the repeat mode
> stuff from commit 69a37bea excluded, but leaving the common
> infrastructure and commit e11538?
>

personally I think we should go the other way around.
revert the set entirely first, and now, and get our performance back
to what it should be

and then see what we can add back without causing the regressions.
this may take longer, or be done in steps, and that's ok.

the end point may well be the same... but we can then evaluate in the right
direction.


* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 18:27   ` Arjan van de Ven
@ 2013-07-26 18:29     ` Rik van Riel
  2013-07-26 21:48       ` Rafael J. Wysocki
  2013-07-27  7:11       ` Daniel Lezcano
  0 siblings, 2 replies; 19+ messages in thread
From: Rik van Riel @ 2013-07-26 18:29 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jeremy Eder, linux-kernel, rafael.j.wysocki, youquan.song,
	paulmck, daniel.lezcano, len.brown

On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
> On 7/26/2013 11:13 AM, Rik van Riel wrote:
>
>>
>> Could you try running the tests with just the repeat mode
>> stuff from commit 69a37bea excluded, but leaving the common
>> infrastructure and commit e11538?
>>
>
> personally I think we should go the other way around.
> revert the set entirely first, and now, and get our performance back
> to what it should be
>
> and then see what we can add back without causing the regressions.
> this may take longer, or be done in steps, and that's ok.
>
> the end point may well be the same... but we can then evaluate in the right
> direction.

Works for me. I have no objection to reverting both patches,
if the people planning to fix the code prefer that :)

-- 
All rights reversed


* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 18:29     ` Rik van Riel
@ 2013-07-26 21:48       ` Rafael J. Wysocki
  2013-07-27  0:36         ` Rafael J. Wysocki
  2013-07-27  7:11       ` Daniel Lezcano
  1 sibling, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2013-07-26 21:48 UTC (permalink / raw)
  To: Rik van Riel, Jeremy Eder
  Cc: Arjan van de Ven, linux-kernel, rafael.j.wysocki, youquan.song,
	paulmck, daniel.lezcano, len.brown, Linux PM list

On Friday, July 26, 2013 02:29:40 PM Rik van Riel wrote:
> On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
> > On 7/26/2013 11:13 AM, Rik van Riel wrote:
> >
> >>
> >> Could you try running the tests with just the repeat mode
> >> stuff from commit 69a37bea excluded, but leaving the common
> >> infrastructure and commit e11538?
> >>
> >
> > personally I think we should go the other way around.
> > revert the set entirely first, and now, and get our performance back
> > to what it should be
> >
> > and then see what we can add back without causing the regressions.
> > this may take longer, or be done in steps, and that's ok.
> >
> > the end point may well be the same... but we can then evaluate in the right
> > direction.
> 
> Works for me. I have no objection to reverting both patches,
> if the people planning to fix the code prefer that :)

OK, I'll queue up the reverts as fixes for 3.11-rc4.

Thanks,
Rafael



* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 21:48       ` Rafael J. Wysocki
@ 2013-07-27  0:36         ` Rafael J. Wysocki
  2013-07-27  6:22           ` Len Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2013-07-27  0:36 UTC (permalink / raw)
  To: Rik van Riel, Jeremy Eder
  Cc: Arjan van de Ven, linux-kernel, rafael.j.wysocki, youquan.song,
	paulmck, daniel.lezcano, len.brown, Linux PM list

On Friday, July 26, 2013 11:48:36 PM Rafael J. Wysocki wrote:
> On Friday, July 26, 2013 02:29:40 PM Rik van Riel wrote:
> > On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
> > > On 7/26/2013 11:13 AM, Rik van Riel wrote:
> > >
> > >>
> > >> Could you try running the tests with just the repeat mode
> > >> stuff from commit 69a37bea excluded, but leaving the common
> > >> infrastructure and commit e11538?
> > >>
> > >
> > > personally I think we should go the other way around.
> > > revert the set entirely first, and now, and get our performance back
> > > to what it should be
> > >
> > > and then see what we can add back without causing the regressions.
> > > this may take longer, or be done in steps, and that's ok.
> > >
> > > the end point may well be the same... but we can then evaluate in the right
> > > direction.
> > 
> > Works for me. I have no objection to reverting both patches,
> > if the people planning to fix the code prefer that :)
> 
> OK, I'll queue up the reverts as fixes for 3.11-rc4.

So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
can access at

http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next

However, they are not simple reverts, as we've had some non-trivial changes
on top of those commits already, so I'd appreciate it a lot if somebody
could double-check that I didn't break anything in them.

They are based on top of my master branch for now, but I'll rebase them on
3.11-rc3 when it's out.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-27  0:36         ` Rafael J. Wysocki
@ 2013-07-27  6:22           ` Len Brown
  2013-07-27 12:10             ` Rafael J. Wysocki
  0 siblings, 1 reply; 19+ messages in thread
From: Len Brown @ 2013-07-27  6:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rik van Riel, Jeremy Eder, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Rafael J. Wysocki, youquan.song,
	paulmck, daniel.lezcano, Brown, Len, Linux PM list

>> OK, I'll queue up the reverts as fixes for 3.11-rc4.
>
> So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
> can access at
>
> http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
>
> However, they are not simple reverts as we've had some non-trivial changes on
> top of those commits already, so I'd appreciate it a lot if somebody could
> double check if I didn't break anything in them.

I've verified that the reverts improve netperf TCP_RR performance.

Here I've got two Xeons, slightly different, so I ran in both directions.
I also ran two ways: out of the box, plus with cpufreq
set to max frequency.  The reverts improve all 4 cases:

JKT → IVT                IVT → JKT

 3.11.0-rc2        baseline w/o revert                Out of Box

20420                19963
20658                19915
20298                20320

 3.11.0-rc2        baseline w/o revert                max freq

59658                51427
59663                51503
59416                51343

3.11.0-rc2-00002-g74bce39                        Linux-pm “fixes-next” branch

3.11.0-rc2-00002-g74bce39                Out of box
23227                22056
23306                22125
23387                40226 *
                     21991
* the 40k result saw 2.2 GHz where other runs saw 1.2 GHz

3.11.0-rc2-00002-g74bce39                Max-freq
67240                57645
64880                56764
65924                57435

Tested-by: Len Brown <len.brown@intel.com>

thanks,
Len Brown, Intel Open Source Technology Center


* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 17:33 RFC: revert request for cpuidle patches e11538d1 and 69a37bea Jeremy Eder
  2013-07-26 18:13 ` Rik van Riel
@ 2013-07-27  6:43 ` Daniel Lezcano
  2013-07-30  3:57 ` Youquan Song
  2 siblings, 0 replies; 19+ messages in thread
From: Daniel Lezcano @ 2013-07-27  6:43 UTC (permalink / raw)
  To: Jeremy Eder
  Cc: linux-kernel, rafael.j.wysocki, riel, youquan.song, paulmck,
	arjan, len.brown

On 07/26/2013 07:33 PM, Jeremy Eder wrote:
> Hello,
> 
> We believe we've identified a particular commit to the cpuidle code that
> seems to be impacting the performance of a variety of workloads.  The
> simplest way to reproduce it is the netperf TCP_RR test, so we're using
> that on a pair of Sandy Bridge based servers.  We also have data from a
> large database setup where performance is also measurably and positively
> impacted, though that test data isn't easily shareable.
> 
> Included below are test results from 3 test kernels:

Is the system tickless, or does it use a periodic tick?



> kernel       reverts
> -----------------------------------------------------------
> 1) vanilla   upstream (no reverts)
> 
> 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c
> 
> 3) test      reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
>                      e11538d1f03914eb92af5a1a378375c05ae8520c
> 
> In summary, netperf TCP_RR numbers improve by approximately 4% after
> reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
> 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
> seems to get above 40%.  Taking that patch out gets C0 near 100% quite
> often, and performance increases.
> 
> The below data are histograms representing the %c0 residency @ 1-second
> sample rates (using turbostat), while under netperf test.
> 
> - If you look at the first 4 histograms, you can see %c0 residency almost
>   entirely in the 30,40% bin.
> - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
>   shows %c0 in the 80,90,100% bins.
> 
> Below each kernel name are netperf TCP_RR trans/s numbers for the
> particular kernel that can be disclosed publicly, comparing the 3 test
> kernels.  We ran a 4th test with the vanilla kernel where we've also set
> /dev/cpu_dma_latency=0 to show overall impact boosting single-threaded
> TCP_RR performance over 11% above baseline.
> 
> 3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):  
> TCP_RR trans/s 54323.78
> 
> -----------------------------------------------------------
> 3.10-rc2 vanilla RX (no reverts)
> TCP_RR trans/s 48192.47
> 
> Receiver %c0 
>     0.0000 -    10.0000 [     1]: *
>    10.0000 -    20.0000 [     0]: 
>    20.0000 -    30.0000 [     0]: 
>    30.0000 -    40.0000 [    59]: ***********************************************************
>    40.0000 -    50.0000 [     1]: *
>    50.0000 -    60.0000 [     0]: 
>    60.0000 -    70.0000 [     0]: 
>    70.0000 -    80.0000 [     0]: 
>    80.0000 -    90.0000 [     0]: 
>    90.0000 -   100.0000 [     0]: 
> 
> Sender %c0
>     0.0000 -    10.0000 [     1]: *
>    10.0000 -    20.0000 [     0]: 
>    20.0000 -    30.0000 [     0]: 
>    30.0000 -    40.0000 [    11]: ***********
>    40.0000 -    50.0000 [    49]: *************************************************
>    50.0000 -    60.0000 [     0]: 
>    60.0000 -    70.0000 [     0]: 
>    70.0000 -    80.0000 [     0]: 
>    80.0000 -    90.0000 [     0]: 
>    90.0000 -   100.0000 [     0]: 
> 
> -----------------------------------------------------------
> 3.10-rc2 perfteam2 RX (reverts commit
> e11538d1f03914eb92af5a1a378375c05ae8520c)
> TCP_RR trans/s 49698.69
> 
> Receiver %c0 
>     0.0000 -    10.0000 [     1]: *
>    10.0000 -    20.0000 [     1]: *
>    20.0000 -    30.0000 [     0]: 
>    30.0000 -    40.0000 [    59]: ***********************************************************
>    40.0000 -    50.0000 [     0]: 
>    50.0000 -    60.0000 [     0]: 
>    60.0000 -    70.0000 [     0]: 
>    70.0000 -    80.0000 [     0]: 
>    80.0000 -    90.0000 [     0]: 
>    90.0000 -   100.0000 [     0]: 
> 
> Sender %c0 
>     0.0000 -    10.0000 [     1]: *
>    10.0000 -    20.0000 [     0]: 
>    20.0000 -    30.0000 [     0]: 
>    30.0000 -    40.0000 [     2]: **
>    40.0000 -    50.0000 [    58]: **********************************************************
>    50.0000 -    60.0000 [     0]: 
>    60.0000 -    70.0000 [     0]: 
>    70.0000 -    80.0000 [     0]: 
>    80.0000 -    90.0000 [     0]: 
>    90.0000 -   100.0000 [     0]: 
> 
> -----------------------------------------------------------
> 3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
> e11538d1f03914eb92af5a1a378375c05ae8520c)
> TCP_RR trans/s 47766.95
> 
> Receiver %c0
>     0.0000 -    10.0000 [     1]: *
>    10.0000 -    20.0000 [     1]: *
>    20.0000 -    30.0000 [     0]: 
>    30.0000 -    40.0000 [    27]: ***************************
>    40.0000 -    50.0000 [     2]: **
>    50.0000 -    60.0000 [     0]: 
>    60.0000 -    70.0000 [     2]: **
>    70.0000 -    80.0000 [     0]: 
>    80.0000 -    90.0000 [     0]: 
>    90.0000 -   100.0000 [    28]: ****************************
> 
> Sender:
>     0.0000 -    10.0000 [     1]: *
>    10.0000 -    20.0000 [     0]: 
>    20.0000 -    30.0000 [     0]: 
>    30.0000 -    40.0000 [    11]: ***********
>    40.0000 -    50.0000 [     0]: 
>    50.0000 -    60.0000 [     1]: *
>    60.0000 -    70.0000 [     0]: 
>    70.0000 -    80.0000 [     3]: ***
>    80.0000 -    90.0000 [     7]: *******
>    90.0000 -   100.0000 [    38]: **************************************
> 
> These results demonstrate gaining back the tendency of the CPU to stay in
> more responsive, performant C-states (and thus yield measurably better
> performance), by reverting commit 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.
> 
> While taking into account the changing landscape with regards to CPU
> governors, and both P- and C-states, we think that a single-thread should
> still be able to achieve maximum performance.  With the current upstream
> code base, workloads with a low number of "hot" threads are not able to
> achieve maximum performance "out of the box".
> 
> Also recently, Intel's LAD has posted upstream performance results that
> include an interesting column with their table of results.  See upstream
> commit 0a4db187a999, column #3 within the "Performance numbers" table.  It
> seems known, even within Intel, that the deeper C-states incur a cost too
> high to bear, as they've explicitly tested restricting the CPU to higher
> c-states of C0,1.
> 
> -- Jeremy Eder
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog



* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 18:29     ` Rik van Riel
  2013-07-26 21:48       ` Rafael J. Wysocki
@ 2013-07-27  7:11       ` Daniel Lezcano
  2013-07-29 11:49         ` Arjan van de Ven
  2013-07-29 13:12         ` Arjan van de Ven
  1 sibling, 2 replies; 19+ messages in thread
From: Daniel Lezcano @ 2013-07-27  7:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Arjan van de Ven, Jeremy Eder, linux-kernel, rafael.j.wysocki,
	youquan.song, paulmck, len.brown, Vincent Guittot

On 07/26/2013 08:29 PM, Rik van Riel wrote:
> On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
>> On 7/26/2013 11:13 AM, Rik van Riel wrote:
>>
>>>
>>> Could you try running the tests with just the repeat mode
>>> stuff from commit 69a37bea excluded, but leaving the common
>>> infrastructure and commit e11538?
>>>
>>
>> personally I think we should go the other way around.
>> revert the set entirely first, and now, and get our performance back
>> to what it should be
>>
>> and then see what we can add back without causing the regressions.
>> this may take longer, or be done in steps, and that's ok.
>>
>> the end point may well be the same... but we can then evaluate in the
>> right
>> direction.
> 
> Works for me. I have no objection to reverting both patches,
> if the people planning to fix the code prefer that :)

I am in favor of reverting these patches too, because they don't solve
the root problem of reducing prediction failures.

The menu governor tries to deduce the next wakeup, but based on per-cpu
events.  That means that if a task with a specific behavior is migrated
across cpus, the statistics will be wrong.  IMHO, the menu governor has
too much heuristic in it, and maybe we should have the scheduler pass
more information to the governor and remove some of these heuristics.

Besides that, on ARM the cpus can be coupled, and the timer used to
detect a failed prediction will wake up the entire cluster, making the
power saving less efficient and impacting the statistics of the other
cpus.
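The per-cpu bookkeeping problem can be illustrated with a toy predictor:
each cpu averages only the idle intervals it has observed itself, so the
destination cpu of a migrated task knows nothing about the task's real
period.  A sketch (the simple moving-average window is an assumption; the
real menu governor statistics are more elaborate):

```python
from collections import deque

class PerCpuPredictor:
    """Toy next-idle-interval predictor: mean of the last N intervals
    observed on this CPU only, standing in for per-cpu heuristics."""
    def __init__(self, window=8):
        self.history = deque(maxlen=window)

    def observe(self, interval_us):
        self.history.append(interval_us)

    def predict(self):
        # No history yet: assume a long idle (favors deep C-states).
        if not self.history:
            return float("inf")
        return sum(self.history) / len(self.history)

cpu0, cpu1 = PerCpuPredictor(), PerCpuPredictor()
for _ in range(8):          # a 100 us periodic task runs on cpu0...
    cpu0.observe(100)
for _ in range(8):          # ...while cpu1 only sees long idle stretches
    cpu1.observe(50000)

# If the task now migrates to cpu1, that cpu's prediction is off by
# orders of magnitude, so it will pick a C-state far too deep for the
# task's real 100 us period.
print(cpu0.predict(), cpu1.predict())   # 100.0 50000.0
```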


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog



* Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-27  6:22           ` Len Brown
@ 2013-07-27 12:10             ` Rafael J. Wysocki
  0 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2013-07-27 12:10 UTC (permalink / raw)
  To: Len Brown
  Cc: Rik van Riel, Jeremy Eder, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Rafael J. Wysocki, youquan.song,
	paulmck, daniel.lezcano, Brown, Len, Linux PM list

On Saturday, July 27, 2013 02:22:53 AM Len Brown wrote:
> >> OK, I'll queue up the reverts as fixes for 3.11-rc4.
> >
> > So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
> > can access at
> >
> > http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
> >
> > However, they are not simple reverts as we've had some non-trivial changes on
> > top of those commits already, so I'd appreciate it a lot if somebody could
> > double check if I didn't break anything in them.
> 
> I've verified that the reverts improve netperf TCP_RR performance.
> 
> Here I've got two Xeon's, slightly different, so I run in both directions.
> Also I run two ways -- Out-of-the-box, plus with cpufreq
> set to max frequency.  The reverts improve all 4 cases:
> 
> JKT → IVT                IVT → JKT
> 
>  3.11.0-rc2        baseline w/o revert                Out of Box
> 
> 20420                19963
> 20658                19915
> 20298                20320
> 
>  3.11.0-rc2        baseline w/o revert                max freq
> 
> 59658                51427
> 59663                51503
> 59416                51343
> 
> 3.11.0-rc2-00002-g74bce39                        Linux-pm “fixes-next” branch
> 
> 3.11.0-rc2-00002-g74bce39                Out of box
> 23227                22056
> 23306                22125
> 23387                40226                40k result saw some 2.2 ghz
> vs 1.2 ghz in other runs
>                 21991
> 
> 3.11.0-rc2-00002-g74bce39                Max-freq
> 67240                57645
> 64880                56764
> 65924                57435
> 
> Tested-by: Len Brown <len.brown@intel.com>

Many thanks Len!

Rafael



* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-27  7:11       ` Daniel Lezcano
@ 2013-07-29 11:49         ` Arjan van de Ven
  2013-07-29 13:12         ` Arjan van de Ven
  1 sibling, 0 replies; 19+ messages in thread
From: Arjan van de Ven @ 2013-07-29 11:49 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rik van Riel, Jeremy Eder, linux-kernel, rafael.j.wysocki,
	youquan.song, paulmck, len.brown, Vincent Guittot

>
> Beside that, on ARM the cpus could be coupled and the timer to detect
> the prediction will wake up the entire cluster, making the power saving
> less efficient and impacting the statistics of the other cpu.

an ARM specific governor would be quite appropriate in that case.



* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-27  7:11       ` Daniel Lezcano
  2013-07-29 11:49         ` Arjan van de Ven
@ 2013-07-29 13:12         ` Arjan van de Ven
  2013-07-29 14:14           ` Lorenzo Pieralisi
  1 sibling, 1 reply; 19+ messages in thread
From: Arjan van de Ven @ 2013-07-29 13:12 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rik van Riel, Jeremy Eder, linux-kernel, rafael.j.wysocki,
	youquan.song, paulmck, len.brown, Vincent Guittot

>
> The menu governor tries to deduce the next wakeup but based on events
> per cpu. That means if a task with a specific behavior is migrated
> across cpus, the statistics will be wrong.


btw this is largely a misunderstanding;
tasks are not the issue; tasks use timers, and those are perfectly predictable.
It's interrupts that are not, and the heuristics are for that.

Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
then the menu governor logic is not good for you; rather than looking at the next
timer on the current cpu, you need to look at the earliest timer on the set of
bundled cpus as the upper bound of the next wake event.
And maybe even more special-casing is needed... but I doubt it.
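The bound described here is simple to state in code: for a wake-all bundle,
the earliest per-cpu timer deadline caps how long the whole cluster can
stay idle.  A sketch (the function and data layout are illustrative, not an
existing kernel API):

```python
def cluster_idle_bound(now_us, next_timer_us_per_cpu):
    """Upper bound on how long a wake-all cluster can stay idle: the
    earliest next-timer deadline among all CPUs in the bundle, since
    any one CPU's timer interrupt wakes the whole cluster."""
    earliest = min(next_timer_us_per_cpu)
    return max(earliest - now_us, 0)

# Three bundled CPUs: whichever timer fires first caps the idle window,
# so the governor should size the C-state choice to that bound.
print(cluster_idle_bound(0, [5000, 120, 90000]))   # 120
```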




* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-29 13:12         ` Arjan van de Ven
@ 2013-07-29 14:14           ` Lorenzo Pieralisi
  2013-07-29 14:29             ` Arjan van de Ven
  0 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Pieralisi @ 2013-07-29 14:14 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Daniel Lezcano, Rik van Riel, Jeremy Eder,
	linux-kernel@vger.kernel.org, rafael.j.wysocki@intel.com,
	youquan.song@intel.com, paulmck@linux.vnet.ibm.com,
	len.brown@intel.com, Vincent Guittot

On Mon, Jul 29, 2013 at 02:12:58PM +0100, Arjan van de Ven wrote:
> >
> > The menu governor tries to deduce the next wakeup but based on events
> > per cpu. That means if a task with a specific behavior is migrated
> > across cpus, the statistics will be wrong.
> 
> 
> btw this is largely a misunderstanding;
> tasks are not the issue; tasks use timers and those are perfectly predictable.
> It's interrupts that are not and the heuristics are for that.
> 
> Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
> then the menu governor logic is not good for you; rather than looking at the next
> timer on the current cpu you need to look at the earliest timer on the set of bundled
> cpus as the upper bound of the next wake event.

Yes, that's true and we have to look into this properly, but certainly
a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs
either, or am I missing something?

Even if the wake-up interrupt just powers up one of the CPUs in a package
and leaves the other(s) alone, all HW state shared by those CPUs (i.e.
caches) must be turned on.  What I am asking is: should this bundled next
event concept apply to x86 CPUs too, or is it entirely managed in FW/HW
so that the kernel just should not care?

I still do not understand how this "bundled" next event is managed on
x86 with the menu governor, or rather why it is not managed at all, given
the importance of package C-states.

> And maybe even more special casing is needed... but I doubt it.

I lost you here, can you elaborate please?

Thanks a lot,
Lorenzo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-29 14:14           ` Lorenzo Pieralisi
@ 2013-07-29 14:29             ` Arjan van de Ven
  2013-07-29 16:01               ` Lorenzo Pieralisi
  0 siblings, 1 reply; 19+ messages in thread
From: Arjan van de Ven @ 2013-07-29 14:29 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Daniel Lezcano, Rik van Riel, Jeremy Eder,
	linux-kernel@vger.kernel.org, rafael.j.wysocki@intel.com,
	youquan.song@intel.com, paulmck@linux.vnet.ibm.com,
	len.brown@intel.com, Vincent Guittot

On 7/29/2013 7:14 AM, Lorenzo Pieralisi wrote:
>>
>>
>> btw this is largely a misunderstanding;
>> tasks are not the issue; tasks use timers and those are perfectly predictable.
>> It's interrupts that are not and the heuristics are for that.
>>
>> Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
>> then the menu governor logic is not good for you; rather than looking at the next
>> timer on the current cpu you need to look at the earliest timer on the set of bundled
>> cpus as the upper bound of the next wake event.
>
> Yes, that's true and we have to look into this properly, but certainly
> a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs either,
> or am I missing something?

a CPU core isn't in a package C state, the system is.
(in a core C state the whole core is already powered down completely; a package C state
just also turns off the memory controller/etc)

package C states are global on x86 (not just per package); there's nothing one
can do there in terms of grouping/etc.


> Even if the wake-up interrupts just power up one of the CPUs in a package
> and leave the other(s) alone, all HW state shared by those CPUs (i.e. caches)
> must be turned on. What I am asking is: is this bundled next event a concept
> that should apply to x86 CPUs too, or is it entirely managed in FW/HW so that
> the kernel just should not care?

On Intel x86 CPUs, there's not really a bundled concept; or rather, there is
only one bundle (which amounts to the same thing).
Yes in a multi-package setup there are some cache power effects... but there's
not a lot one can do there.
The other cores don't wake up, so they still make their own correct decisions.

> I still do not understand how this "bundled" next event is managed on
> x86 with the menu governor, or rather why it is not managed at all, given
> the importance of package C-states.

package C states on x86 are basically OS invisible. The OS manages core level C states,
the hardware manages the rest.
The bundle part hurts you on a "one wakes all" system,
not because of package-level power effects, but because the others wake up
prematurely (compared to what they expected), which causes them to think
future wakeups will also be earlier. All because they get the "what is the
next known event" wrong, and start correcting for known events instead of
only for 'unpredictable' interrupts.
Things will go very wonky if you do that for sure.
(I've seen various simulation data on that, and the menu governor indeed acts quite poorly
for that)
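That failure mode can be illustrated with a toy model (this is not the actual
menu governor code; the names and numbers below are made up for the sketch):

```python
# Toy model of an interval predictor: average the last few observed idle
# durations.  On "one wakes all" hardware, spurious early wakeups caused by
# siblings' events enter the history, so the prediction collapses well below
# this CPU's real next-timer distance and deep C-states stop being chosen.
from collections import deque

def predict_us(history):
    """Predict the next idle duration as the mean of recent observations."""
    return sum(history) / len(history)

history = deque(maxlen=8)
own_timer_us = 10000            # this CPU's own timer fires every 10 ms

# Undisturbed, observations match the timer and the prediction converges.
for _ in range(8):
    history.append(own_timer_us)
print(predict_us(history))      # 10000.0

# Now a sibling's interrupt wakes this CPU after ~1 ms each time: the same
# predictor starts "correcting" for events that are not this CPU's at all.
for _ in range(8):
    history.append(1000)
print(predict_us(history))      # 1000.0
```

The point is only that the polluted history, not package power itself, is
what drives the bad state choices.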

>> And maybe even more special casing is needed... but I doubt it.
>
> I lost you here, can you elaborate please?

well.. just looking at the earliest timer might not be enough; that timer might be on a different
core that's still active, and may change after the current cpu has gone into an idle state.
Fun.
Coupled C states on this level are a PAIN in many ways, and tend to totally suck for power
due to this and the general "too much is active" reasons.
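The earliest-timer upper bound discussed above, and why it is fragile, can be
sketched like this (illustrative Python, not kernel code; the helper name and
data layout are invented for the example):

```python
# Sketch (not kernel code): on hardware where one interrupt wakes the whole
# bundle, a CPU picking an idle state must bound its expected sleep by the
# earliest next timer across the bundle, not just its own.  Timers on cores
# that are still active are unreliable: they may be re-armed before idling.
def bundle_sleep_bound_us(next_timer_us, idle):
    """next_timer_us[i]: CPU i's next timer expiry (us from now).
    idle[i]: whether CPU i is already idle (so its timer is settled)."""
    settled = [t for t, is_idle in zip(next_timer_us, idle) if is_idle]
    # A still-active core might arm an earlier timer before it goes idle,
    # so the bound computed from settled timers is only an upper estimate.
    return min(settled) if settled else None

# This CPU's own timer is 50 ms out, but an idle sibling's timer 2 ms out
# caps how long the bundle can realistically stay down.
print(bundle_sleep_bound_us([50000, 2000, 9000], [True, True, False]))  # 2000
```

The third core above is still active, so its 9 ms timer is ignored; that is
exactly the "may change after the current cpu has gone idle" problem.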


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-29 14:29             ` Arjan van de Ven
@ 2013-07-29 16:01               ` Lorenzo Pieralisi
  2013-07-29 16:04                 ` Arjan van de Ven
  0 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Pieralisi @ 2013-07-29 16:01 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Daniel Lezcano, Rik van Riel, Jeremy Eder,
	linux-kernel@vger.kernel.org, rafael.j.wysocki@intel.com,
	youquan.song@intel.com, paulmck@linux.vnet.ibm.com,
	len.brown@intel.com, Vincent Guittot

On Mon, Jul 29, 2013 at 03:29:20PM +0100, Arjan van de Ven wrote:
> On 7/29/2013 7:14 AM, Lorenzo Pieralisi wrote:
> >>
> >>
> >> btw this is largely a misunderstanding;
> >> tasks are not the issue; tasks use timers and those are perfectly predictable.
> >> It's interrupts that are not and the heuristics are for that.
> >>
> >> Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
> >> then the menu governor logic is not good for you; rather than looking at the next
> >> timer on the current cpu you need to look at the earliest timer on the set of bundled
> >> cpus as the upper bound of the next wake event.
> >
> > Yes, that's true and we have to look into this properly, but certainly
> > a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs either,
> > or am I missing something?
> 
> a CPU core isn't in a package C state, the system is.
> (in a core C state the whole core is already powered down completely; a package C state
> just also turns off the memory controller/etc)
> 
> package C states are global on x86 (not just per package); there's nothing one
> can do there in terms of grouping/etc.

So you are saying that package states are system states on x86, right?
Now things are a bit clearer. I was a bit baffled at first when you
mentioned that package C-states allow PM to turn off the DRAM controller;
the concept of a package C-state is a bit misleading and does not much
resemble what cluster states currently are on ARM. That's why I asked in
the first place, thank you.

> > Even if the wake-up interrupts just power up one of the CPUs in a package
> > and leave the other(s) alone, all HW state shared by those CPUs (i.e. caches)
> > must be turned on. What I am asking is: is this bundled next event a concept
> > that should apply to x86 CPUs too, or is it entirely managed in FW/HW so that
> > the kernel just should not care?
> 
> On Intel x86 CPUs, there's not really a bundled concept; or rather, there is
> only one bundle (which amounts to the same thing).
> Yes in a multi-package setup there are some cache power effects... but there's
> not a lot one can do there.

On ARM there is: some basic optimizations like avoiding cleaning caches if
an IRQ is pending or a next event is due shortly, for instance. The
difference is that on ARM cache management is done in SW, under kernel or
FW control (which in a way is closer to what x86 does, even though I think
on x86 most of the power control is offloaded to HW).

Given what you said above, I understand that even on a multi-package system,
package C-states are global, not per package. That's a pivotal point.

> The other cores don't wake up, so they still make their own correct decisions.
> 
> > I still do not understand how this "bundled" next event is managed on
> > x86 with the menu governor, or rather why it is not managed at all, given
> > the importance of package C-states.
> 
> package C states on x86 are basically OS invisible. The OS manages core level C states,
> the hardware manages the rest.
> The bundle part hurts you on a "one wakes all" system,
> not because of package-level power effects, but because the others wake up
> prematurely (compared to what they expected), which causes them to think
> future wakeups will also be earlier. All because they get the "what is the
> next known event" wrong, and start correcting for known events instead of
> only for 'unpredictable' interrupts.

Well, reality is a bit more complex and probably less drastic; cores that are
woken up spuriously can be put back to sleep without going through the governor
again. But your point is taken.

> Things will go very wonky if you do that for sure.
> (I've seen various simulation data on that, and the menu governor indeed acts quite poorly
> for that)

That's something we have been benchmarking, yes.

> >> And maybe even more special casing is needed... but I doubt it.
> >
> > I lost you here, can you elaborate please?
> 
> well.. just looking at the earliest timer might not be enough; that timer might be on a different
> core that's still active, and may change after the current cpu has gone into an idle state.

Yes, I have worked on this, and certainly next events must be tied to
the state a core is in; the bare next event does not help.

> Fun.

Big :D

> Coupled C states on this level are a PAIN in many ways, and tend to totally suck for power
> due to this and the general "too much is active" reasons.

I think the trend is moving towards core gating, which closely resembles
what the x86 world does today. Still, the interaction between the menu
governor and cluster states has to be characterized, and that's what we are
doing at the moment.

Thank you very much,
Lorenzo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-29 16:01               ` Lorenzo Pieralisi
@ 2013-07-29 16:04                 ` Arjan van de Ven
  0 siblings, 0 replies; 19+ messages in thread
From: Arjan van de Ven @ 2013-07-29 16:04 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Daniel Lezcano, Rik van Riel, Jeremy Eder,
	linux-kernel@vger.kernel.org, rafael.j.wysocki@intel.com,
	youquan.song@intel.com, paulmck@linux.vnet.ibm.com,
	len.brown@intel.com, Vincent Guittot

On 7/29/2013 9:01 AM, Lorenzo Pieralisi wrote:
>
>> Coupled C states on this level are a PAIN in many ways, and tend to totally suck for power
>> due to this and the general "too much is active" reasons.
>
> I think the trend is moving towards core gating, which closely resembles
> what the x86 world does today. Still, the interaction between the menu
> governor and cluster states has to be characterized, and that's what we are
> doing at the moment.

I suspect that we (Linux) need a different governor than menu for the coupled case.
Fundamentally you're going to need different algorithms that are much closer tied
to the behavior of the specific hardware; the current menu governor uses an abstract
and simple hardware model (fully independent).
For specific hardware that deviates from the model, something different is clearly needed.

It's an open question for me whether just the predictor is the part that
you split out, or whether the whole governor needs to be split out.
(But frankly, if you take the predictor out of the menu governor, not a
whole lot is left.)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-30  3:57 ` Youquan Song
@ 2013-07-29 16:59   ` Jeremy Eder
  2013-08-02 18:19     ` Jeremy Eder
  0 siblings, 1 reply; 19+ messages in thread
From: Jeremy Eder @ 2013-07-29 16:59 UTC (permalink / raw)
  To: Youquan Song
  Cc: Jeremy Eder, linux-kernel, rafael.j.wysocki, riel, youquan.song,
	paulmck, daniel.lezcano, arjan, len.brown

On 130729 23:57:31, Youquan Song wrote:
> Hi Jeremy,
> 
> I tried to reproduce your result so that I could then fix the issue, but I
> have not reproduced it yet.
> 
> I ran netperf-2.6.0 with one machine as the server (netserver) and the
> other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60.  The target
> machine is used as both client and server.  I do not reproduce the
> performance drop issue.  I also notice the result is not stable: sometimes
> it is high, sometimes it is low.  In summary, it is hard to get a
> definitive result.
> 
> Can you tell me how to reproduce the issue, and how you got the C0 data?
> 
> What's your config for kernel?  Do you enable CONFIG_NO_HZ_FULL=y or
> only CONFIG_NO_HZ=y?
> 
> 
> Thanks
> -Youquan 

Hi,

To answer both your and Daniel's question, those results used only
CONFIG_NO_HZ=y.

These network latency benchmarks are fickle creatures, and need careful
tuning to become reproducible.  Plus there are BIOS implications and tuning
varies by vendor.

Anyway, for the most part it's probably not stable because, in order to get
any sort of reproducibility between runs, you need to do at least these
steps:

- ensure as little is running in userspace as possible
- determine PCI affinity for the NIC
- on both machines, isolate the socket connected to the NIC from userspace
  tasks
- Turn off irqbalance and bind all IRQs for that NIC to a single core on
  the same socket as the NIC
- run netperf with -TX,Y where X,Y are core numbers that you wish
  netperf/netserver to run on, respectively.

For example, if your NIC is attached to socket 0 and socket 0 cores are
enumerated 0-7, then:

- set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
  NIC.
- nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2

That should get you most of the way there.  The -s 2 option connects and
then waits 2 seconds; I found this helps with the first few seconds' worth
of data.  Or you could just toss the first 2 seconds' worth, since it seems
to take that long to stabilize.  What I mean is: if you're not using the
-D1,1 option to netperf, you might not have seen that netperf tests seem to
take a few seconds to stabilize even when properly tuned.
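The steps above can be sketched as a script.  The IRQ vectors and core
numbers here are hypothetical (take real ones from /proc/interrupts and your
NIC's socket topology); it prints the commands instead of running them, so
it can be reviewed and adapted before use:

```shell
#!/bin/sh
# Sketch of the netperf reproducibility tuning above (hypothetical values).
NIC_IRQS="64 65 66 67"     # example vectors for the NIC, per /proc/interrupts
IRQ_CORE=6                 # core on the NIC's socket that takes the NIC IRQs
NETPERF_CORE=4             # core to pin netperf/netserver on

# Stop irqbalance so the affinity settings below stick.
echo "systemctl stop irqbalance"

# Bind every vector of the NIC to a single core on the NIC's socket.
for irq in $NIC_IRQS; do
    echo "echo $IRQ_CORE > /proc/irq/$irq/smp_affinity_list"
done

# Pin netperf to its own core and skip the first 2 seconds of warm-up.
echo "nice -20 netperf -t TCP_RR -H \$SERVER_IP -l 60" \
     "-T$NETPERF_CORE,$NETPERF_CORE -s 2"
```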

I got the C0 data by running turbostat in parallel with each benchmark run,
then grabbing the C-state data for the cores relevant to the test.  In my
case that was cores 4 and 6, where core 4 was where I put netperf/netserver
and core 6 was where I put the NIC IRQs.  Then I parsed that output into a
format that this could interpret:

https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
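The parsing step can be sketched like this (an assumption-laden sketch, not
the script I actually used: turbostat column names and layout vary between
versions, and this assumes a header line containing "CPU" and "%c0"; it
emits one sample per line, the format histogram.py reads on stdin):

```python
# Pull the %c0 column for the cores under test out of turbostat output.
def c0_samples(turbostat_lines, cores={4, 6}):
    """Yield %c0 values for the given cores from turbostat text output."""
    header = None
    for line in turbostat_lines:
        fields = line.split()
        if not fields:
            continue
        if "CPU" in fields and "%c0" in fields:
            header = fields          # remember the column layout
            continue
        if header and len(fields) == len(header):
            row = dict(zip(header, fields))
            # Skip summary rows (CPU column "-") and cores not under test.
            if row["CPU"].isdigit() and int(row["CPU"]) in cores:
                yield float(row["%c0"])

sample = """CPU %c0 GHz TSC
4 38.1 3.2 2.9
5 1.2 3.2 2.9
6 41.7 3.2 2.9
""".splitlines()
print(list(c0_samples(sample)))      # [38.1, 41.7]
```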

I'm building a kernel from Rafael's tree and will try to confirm what Len
already sent.  Thanks everyone for looking into it.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-26 17:33 RFC: revert request for cpuidle patches e11538d1 and 69a37bea Jeremy Eder
  2013-07-26 18:13 ` Rik van Riel
  2013-07-27  6:43 ` Daniel Lezcano
@ 2013-07-30  3:57 ` Youquan Song
  2013-07-29 16:59   ` Jeremy Eder
  2 siblings, 1 reply; 19+ messages in thread
From: Youquan Song @ 2013-07-30  3:57 UTC (permalink / raw)
  To: Jeremy Eder
  Cc: linux-kernel, rafael.j.wysocki, riel, youquan.song, youquan.song,
	paulmck, daniel.lezcano, arjan, len.brown

Hi Jeremy,

I tried to reproduce your result so that I could then fix the issue, but I
have not reproduced it yet.

I ran netperf-2.6.0 with one machine as the server (netserver) and the
other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60.  The target
machine is used as both client and server.  I do not reproduce the
performance drop issue.  I also notice the result is not stable: sometimes
it is high, sometimes it is low.  In summary, it is hard to get a
definitive result.

Can you tell me how to reproduce the issue, and how you got the C0 data?

What's your config for kernel?  Do you enable CONFIG_NO_HZ_FULL=y or
only CONFIG_NO_HZ=y?


Thanks
-Youquan 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea
  2013-07-29 16:59   ` Jeremy Eder
@ 2013-08-02 18:19     ` Jeremy Eder
  0 siblings, 0 replies; 19+ messages in thread
From: Jeremy Eder @ 2013-08-02 18:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: rafael.j.wysocki, riel, youquan.song, paulmck, daniel.lezcano,
	arjan, len.brown

On 130729 12:59:47, Jeremy Eder wrote:
> On 130729 23:57:31, Youquan Song wrote:
> > Hi Jeremy,
> > 
> > I tried to reproduce your result so that I could then fix the issue, but I
> > have not reproduced it yet.
> > 
> > I ran netperf-2.6.0 with one machine as the server (netserver) and the
> > other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60.  The target
> > machine is used as both client and server.  I do not reproduce the
> > performance drop issue.  I also notice the result is not stable: sometimes
> > it is high, sometimes it is low.  In summary, it is hard to get a
> > definitive result.
> > 
> > Can you tell me how to reproduce the issue, and how you got the C0 data?
> > 
> > What's your config for kernel?  Do you enable CONFIG_NO_HZ_FULL=y or
> > only CONFIG_NO_HZ=y?
> > 
> > 
> > Thanks
> > -Youquan 
> 
> Hi,
> 
> To answer both your and Daniel's question, those results used only
> CONFIG_NO_HZ=y.
> 
> These network latency benchmarks are fickle creatures, and need careful
> tuning to become reproducible.  Plus there are BIOS implications and tuning
> varies by vendor.
> 
> Anyway, for the most part it's probably not stable because, in order to get
> any sort of reproducibility between runs, you need to do at least these
> steps:
> 
> - ensure as little is running in userspace as possible
> - determine PCI affinity for the NIC
> - on both machines, isolate the socket connected to the NIC from userspace
>   tasks
> - Turn off irqbalance and bind all IRQs for that NIC to a single core on
>   the same socket as the NIC
> - run netperf with -TX,Y where X,Y are core numbers that you wish
>   netperf/netserver to run on, respectively.
> 
> For example, if your NIC is attached to socket 0 and socket 0 cores are
> enumerated 0-7, then:
> 
> - set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
>   NIC.
> - nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
> 
> That should get you most of the way there.  The -s 2 option connects and
> then waits 2 seconds; I found this helps with the first few seconds' worth
> of data.  Or you could just toss the first 2 seconds' worth, since it seems
> to take that long to stabilize.  What I mean is: if you're not using the
> -D1,1 option to netperf, you might not have seen that netperf tests seem to
> take a few seconds to stabilize even when properly tuned.
> 
> I got the C0 data by running turbostat in parallel with each benchmark run,
> then grabbing the C-state data for the cores relevant to the test.  In my
> case that was cores 4 and 6, where core 4 was where I put netperf/netserver
> and core 6 was where I put the NIC IRQs.  Then I parsed that output into a
> format that this could interpret:
> 
> https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
> 
> I'm building a kernel from Rafael's tree and will try to confirm what Len
> already sent.  Thanks everyone for looking into it.


Hi, sorry for the delay.  In addition to the results I initially posted,
the below results confirm my initial data, plus what Len sent:

3.11-rc2 w/reverts
TCP_RR trans/s 54454.13

3.11-rc2 w/reverts + c0 lock
TCP_RR trans/s 55088.11

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-08-02 18:19 UTC | newest]

Thread overview: 19+ messages
-- links below jump to the message on this page --
2013-07-26 17:33 RFC: revert request for cpuidle patches e11538d1 and 69a37bea Jeremy Eder
2013-07-26 18:13 ` Rik van Riel
2013-07-26 18:27   ` Arjan van de Ven
2013-07-26 18:29     ` Rik van Riel
2013-07-26 21:48       ` Rafael J. Wysocki
2013-07-27  0:36         ` Rafael J. Wysocki
2013-07-27  6:22           ` Len Brown
2013-07-27 12:10             ` Rafael J. Wysocki
2013-07-27  7:11       ` Daniel Lezcano
2013-07-29 11:49         ` Arjan van de Ven
2013-07-29 13:12         ` Arjan van de Ven
2013-07-29 14:14           ` Lorenzo Pieralisi
2013-07-29 14:29             ` Arjan van de Ven
2013-07-29 16:01               ` Lorenzo Pieralisi
2013-07-29 16:04                 ` Arjan van de Ven
2013-07-27  6:43 ` Daniel Lezcano
2013-07-30  3:57 ` Youquan Song
2013-07-29 16:59   ` Jeremy Eder
2013-08-02 18:19     ` Jeremy Eder
