* PostgreSQL pgbench performance regression in 2.6.23+
@ 2008-05-21 17:34 Greg Smith
2008-05-22 7:10 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Greg Smith @ 2008-05-21 17:34 UTC (permalink / raw)
To: lkml
[-- Attachment #1: Type: TEXT/PLAIN, Size: 8306 bytes --]
PostgreSQL ships with a simple database benchmarking tool named pgbench,
in what's labeled the contrib section (in many distributions it's a
separate package from the main server/client ones). I see there's been
some work done already improving how the PostgreSQL server works under the
new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread).
I wanted to provide you a different test case using pgbench that has taken
a sharp dive starting with 2.6.23, and the server improvement changes in
2.6.25 actually made this problem worse.
I think it will be easy for someone else to replicate my results and I'll
go over the exact procedure below. To start with a view of how bad the
regression is, here's a summary of the results on one system, an AMD X2
4600+ running at 2.4GHz, with a few interesting kernels. I threw in
results from Solaris 10 on this system as a nice independant reference
point. The numbers here are transactions/second (TPS) running a simple
read-only test over a 160MB data set, I took the median from 3 test runs:
Clients 2.6.9 2.6.22 2.6.24 2.6.25 Solaris
1 11173 11052 10526 10700 9656
2 18035 16352 14447 10370 14518
3 19365 15414 17784 9403 14062
4 18975 14290 16832 8882 14568
5 18652 14211 16356 8527 15062
6 17830 13291 16763 9473 15314
8 15837 12374 15343 9093 15164
10 14829 11218 10732 9057 14967
15 14053 11116 7460 7113 13944
20 13713 11412 7171 7017 13357
30 13454 11191 7049 6896 12987
40 13103 11062 7001 6820 12871
50 12311 11255 6915 6797 12858
That's the CentOS 4 2.6.9 kernel there, while the rest are stock ones I
compiled with a minimum of fiddling from the defaults (just adding support
for my SATA RAID card). You can see a major drop with the recent kernels
at high client loads, and the changes in 2.6.25 seem to have really hurt
even the low client count ones.
The other recent hardware I have here, an Intel Q6600 based system, gives
even more maddening results. On successive benchmark runs, you can watch
it break down only sometimes once you get just above 8 clients. At 10 and
15 clients, when I run it a few times, I'll sometimes get results in the
good 25-30K TPS range, while others will give the 10K slow case. It's not
a smooth drop off like in the AMD case, the results from 10-15 are really
unstable. I've attached some files with 5 quick runs at each client load
so you can see what I'm talking about. On that system I was also able to
test 2.6.26-rc2 which doesn't look all that different from 2.6.25.
All these results are running everything on the server using the default
local sockets-based interface, which is relevant in the real world because
that's how a web app hosted on the same system will talk to the database.
If I switch to connecting to the database over TCP/IP and run the pgbench
client on another system, the extra latency drops the single client case
to ~3100TPS. But the high client load cases are great--about 26K TPS at
50 clients. That result is attached as q6600-remote-2.6.25.txt, the
remote client was running 2.6.20. Since recent PostgreSQL results were
also fine with sysbench as the benchmark driver, this suggests the problem
here is actually related to the pgbench client itself and how it gets
scheduled relative to the server backends, rather than being inherent to
the server.
Replicating the test results
----------------------------
Onto replicating my results, which I hope works because I don't have too
much time to test potential fixed kernels myself (I was supposed to be
working on the PostgreSQL code until this sidetracked me). I'll assume
you can get the basic database going, if anybody needs help with that let
me know. There is one server tunable that needs to be adjusted before you
can get useful PostgreSQL benchmarks from this (and many other) tests.
In the root of the database directory, there will be a file named
postgresql.conf. Edit that and changed the setting for the shared_buffers
parameter to 256MB to mimic my test setup. You may need to bump up shmmax
(this is the one list where I'm happy I don't have to explain what that
means!). Restart the server and check the logs to make sure it came back
up, if shmmax is too low it will just tell you how big it needs to be and
not start.
Now the basic procedure to run this test is:
-dropdb pgbench (if it's already there)
-createdb pgbench
-pgbench -i -s 10 pgbench (makes about a 160MB database)
-pgbench -S -c <clients> -t 10000 pgbench
The idea is that you'll have a large enough data set to not fit in L2
cache, but small enough that it all fits in PostgreSQL's dedicated memory
(shared_buffers) so that it never has to ask the kernel to read a block.
The "pgbench -i" initialization step will populate the server's memory and
while that's all written to disk, it should stay in memory afterwards as
well. That's why I use this as a general CPU/L2/memory test as viewed
from a PostgreSQL context, and as you can see from my results with this
problem it's pretty sensitive to whether your setup is optimal or not.
To make this easier to run against a range of client loads, I've attached
a script (selecttest.sh) that does the last two steps in the above.
That's what I used to generate all the results I've attached. If you've
got the database setup such that you can run the psql client and pgbench
is in your path, you should just be able to run that script and have it
give you a set of results in a couple of minutes. You can adjust which
client loads and how many times it runs each by editing the script.
Addendum: how pgbench works
----------------------------
pgbench works off "command scripts", which are a series of SQL commands
with some extra benchmarking features implemented as a really simple
programming language. For example, the SELECT-only test run above, what
you get when passing -S to pgbench, is implemented like this:
\\set naccounts 100000 * :scale
\\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;
Here :scale is detected automatically by doing a count of a table in the
database.
The pgbench client runs as a single process. When pgbench starts, it
iterates over each client, parsing the script until it hits a line that
needs to be sent to the server. At that point, it issues that command as
an asynchronous request, then returns to the main loop. Once every client
is primed with a command, it enters a loop where it just waits for
responses from them.
The main loop has all the open client connections in a fd_set. Each time
a select() on that set says there's been a response to at least one of the
clients from the server, it sweeps through all the clients and feeds the
next script line to any that are ready for one. This proceeds until the
target transaction count is reached.
This design is recognized as being only useful for smallish client loads.
The results start dropping off very hard even on a fast machine with >100
simulated clients as the single pgbench process struggles to respond to
everyone who is ready on each pass through all the clients who got
responses. This makes pgbench particularly unsuitable for testing on
systems with a large number of CPUs. I find pgbench just can't keep up
with the useful number of clients possible somewhere between 8 and 16
cores. I'm hoping the PostgreSQL community can rewrite it in a more
efficient way before the next release comes out now that such hardware is
starting to show up more running this database. If that's the only way to
resolve the issue outlined in this message, that's not intolerable, but a
kernel fix would obviously be better.
I wanted to submit this here regardless because I'd really like for
current versions to not have a big regression just because they were using
a newer kernel, and it provides an interesting scheduler test case to add
to the mix. The fact that earlier Linux kernels and alternate ones like
Solaris give pretty consistant results here says this programming approach
isn't impossible for a kernel to support well, I just don't think this
specific type of load has been considered in the test cases for the new
scheduler yet.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
[-- Attachment #2: Type: TEXT/PLAIN, Size: 503 bytes --]
#!/bin/bash
uname -pr
SCALE=10
TOTTRANS=100000
SETTIMES=3
SETCLIENTS="1 2 3 4 5 6 8 10 15 20 30 40 50"
TESTDB="pgbench"
pgbench -i -s $SCALE $TESTDB > /dev/null 2>&1
for C in $SETCLIENTS; do
T=1
while [ $T -le "$SETTIMES" ]; do
TRANS=`expr $TOTTRANS / $C`
pgbench -S -n -c $C -t $TRANS $TESTDB > results.txt
TPS=`grep "(including connections establishing)" results.txt | cut -d " " -f 3`
echo $C $TPS
T=$(( $T + 1))
done
done
rm -f results.txt
[-- Attachment #3: Type: TEXT/PLAIN, Size: 711 bytes --]
2.6.25 x86_64 (on server; client run on remote host with kernel 2.6.20-16)
1 3057.844199
1 3092.482163
1 3121.953364
2 5727.142659
2 5908.297317
2 5926.888628
3 9363.477540
3 9433.084801
3 9431.190712
4 13004.533641
4 12895.343840
4 12949.625568
5 15874.535293
5 16215.776199
5 15909.425730
6 18579.074963
6 18712.558182
6 18453.177986
8 20867.107616
8 20611.982116
8 20808.939187
10 22629.902429
10 22739.298715
10 22212.577028
15 26653.026061
15 25672.065614
15 26483.221996
20 27557.045841
20 26237.814831
20 28956.575850
30 23166.785331
30 26702.258997
30 28583.974107
40 27541.904319
40 25891.167513
40 26476.592971
50 26434.081991
50 25637.140628
50 26099.091465
[-- Attachment #4: Type: TEXT/PLAIN, Size: 1064 bytes --]
2.6.25 x86_64
1 10330.660688
1 11271.754910
1 11282.125571
1 11256.340415
1 11325.051399
2 12504.737733
2 12588.134248
2 12441.831328
2 12447.620413
2 12593.846193
3 12628.665286
3 12766.801694
3 12797.020210
3 12959.703085
3 12905.702894
4 13958.284828
4 13977.373428
4 14109.186195
4 13034.869580
4 13005.338692
5 11994.961157
5 14766.047482
5 14344.018623
5 12404.053099
5 12007.859384
6 10916.289994
6 12145.067460
6 12109.840159
6 9693.585149
6 12180.340072
8 10810.231149
8 10837.233744
8 10799.744867
8 10839.094402
8 10816.589793
10 10655.716568
10 10643.532452
10 10609.845427
10 10615.836344
10 10645.945965
15 10277.499687
15 10207.888097
15 10193.409730
15 10217.082607
15 10207.900603
20 9719.168513
20 9715.113997
20 9718.205094
20 9701.906027
20 9690.018254
30 8899.177367
30 8844.672113
30 8868.549891
30 8879.713057
30 8884.936474
40 8361.219394
40 8350.369479
40 8363.908997
40 8348.133182
40 8344.822067
50 8095.186440
50 8095.049481
50 8131.078184
50 8096.018127
50 8090.840723
[-- Attachment #5: Type: TEXT/PLAIN, Size: 1069 bytes --]
2.6.24.4 x86_64
1 11421.820154
1 11431.670391
1 11449.594192
1 11427.799562
1 11468.476484
2 14325.863542
2 14437.174685
2 14402.338248
2 14799.436556
2 14772.314319
3 19668.805474
3 19535.175389
3 19354.685119
3 19295.420668
3 19336.724384
4 22103.545829
4 22602.537542
4 21865.331424
4 21178.368668
4 22424.647019
5 26270.300375
5 26614.721827
5 26678.889155
5 27197.844190
5 25774.059440
6 27238.368411
6 27730.210861
6 27489.568666
6 28347.088836
6 27122.737466
8 27632.278480
8 28796.070834
8 29232.842514
8 28681.952426
8 28562.876030
10 31189.910688
10 30459.861670
10 30330.180410
10 30726.648362
10 10902.279165
15 10447.387234
15 25295.659944
15 10375.324430
15 10355.221697
15 11314.860580
20 9897.298701
20 9892.404276
20 9868.676534
20 8911.663139
20 9879.533903
30 9018.739658
30 9052.746303
30 9018.794160
30 9009.324773
30 9272.859955
40 8501.766072
40 8538.091714
40 8476.846342
40 8664.056995
40 8490.264553
50 8192.826361
50 8218.880626
50 8225.086398
50 8221.221900
50 8343.573679
[-- Attachment #6: Type: TEXT/PLAIN, Size: 1085 bytes --]
2.6.22.19 x86_64
1 7623.484051
1 7625.915300
1 7589.468641
1 7570.584916
1 7652.315514
2 17702.824804
2 17369.699463
2 17222.642263
2 17593.340147
2 15637.517344
3 26377.325613
3 19256.513966
3 26813.207675
3 28777.228927
3 29432.081702
4 22640.938711
4 27589.357791
4 21602.130661
4 20272.457778
4 28949.123652
5 25815.538683
5 24871.847804
5 26238.117740
5 25570.425042
5 24551.637987
6 23901.788403
6 25105.699222
6 26229.009396
6 25517.620111
6 21909.853124
8 23674.903797
8 25231.645429
8 25255.745998
8 23869.783647
8 23818.807473
10 21703.371771
10 23839.408211
10 23185.911127
10 23665.093490
10 24717.906888
15 23421.502246
15 23403.340506
15 23329.025587
15 22730.765349
15 23207.747521
20 22480.887312
20 22635.157923
20 22511.885150
20 22223.832215
20 16553.580879
30 19407.089071
30 21718.108980
30 20645.888631
30 21650.537929
30 21993.984923
40 20098.232119
40 19562.630446
40 20236.880784
40 19181.712002
40 20835.781538
50 19043.951727
50 19859.900319
50 18122.228998
50 19467.880528
50 19921.715626
[-- Attachment #7: Type: TEXT/PLAIN, Size: 1084 bytes --]
2.6.26-rc2 x86_64
1 11023.139112
1 11039.151787
1 10961.233297
1 11006.943841
1 11034.116274
2 11588.185104
2 11412.046785
2 11636.440818
2 11519.910495
2 10110.350431
3 12812.004251
3 13580.622648
3 13379.527058
3 13303.612765
3 13251.912767
4 13281.604142
4 13800.818582
4 12847.651013
4 12579.769893
4 12669.317510
5 13070.632785
5 12503.529121
5 12653.504407
5 12442.387082
5 11895.378717
6 12256.322309
6 12228.701519
6 12628.954679
6 12203.115311
6 12610.640729
8 11754.685359
8 11421.719702
8 10237.187443
8 11729.049572
8 11575.933726
10 11582.762487
10 11567.416116
10 11612.546009
10 11580.299826
10 11511.407517
15 11301.027547
15 11211.228675
15 11270.000193
15 11164.906529
15 11120.390151
20 11222.653060
20 10847.887130
20 11343.419297
20 11158.649437
20 11307.272182
30 10302.870811
30 10092.840200
30 10404.836485
30 10153.170822
30 10633.193429
40 10368.266304
40 10017.006874
40 9682.031437
40 10166.772689
40 10413.682496
50 9994.610906
50 9333.995176
50 9426.160782
50 9845.708881
50 10018.081636
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-21 17:34 PostgreSQL pgbench performance regression in 2.6.23+ Greg Smith
@ 2008-05-22 7:10 ` Mike Galbraith
2008-05-22 8:28 ` Dhaval Giani
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-22 7:10 UTC (permalink / raw)
To: Greg Smith; +Cc: lkml, Peter Zijlstra, Ingo Molnar
On Wed, 2008-05-21 at 13:34 -0400, Greg Smith wrote:
> PostgreSQL ships with a simple database benchmarking tool named pgbench,
> in what's labeled the contrib section (in many distributions it's a
> separate package from the main server/client ones). I see there's been
> some work done already improving how the PostgreSQL server works under the
> new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread).
> I wanted to provide you a different test case using pgbench that has taken
> a sharp dive starting with 2.6.23, and the server improvement changes in
> 2.6.25 actually made this problem worse.
>
> I think it will be easy for someone else to replicate my results and I'll
> go over the exact procedure below.
Yup, I can reproduce. Running the test with 2.6.25.4, everything is
waking/running on one CPU, leaving my box 75% idle. Not good.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 7:10 ` Mike Galbraith
@ 2008-05-22 8:28 ` Dhaval Giani
2008-05-22 9:05 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Dhaval Giani @ 2008-05-22 8:28 UTC (permalink / raw)
To: Mike Galbraith
Cc: Greg Smith, lkml, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri
On Thu, May 22, 2008 at 09:10:07AM +0200, Mike Galbraith wrote:
>
> On Wed, 2008-05-21 at 13:34 -0400, Greg Smith wrote:
> > PostgreSQL ships with a simple database benchmarking tool named pgbench,
> > in what's labeled the contrib section (in many distributions it's a
> > separate package from the main server/client ones). I see there's been
> > some work done already improving how the PostgreSQL server works under the
> > new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread).
> > I wanted to provide you a different test case using pgbench that has taken
> > a sharp dive starting with 2.6.23, and the server improvement changes in
> > 2.6.25 actually made this problem worse.
> >
> > I think it will be easy for someone else to replicate my results and I'll
> > go over the exact procedure below.
>
> Yup, I can reproduce. Running the test with 2.6.25.4, everything is
> waking/running on one CPU, leaving my box 75% idle. Not good.
>
Can you try with 2.6.26-rc? There is minimal load balancing for group
scheduling till 25, which might explain the lack of scalability.
--
regards,
Dhaval
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 8:28 ` Dhaval Giani
@ 2008-05-22 9:05 ` Mike Galbraith
2008-05-22 10:34 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-22 9:05 UTC (permalink / raw)
To: Dhaval Giani
Cc: Greg Smith, lkml, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 13:58 +0530, Dhaval Giani wrote:
> On Thu, May 22, 2008 at 09:10:07AM +0200, Mike Galbraith wrote:
> >
> > On Wed, 2008-05-21 at 13:34 -0400, Greg Smith wrote:
> > > PostgreSQL ships with a simple database benchmarking tool named pgbench,
> > > in what's labeled the contrib section (in many distributions it's a
> > > separate package from the main server/client ones). I see there's been
> > > some work done already improving how the PostgreSQL server works under the
> > > new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread).
> > > I wanted to provide you a different test case using pgbench that has taken
> > > a sharp dive starting with 2.6.23, and the server improvement changes in
> > > 2.6.25 actually made this problem worse.
> > >
> > > I think it will be easy for someone else to replicate my results and I'll
> > > go over the exact procedure below.
> >
> > Yup, I can reproduce. Running the test with 2.6.25.4, everything is
> > waking/running on one CPU, leaving my box 75% idle. Not good.
> >
>
> Can you try with 2.6.26-rc? There is minimal load balancing for group
> scheduling till 25, which might explain the lack of scalability.
I'm playing with it now, it's tweakable with migration cost. This
testcase is funky. It can't generate enough work to keep CPUs busy for
spit, and can't saturate my little quad with any kernel I've tried.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 9:05 ` Mike Galbraith
@ 2008-05-22 10:34 ` Mike Galbraith
2008-05-22 11:25 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-22 10:34 UTC (permalink / raw)
To: Dhaval Giani
Cc: Greg Smith, lkml, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 11:05 +0200, Mike Galbraith wrote:
> On Thu, 2008-05-22 at 13:58 +0530, Dhaval Giani wrote:
> > Can you try with 2.6.26-rc? There is minimal load balancing for group
> > scheduling till 25, which might explain the lack of scalability.
>
> I'm playing with it now, it's tweakable with migration cost. This
> testcase is funky. It can't generate enough work to keep CPUs busy for
> spit, and can't saturate my little quad with any kernel I've tried.
Heh, watch this. No tweaking.
(Nadia's ipc/idr patches are applied though, to see if the high end
improves over previous runs with various kernels, and it does seem to.)
2.6.26-smp x86_64
1 10014.774797
2 9791.395302
3 10575.369296
4 9763.183251
5 10160.274262
6 9893.174179
8 9566.978464
10 10294.456456
15 9444.100540
20 9137.878618
30 8277.795499
40 7925.824428
50 7646.644285
nail postgres to CPUs1-3
nail pgbench to CPU0
2.6.26-smp x86_64
1 10900.959982
2 15976.870604
3 24661.322669
4 25347.141780
5 25893.815676
6 26756.414839
8 25399.018582
10 26172.878669
15 25542.082746
20 25090.381828
30 24270.301103
40 23405.867336
50 21926.223083
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 10:34 ` Mike Galbraith
@ 2008-05-22 11:25 ` Mike Galbraith
2008-05-22 11:44 ` Peter Zijlstra
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-22 11:25 UTC (permalink / raw)
To: Dhaval Giani
Cc: Greg Smith, lkml, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 12:34 +0200, Mike Galbraith wrote:
> (Nadia's ipc/idr patches are applied though, to see if the high end
> improves over previous runs with various kernels, and it does seem to.)
Disregard the above, no they don't. (now removed again)
However. The problem with 2.6.26.git running this testcase appears to
be SYNC_WAKEUPS. No taskset, nada except echo 863 > sched_features
2.6.26.git
1 8173.538610
2 15738.206889
3 23399.356839
4 21401.182501
5 21682.839897
6 26396.301413
8 29910.334798
10 29953.625797
15 29535.740343
20 28950.900431
30 27159.733949
40 24163.344207
50 23258.496794
vs
2.6.22.17-0.1-default (opensuse 10.3 stock kernel)
1 7693.501369
2 15669.304960
3 25340.818410
4 24445.932930
5 22807.019544
6 24051.387364
8 22406.392813
10 22631.510576
15 21225.243584
20 20382.232075
30 18834.814588
40 17799.906622
50 17305.274561
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 11:25 ` Mike Galbraith
@ 2008-05-22 11:44 ` Peter Zijlstra
2008-05-22 12:09 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-05-22 11:44 UTC (permalink / raw)
To: Mike Galbraith
Cc: Dhaval Giani, Greg Smith, lkml, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 13:25 +0200, Mike Galbraith wrote:
> On Thu, 2008-05-22 at 12:34 +0200, Mike Galbraith wrote:
>
> > (Nadia's ipc/idr patches are applied though, to see if the high end
> > improves over previous runs with various kernels, and it does seem to.)
>
> Disregard the above, no they don't. (now removed again)
>
> However. The problem with 2.6.26.git running this testcase appears to
> be SYNC_WAKEUPS. No taskset, nada except echo 863 > sched_features
>
> 2.6.26.git
> 1 8173.538610
> 2 15738.206889
> 3 23399.356839
> 4 21401.182501
> 5 21682.839897
> 6 26396.301413
> 8 29910.334798
> 10 29953.625797
> 15 29535.740343
> 20 28950.900431
> 30 27159.733949
> 40 24163.344207
> 50 23258.496794
>
> vs
>
> 2.6.22.17-0.1-default (opensuse 10.3 stock kernel)
> 1 7693.501369
> 2 15669.304960
> 3 25340.818410
> 4 24445.932930
> 5 22807.019544
> 6 24051.387364
> 8 22406.392813
> 10 22631.510576
> 15 21225.243584
> 20 20382.232075
> 30 18834.814588
> 40 17799.906622
> 50 17305.274561
Makes sense - I took a look at pgbench.c (and only thereafter took the
time to find the initial mail lkml where Greg rather nicely explained
its workings) - the thing with sync wakeups is that they try to pull
tasks together, but as this one task (pgbench) serves a number of
postgresql server tasks it will cluster everything.
Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of
its operation - I'm sure there are other scenarios that could benefit
from this.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 11:44 ` Peter Zijlstra
@ 2008-05-22 12:09 ` Mike Galbraith
2008-05-22 12:24 ` Peter Zijlstra
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-22 12:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dhaval Giani, Greg Smith, lkml, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 13:44 +0200, Peter Zijlstra wrote:
> Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of
> its operation - I'm sure there are other scenarios that could benefit
> from this.
Maybe simple (minded): cache waker's last non-interrupt context wakee,
if the wakee != cached, ignore SYNC_WAKEUP unless sync was requested at
call time?
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 12:09 ` Mike Galbraith
@ 2008-05-22 12:24 ` Peter Zijlstra
2008-05-22 13:16 ` Mike Galbraith
2008-05-23 7:13 ` Greg Smith
0 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2008-05-22 12:24 UTC (permalink / raw)
To: Mike Galbraith
Cc: Dhaval Giani, Greg Smith, lkml, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 14:09 +0200, Mike Galbraith wrote:
> On Thu, 2008-05-22 at 13:44 +0200, Peter Zijlstra wrote:
>
> > Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of
> > its operation - I'm sure there are other scenarios that could benefit
> > from this.
>
> Maybe simple (minded): cache waker's last non-interrupt context wakee,
> if the wakee != cached, ignore SYNC_WAKEUP unless sync was requested at
> call time?
Yeah, something like so - or perhaps like you say cache the wakee.
I picked the wake_affine() condition, because I think that is the
biggest factor in this behaviour. You could of course also disable all
of sync.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c86c5c5..856c2a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -950,6 +950,8 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *waker;
+
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 894a702..8971044 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1036,7 +1036,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && curr->sched_class == &fair_sched_class &&
+ p->se.waker == curr->se->waker) {
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1210,6 +1211,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
if (unlikely(se == pse))
return;
+ se->waker = pse;
cfs_rq_of(pse)->next = pse;
/*
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 12:24 ` Peter Zijlstra
@ 2008-05-22 13:16 ` Mike Galbraith
2008-05-23 7:13 ` Greg Smith
1 sibling, 0 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-05-22 13:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dhaval Giani, Greg Smith, lkml, Ingo Molnar, Srivatsa Vaddagiri
On Thu, 2008-05-22 at 14:24 +0200, Peter Zijlstra wrote:
> On Thu, 2008-05-22 at 14:09 +0200, Mike Galbraith wrote:
> > On Thu, 2008-05-22 at 13:44 +0200, Peter Zijlstra wrote:
> >
> > > Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of
> > > its operation - I'm sure there are other scenarios that could benefit
> > > from this.
> >
> > Maybe simple (minded): cache waker's last non-interrupt context wakee,
> > if the wakee != cached, ignore SYNC_WAKEUP unless sync was requested at
> > call time?
>
> Yeah, something like so - or perhaps like you say cache the wakee.
>
> I picked the wake_affine() condition, because I think that is the
> biggest factor in this behaviour. You could of course also disable all
> of sync.
Works fine (modulo booboo).
-Mike
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c86c5c5..856c2a8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -950,6 +950,8 @@ struct sched_entity {
> u64 last_wakeup;
> u64 avg_overlap;
>
> + struct sched_entity *waker;
> +
> #ifdef CONFIG_SCHEDSTATS
> u64 wait_start;
> u64 wait_max;
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 894a702..8971044 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1036,7 +1036,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
> * a reasonable amount of time then attract this newly
> * woken task:
> */
> - if (sync && curr->sched_class == &fair_sched_class) {
> + if (sync && curr->sched_class == &fair_sched_class &&
> + p->se.waker == curr->se->waker) {
> if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
> p->se.avg_overlap < sysctl_sched_migration_cost)
> return 1;
> @@ -1210,6 +1211,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
> if (unlikely(se == pse))
> return;
>
> + se->waker = pse;
> cfs_rq_of(pse)->next = pse;
>
> /*
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-22 12:24 ` Peter Zijlstra
2008-05-22 13:16 ` Mike Galbraith
@ 2008-05-23 7:13 ` Greg Smith
2008-05-23 10:00 ` Mike Galbraith
1 sibling, 1 reply; 38+ messages in thread
From: Greg Smith @ 2008-05-23 7:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mike Galbraith, Dhaval Giani, lkml, Ingo Molnar,
Srivatsa Vaddagiri
On Thu, 22 May 2008, Peter Zijlstra wrote:
> I picked the wake_affine() condition, because I think that is the
> biggest factor in this behaviour.
I tested out Peter's patch (updated version against -rc3 with a typo fix
from Mike below) and it's a big step in the right direction. Here are
updated results from my benchmark script, adding 2.6.26-rc3 and that rev
with this patch applied:
Clients 2.6.22 2.6.24 2.6.25 -rc3 patch
1 11052 10526 10700 10193 10439
2 16352 14447 10370 9817 13289
3 15414 17784 9403 9428 13678
4 14290 16832 8882 9533 13033
5 14211 16356 8527 9558 12790
6 13291 16763 9473 9367 12660
8 12374 15343 9093 9159 12357
10 11218 10732 9057 8711 11839
15 11116 7460 7113 7620 11267
20 11412 7171 7017 7707 10531
30 11191 7049 6896 7195 9766
40 11062 7001 6820 7079 9668
50 11255 6915 6797 7202 9588
Exact versions I tested because I think it may start mattering now:
2.6.22.19, 2.6.24.3, 2.6.25. I didn't save 2.6.23 results but recall them
being similar to 2.6.24.
On this dual-core system, without this patch there's an average of a a 33%
regression in -rc3 compared to 2.6.22. With it that's dropped to 8%; some
cases (around 10 clients) even improve a touch (it's enough within the
margin of error here I wouldn't conclude too much from that). The big
jump in high client count cases is the first I've seen that since CFS was
introduced. It seems a bit odd to me that there's still such a large
regression in the 2-8 client cases compared with not only 2.6.22 but
2.6.24, which owned this benchmark in that area.
With this feedback, any ideas on where to go next? There seems like's
some room for improvement still left here.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5395a61..e160f71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -965,6 +965,8 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *waker;
+
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e24ecd3..9db3cb4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1066,7 +1066,8 @@ wake_affine(struct rq *rq, struct sched_domain
*this_sd, struct rq *this_rq,
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && curr->sched_class == &fair_sched_class &&
+ p->se.waker == curr->se.waker) {
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap <
sysctl_sched_migration_cost)
return 1;
@@ -1238,6 +1239,7 @@ static void check_preempt_wakeup(struct rq *rq,
struct task_struct *p)
if (unlikely(se == pse))
return;
+ se->waker = pse;
cfs_rq_of(pse)->next = pse;
/*
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 7:13 ` Greg Smith
@ 2008-05-23 10:00 ` Mike Galbraith
2008-05-23 10:10 ` Ingo Molnar
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-23 10:00 UTC (permalink / raw)
To: Greg Smith
Cc: Peter Zijlstra, Dhaval Giani, lkml, Ingo Molnar,
Srivatsa Vaddagiri
On Fri, 2008-05-23 at 03:13 -0400, Greg Smith wrote:
> With this feedback, any ideas on where to go next? There seems like's
> some room for improvement still left here.
Dunno. This load is very highly tweakable, and doesn't seem to like
preemption much at all. You can see below what preemption is doing to
2.6.22.18 by looking at the batch numbers. SCHED_BATCH turns the O(1)
scheduler into a pathetic little round-robin scheduler, and this load
loves pathetic :-) After seeing the batch numbers, I tweaked .git to
make it as round-robin as I could.
My take on the numbers is that both kernels preempt too frequently for
_this_ load.. but what to do, many many loads desperately need
preemption to perform.
2.6.22.18 2.6.22.18-batch 2.6.26.git 2.6.26.git.batch
1 7487.115236 7643.563512 9999.400036 9915.823582
2 17074.869889 15360.150210 14042.644140 14958.375329
3 25073.139078 24802.446538 15621.206938 25047.032536
4 24236.413612 26126.482482 16436.055117 25007.183313
5 26367.198572 28298.293443 19926.550734 27853.081679
6 24695.827843 30786.651975 22375.916107 28119.474302
8 21020.949689 31973.674156 25825.292413 31070.664011
10 22792.204610 31775.164023 26754.471274 31596.415197
15 21202.173186 30388.559630 28711.761083 30963.050265
20 21204.041830 29317.044783 28512.269685 30127.614550
30 18519.965964 27252.739106 26682.613791 28185.244056
40 17936.447579 25670.803773 24964.936746 26282.369366
50 16247.605712 25089.154310 21078.604858 25356.750461
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 10:00 ` Mike Galbraith
@ 2008-05-23 10:10 ` Ingo Molnar
2008-05-23 10:15 ` Mike Galbraith
2008-05-23 13:05 ` Mike Galbraith
0 siblings, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2008-05-23 10:10 UTC (permalink / raw)
To: Mike Galbraith
Cc: Greg Smith, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
* Mike Galbraith <efault@gmx.de> wrote:
> My take on the numbers is that both kernels preempt too frequently for
> _this_ load.. but what to do, many many loads desperately need
> preemption to perform.
>
> 2.6.22.18 2.6.22.18-batch 2.6.26.git 2.6.26.git.batch
> 1 7487.115236 7643.563512 9999.400036 9915.823582
> 2 17074.869889 15360.150210 14042.644140 14958.375329
> 3 25073.139078 24802.446538 15621.206938 25047.032536
> 4 24236.413612 26126.482482 16436.055117 25007.183313
> 5 26367.198572 28298.293443 19926.550734 27853.081679
> 6 24695.827843 30786.651975 22375.916107 28119.474302
> 8 21020.949689 31973.674156 25825.292413 31070.664011
> 10 22792.204610 31775.164023 26754.471274 31596.415197
> 15 21202.173186 30388.559630 28711.761083 30963.050265
> 20 21204.041830 29317.044783 28512.269685 30127.614550
> 30 18519.965964 27252.739106 26682.613791 28185.244056
> 40 17936.447579 25670.803773 24964.936746 26282.369366
> 50 16247.605712 25089.154310 21078.604858 25356.750461
was 2.6.26.git.batch running the load with SCHED_BATCH, or did you do
other tweaks as well?
if it's other tweaks as well then could you perhaps try to make
SCHED_BATCH batch more agressively?
I.e. i think it's a perfectly fine answer to say "if your workload needs
batch scheduling, run it under SCHED_BATCH".
Ingo
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 10:10 ` Ingo Molnar
@ 2008-05-23 10:15 ` Mike Galbraith
2008-05-23 23:18 ` Greg Smith
2008-05-23 13:05 ` Mike Galbraith
1 sibling, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-23 10:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: Greg Smith, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 2008-05-23 at 12:10 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
>
> > My take on the numbers is that both kernels preempt too frequently for
> > _this_ load.. but what to do, many many loads desperately need
> > preemption to perform.
> >
> > 2.6.22.18 2.6.22.18-batch 2.6.26.git 2.6.26.git.batch
> > 1 7487.115236 7643.563512 9999.400036 9915.823582
> > 2 17074.869889 15360.150210 14042.644140 14958.375329
> > 3 25073.139078 24802.446538 15621.206938 25047.032536
> > 4 24236.413612 26126.482482 16436.055117 25007.183313
> > 5 26367.198572 28298.293443 19926.550734 27853.081679
> > 6 24695.827843 30786.651975 22375.916107 28119.474302
> > 8 21020.949689 31973.674156 25825.292413 31070.664011
> > 10 22792.204610 31775.164023 26754.471274 31596.415197
> > 15 21202.173186 30388.559630 28711.761083 30963.050265
> > 20 21204.041830 29317.044783 28512.269685 30127.614550
> > 30 18519.965964 27252.739106 26682.613791 28185.244056
> > 40 17936.447579 25670.803773 24964.936746 26282.369366
> > 50 16247.605712 25089.154310 21078.604858 25356.750461
>
> was 2.6.26.git.batch running the load with SCHED_BATCH, or did you do
> other tweaks as well?
It was running SCHED_BATCH, features=0.
> if it's other tweaks as well then could you perhaps try to make
> SCHED_BATCH batch more agressively?
That's what I was thinking, because it needed features=0 as well to
achieve O(1) batch performance.
> I.e. i think it's a perfectly fine answer to say "if your workload needs
> batch scheduling, run it under SCHED_BATCH".
Yes, and this appears to be such a case.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 10:15 ` Mike Galbraith
@ 2008-05-23 23:18 ` Greg Smith
2008-05-23 23:46 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Greg Smith @ 2008-05-23 23:18 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 23 May 2008, Mike Galbraith wrote:
> It was running SCHED_BATCH, features=0...it needed features=0 as well to
> achieve O(1) batch performance.
I figured out how to run pgbench with chrt in order to get SCHED_BATCH
behavior, but I don't understand what you mean by features=0 here. Since
I didn't see the same magnitude of different just using batch that seems
important, where does that get set at?
I'm also curious what hardware your results are coming from, to fit them
into my larger pgbench results context space.
Got my 4-core system back on-line again today (found some bad RAM) and
wanted to try another round of tests on that. Looks like you've defined 5
test sets I should replicate:
2.6.22
2.6.22, batch
2.6.26.git
2.6.26.git, batch
2.6.26.git, batch + se.load.weight patch
Should I still be trying Peter's se.waker patch as well in this mix
somewhere?
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 23:18 ` Greg Smith
@ 2008-05-23 23:46 ` Mike Galbraith
2008-05-24 8:08 ` Mike Galbraith
2008-05-27 0:28 ` Greg Smith
0 siblings, 2 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-05-23 23:46 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 2008-05-23 at 19:18 -0400, Greg Smith wrote:
> On Fri, 23 May 2008, Mike Galbraith wrote:
>
> > It was running SCHED_BATCH, features=0...it needed features=0 as well to
> > achieve O(1) batch performance.
>
> I figured out how to run pgbench with chrt in order to get SCHED_BATCH
> behavior, but I don't understand what you mean by features=0 here. Since
> I didn't see the same magnitude of different just using batch that seems
> important, where does that get set at?
/proc/sys/kernel/sched_features. You need CONFIG_SCHED_DEBUG to have
accsess to the scheduler tweakables.
> I'm also curious what hardware your results are coming from, to fit them
> into my larger pgbench results context space.
A grocery store Q6600 box.
> Got my 4-core system back on-line again today (found some bad RAM) and
> wanted to try another round of tests on that. Looks like you've defined 5
> test sets I should replicate:
>
> 2.6.22
> 2.6.22, batch
> 2.6.26.git
> 2.6.26.git, batch
> 2.6.26.git, batch + se.load.weight patch
>
> Should I still be trying Peter's se.waker patch as well in this mix
> somewhere?
Yeah.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 23:46 ` Mike Galbraith
@ 2008-05-24 8:08 ` Mike Galbraith
2008-05-27 0:28 ` Greg Smith
1 sibling, 0 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-05-24 8:08 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Sat, 2008-05-24 at 01:46 +0200, Mike Galbraith wrote:
> On Fri, 2008-05-23 at 19:18 -0400, Greg Smith wrote:
> > Should I still be trying Peter's se.waker patch as well in this mix
> > somewhere?
>
> Yeah.
btw, the problem with 2.6.25.4 and this load is one and the same. With
a 1:N load, you really don't want work generator waking all worker-bees
on it's CPU. The patchlet below let's you turn it off.
diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..5641eb8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,7 @@ enum {
SCHED_FEAT_START_DEBIT = 4,
SCHED_FEAT_HRTICK = 8,
SCHED_FEAT_DOUBLE_TICK = 16,
+ SCHED_FEAT_SYNC_WAKEUPS = 32,
};
const_debug unsigned int sysctl_sched_features =
@@ -603,7 +604,8 @@ const_debug unsigned int sysctl_sched_features =
SCHED_FEAT_WAKEUP_PREEMPT * 1 |
SCHED_FEAT_START_DEBIT * 1 |
SCHED_FEAT_HRTICK * 1 |
- SCHED_FEAT_DOUBLE_TICK * 0;
+ SCHED_FEAT_DOUBLE_TICK * 0 |
+ SCHED_FEAT_SYNC_WAKEUPS * 0;
#define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
@@ -1902,6 +1904,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
long old_state;
struct rq *rq;
+ if (!sched_feat(SYNC_WAKEUPS))
+ sync = 0;
+
smp_wmb();
rq = task_rq_lock(p, &flags);
old_state = p->state;
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 23:46 ` Mike Galbraith
2008-05-24 8:08 ` Mike Galbraith
@ 2008-05-27 0:28 ` Greg Smith
2008-05-27 5:59 ` [patch] " Mike Galbraith
1 sibling, 1 reply; 38+ messages in thread
From: Greg Smith @ 2008-05-27 0:28 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
After spending a whole day testing various scheduler options, I've got a
pretty good idea how possible improvements here might map out. Let's
start with Mike's results (slightly reformatted), from his "grocery store
Q6600 box" similar to the one my results in this message come from:
.22.18 .22.18b .26.git .26.git.batch
1 7487 7644 9999 9916
2 17075 15360 14043 14958
3 25073 24802 15621 25047
4 24236 26126 16436 25007
5 26367 28298 19927 27853
6 24696 30787 22376 28119
8 21021 31974 25825 31071
10 22792 31775 26754 31596
15 21202 30389 28712 30963
20 21204 29317 28512 30128
30 18520 27253 26683 28185
40 17936 25671 24965 26282
50 16248 25089 21079 25357
I couldn't replicate that batch mode improvement in 2.6.22 or 2.6.26.git,
so I asked Mike for some clarification about how he did the batch testing
here:
> I used a tool someone posted (quite a) a few years ago, which I added
> batch support to. I just start the script ala
> schedctl -B ./selecttest.sh.
> I put server startup and shutdown into the script as well, and that's
> the important bit you're missing methinks - postgress must be run as
> SCHED_BATCH, lest each and every instance attain max dynamic priority,
> and preempt pgbench.
Which explains the difference: I was just running pgbench as "chrt -b cmd
pgbench ..." which doesn't help at all. I am uncomfortable with the idea
of running the database server itself as a batch process. While it may be
effective for optimizing this benchmark, I think it's in general a bad
idea because it may de-tune it for more real-world workloads like web
applications. Also, that requires being intrusive into people's setup
scripts, which bothers me a lot more than doing a bit of kernel tuning at
system startup.
Mike also suggested a patch that adjusted se.load.weight. That didn't
seem helpful in any of the cases I tested, presumably it helps with the
all batch-mode setup I didn't try properly.
I did again get useful results here with the stock 2.6.26.git kernel and
default parameters using Peter's small patch to adjust se.waker.
What I found most interesting was how the results changed when I set
/proc/sys/kernel/sched_features = 0, without doing anything with batch
mode. The default for that is 1101111111=895. What I then did was run
through setting each of those bits off one by one to see which feature(s)
were getting in the way here. The two that mattered a lot were 895-32=863
(no SCHED_FEAT_SYNC_WAKEUPS ) and 895-2=893 (no
SCHED_FEAT_WAKEUP_PREEMPT). Combining those two but keeping the rest of
the features on (895-32-2=861) actually gave the best result I've ever
seen here, better than with all the features disabled. Tossing out all
the tests I did that didn't show anything useful, here's my table of the
interesting results.
Clients .22.19 .26.git waker f=0 f=893 f=863 f=861
1 7660 11043 11041 9214 11204 9232 9433
2 17798 11452 16306 16916 11165 16686 16097
3 29612 13231 18476 24202 11348 26210 26906
4 25584 13053 17942 26639 11331 25094 25679
6 25295 12263 18472 28918 11761 30525 33297
8 24344 11748 19109 32730 12190 31775 35912
10 23963 11612 19537 31688 12331 29644 36215
15 23026 11414 19518 33209 13050 28651 36452
20 22549 11332 19029 32583 13544 25776 35707
30 22074 10743 18884 32447 14191 21772 33501
40 21495 10406 18609 31704 11017 20600 32743
50 20051 10534 17478 29483 14683 19949 31047
60 18690 9816 17467 28614 14817 18681 29576
Note that compared to earlier test runs, I replaced the 5 client case with
a 60 client one to get more data on the top end. I also wouldn't pay too
much attention to the single client case; that one really bounces around a
lot on most of the kernel revs, even with me doing 5 runs and using the
median.
These results give me a short-term answer I can move forward with for now:
if people want to know how to get useful select-only pgbench results using
2.6.26-git, I can suggest "echo 861 > /proc/sys/kernel/sched_features" and
know that will give results that crush the older scheduler without making
any additional changes. That's great progress and I appreciate all of
Mike's work in particular to reaching this point.
Some still open questions after this long investigation that I'd like to
know the answers to are:
1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
posted? I'm not sure what was different about his test setup here. The
2.6.22 results are pretty similar, and the fully tuned ones as well, so
the big difference on that column bugs me.
2) Mike suggested a patch to 2.6.25 in this thread that backports the
feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
push that into 2.6.25.5? It's clearly quite useful for this load and
therefore possibly others.
3) Peter's se.waker patch is a big step forward on this workload without
any tuning, closing a significant amount of the gap between the default
setup and what I get with the two troublesome features turned off
altogether. What issues might there be with pushing that into the stock
{2.6.25|2.6.26} kernel?
4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
and SCHED_FEAT_WAKEUP_PREEMPT are disabled? I'd think that any attempt to
tune those code paths would need my case for "works better when
SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
balance modifications against.
5) Once (4) has identified some tests cases, what else might be done to
make the default behavior better without killing the situations it's
intended for?
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
^ permalink raw reply [flat|nested] 38+ messages in thread* [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-27 0:28 ` Greg Smith
@ 2008-05-27 5:59 ` Mike Galbraith
2008-05-27 8:20 ` Mike Galbraith
2008-06-06 5:03 ` Greg Smith
0 siblings, 2 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-05-27 5:59 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
[-- Attachment #1: Type: text/plain, Size: 7103 bytes --]
On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
> I did again get useful results here with the stock 2.6.26.git kernel and
> default parameters using Peter's small patch to adjust se.waker.
>
> What I found most interesting was how the results changed when I set
> /proc/sys/kernel/sched_features = 0, without doing anything with batch
> mode. The default for that is 1101111111=895. What I then did was run
> through setting each of those bits off one by one to see which feature(s)
> were getting in the way here. The two that mattered a lot were 895-32=863
> (no SCHED_FEAT_SYNC_WAKEUPS ) and 895-2=893 (no
> SCHED_FEAT_WAKEUP_PREEMPT). Combining those two but keeping the rest of
> the features on (895-32-2=861) actually gave the best result I've ever
> seen here, better than with all the features disabled. Tossing out all
> the tests I did that didn't show anything useful, here's my table of the
> interesting results.
>
> Clients .22.19 .26.git waker f=0 f=893 f=863 f=861
> 1 7660 11043 11041 9214 11204 9232 9433
> 2 17798 11452 16306 16916 11165 16686 16097
> 3 29612 13231 18476 24202 11348 26210 26906
> 4 25584 13053 17942 26639 11331 25094 25679
> 6 25295 12263 18472 28918 11761 30525 33297
> 8 24344 11748 19109 32730 12190 31775 35912
> 10 23963 11612 19537 31688 12331 29644 36215
> 15 23026 11414 19518 33209 13050 28651 36452
> 20 22549 11332 19029 32583 13544 25776 35707
> 30 22074 10743 18884 32447 14191 21772 33501
> 40 21495 10406 18609 31704 11017 20600 32743
> 50 20051 10534 17478 29483 14683 19949 31047
> 60 18690 9816 17467 28614 14817 18681 29576
>
> Note that compared to earlier test runs, I replaced the 5 client case with
> a 60 client one to get more data on the top end. I also wouldn't pay too
> much attention to the single client case; that one really bounces around a
> lot on most of the kernel revs, even with me doing 5 runs and using the
> median.
Care to give the below a whirl? If fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to addresss the needs
of any particular load though, that needs more thought (tricky issue).
With default features, I get the below.
2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434
(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though, maximizing performance is separate issue imho)
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.
Signed-off-by: Mike Galbraith <efault@gmx.de>
kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)
Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
struct task_struct *curr = this_rq->curr;
unsigned long tl = this_load;
unsigned long tl_per_task;
+ int bad_imbalance;
- if (!(this_sd->flags & SD_WAKE_AFFINE))
+ if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
return 0;
/*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync && tl)
+ tl -= curr->se.load.weight;
+
+ bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+ /*
* If the currently running task will sleep within
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class)
{
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
schedstat_inc(p, se.nr_wakeups_affine_attempts);
tl_per_task = cpu_avg_load_per_task(this_cpu);
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
+ !bad_imbalance) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
> Some still open questions after this long investigation that I'd like to
> know the answers to are:
>
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
> posted? I'm not sure what was different about his test setup here. The
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so
> the big difference on that column bugs me.
Because those were git _with_ Peter's patch. I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.
> 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
> push that into 2.6.25.5? It's clearly quite useful for this load and
> therefore possibly others.
If the patch above flies, imho, that should be folded into the backport
to stable. The heart of the patch is a bugfix, so definitely stable
material. Whether to enable the debugging/tweaking knobs as well is a
different question. (I would do it, but I ain't the maintainer;)
> 3) Peter's se.waker patch is a big step forward on this workload without
> any tuning, closing a significant amount of the gap between the default
> setup and what I get with the two troublesome features turned off
> altogether. What issues might there be with pushing that into the stock
> {2.6.25|2.6.26} kernel?
(it doesn't fully address the 1:N needs, that needs much more thought)
> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled? I'd think that any attempt to
> tune those code paths would need my case for "works better when
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
> balance modifications against.
I suspect very many, certainly any load where latency is of major
importance. Peak performance of the mysql+oltp benchmark for sure is
injured with your settings. As a really good demonstration of how
important wakeup preemption can be, try running the attached testcase,
which was distilled from a real world load and posted to lkml a few
years ago, without wakeup preemption and nailed to one cpu.
> 5) Once (4) has identified some tests cases, what else might be done to
> make the default behavior better without killing the situations it's
> intended for?
See patch :)
-Mike
[-- Attachment #2: starve.c --]
[-- Type: text/x-csrc, Size: 715 bytes --]
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
volatile unsigned long loop = 10000000;
void
handler (int n)
{
if (loop > 0)
--loop;
}
static int
child (void)
{
pid_t ppid = getppid ();
sleep (1);
while (1)
kill (ppid, SIGUSR1);
return 0;
}
int
main (int argc, char **argv)
{
pid_t child_pid;
int r;
loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
printf ("expecting to receive %lu signals\n", loop);
if ((child_pid = fork ()) == 0)
exit (child ());
signal (SIGUSR1, handler);
while (loop)
sleep (1);
r = kill (child_pid, SIGTERM);
waitpid (child_pid, NULL, 0);
return 0;
}
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-27 5:59 ` [patch] " Mike Galbraith
@ 2008-05-27 8:20 ` Mike Galbraith
2008-05-27 8:35 ` Mike Galbraith
2008-06-06 5:03 ` Greg Smith
1 sibling, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-27 8:20 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Tue, 2008-05-27 at 07:59 +0200, Mike Galbraith wrote:
> On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
>
> > 2) Mike suggested a patch to 2.6.25 in this thread that backports
> the
> > feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
> > push that into 2.6.25.5? It's clearly quite useful for this load and
> > therefore possibly others.
>
> If the patch above flies, imho, that should be folded into the backport
> to stable. The heart of the patch is a bugfix, so definitely stable
> material. Whether to enable the debugging/tweaking knobs as well is a
> different question. (I would do it, but I ain't the maintainer;)
Hm, pbench's extreme dislike of preemption, and the starvation testcase
I sent earlier having an absolute requirement of preemption kinda argues
that some knobs and dials should be per task or task group (or, or... or
scheduler should be all knowing all seeing;)
2.6.25.4-feat=45 2.6.25.4-feat=111 2.6.25.4-feat=47
1 11385.471887 10292.721924 9551.157672
2 16709.515434 15540.399522 16283.968970
3 25456.658841 20187.320016 24562.735943
4 24453.435157 24975.037450 23391.583053
5 25504.302958 23102.131056 23671.860667
6 27076.359200 24688.791507 25947.592071
8 31758.200682 29462.639752 29700.144372
10 32190.081142 30428.413809 27439.441838
15 31175.074906 11097.668025 20344.284129
20 30513.974332 10742.166624 19256.695409
30 28307.399275 10233.708047 17535.423344
40 26720.463867 10037.856773 16104.895695
50 24899.945793 9907.624283 15768.746911
Anyway, if patchlet flies, and Ingo concurs, I'll submit the below.
Prevent short-running wakers of short-running threads from overloading a
single
cpu via wakeup affinity, and provide affinity related debug/tuning
options.
Signed-off-by: Mike Galbraith <efault@gmx.de>
kernel/sched.c | 9 ++++++++-
kernel/sched_fair.c | 25 ++++++++++++++-----------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..d6d70a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,8 @@ enum {
SCHED_FEAT_START_DEBIT = 4,
SCHED_FEAT_HRTICK = 8,
SCHED_FEAT_DOUBLE_TICK = 16,
+ SCHED_FEAT_AFFINE_WAKEUPS = 32,
+ SCHED_FEAT_SYNC_WAKEUPS = 64,
};
const_debug unsigned int sysctl_sched_features =
@@ -603,7 +605,9 @@ const_debug unsigned int sysctl_sched_features =
SCHED_FEAT_WAKEUP_PREEMPT * 1 |
SCHED_FEAT_START_DEBIT * 1 |
SCHED_FEAT_HRTICK * 1 |
- SCHED_FEAT_DOUBLE_TICK * 0;
+ SCHED_FEAT_DOUBLE_TICK * 0 |
+ SCHED_FEAT_AFFINE_WAKEUPS * 1 |
+ SCHED_FEAT_SYNC_WAKEUPS * 1;
#define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
@@ -1902,6 +1906,9 @@ static int try_to_wake_up(struct task_struct *p,
unsigned int state, int sync)
long old_state;
struct rq *rq;
+ if (!sched_feat(SYNC_WAKEUPS))
+ sync = 0;
+
smp_wmb();
rq = task_rq_lock(p, &flags);
old_state = p->state;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0080968..bf2defe 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -992,16 +992,27 @@ wake_affine(struct rq *rq, struct sched_domain
*this_sd, struct rq *this_rq,
struct task_struct *curr = this_rq->curr;
unsigned long tl = this_load;
unsigned long tl_per_task;
+ int bad_imbalance;
- if (!(this_sd->flags & SD_WAKE_AFFINE))
+ if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
return 0;
/*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync && tl)
+ tl -= curr->se.load.weight;
+
+ bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+ /*
* If the currently running task will sleep within
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class)
{
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1010,16 +1021,8 @@ wake_affine(struct rq *rq, struct sched_domain
*this_sd, struct rq *this_rq,
schedstat_inc(p, se.nr_wakeups_affine_attempts);
tl_per_task = cpu_avg_load_per_task(this_cpu);
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
+ !bad_imbalance) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-27 8:20 ` Mike Galbraith
@ 2008-05-27 8:35 ` Mike Galbraith
0 siblings, 0 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-05-27 8:35 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Tue, 2008-05-27 at 10:20 +0200, Mike Galbraith wrote:
> Hm, pbench's extreme dislike of preemption, and the starvation testcase
> I sent earlier having an absolute requirement of preemption kinda argues
> that some knobs and dials should be per task or task group (or, or... or
> scheduler should be all knowing all seeing;)
(to somewhat solidify the random thought i'm sharing...)
Perhaps a SCHED_PREEMPT class so such things can co-exist:
SCHED_BATCH == I never preempt.
SCHED_NORMAL == I preempt sometimes.
SCHED_PREEMPT == I always preempt my waker.
(end of random synaptic firing;)
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-27 5:59 ` [patch] " Mike Galbraith
2008-05-27 8:20 ` Mike Galbraith
@ 2008-06-06 5:03 ` Greg Smith
2008-06-06 6:13 ` Mike Galbraith
1 sibling, 1 reply; 38+ messages in thread
From: Greg Smith @ 2008-06-06 5:03 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Tue, 27 May 2008, Mike Galbraith wrote:
> Care to give the below a whirl? If fixes the over-enthusiastic affinity
> bug in a less restrictive way. It doesn't attempt to addresss the needs
> of any particular load though, that needs more thought (tricky issue).
>
> With default features, I get the below...
Sorry I didn't get back to you until now, got distracted for a bit.
Here's my table now updated with this patched version and with your
numbers for comparision, since we have the same basic processor setup:
Clients .22.19 .26.git patch Mike
1 7660 11043 11003 10122
2 17798 11452 16868 14360
3 29612 13231 20381 17049
4 25584 13053 22222 18749
6 25295 12263 23546 24913
8 24344 11748 23895 27976
10 23963 11612 22492 29347
15 23026 11414 21896 29157
20 22549 11332 21015 28392
30 22074 10743 18411 26590
40 21495 10406 17982 24422
50 20051 10534 17009 23306
So this is a huge win for this patch compared with the stock 2.6.26.git
(I'm still using the daily snapshot from 2008-05-26) and a nice
improvement over the earlier, smaller patches I tested in this thread
(which peaked at 19537 for 10 clients for me with default features, vs. a
peak of 23895 @ 8 here).
I think I might not be testing exactly the same thing you did, though,
because the pattern doesn't match. I think that my Q6600 system runs a
little bit faster than yours, which is the case for small numbers of
clients here. But once we get above 8 clients your setup is way faster,
with the difference at 15 clients being the largest. Were you perhaps
using batch mode when you generated these results? Only thing I could
think of that would produce this pattern. If it's not something simple
like that, I may have to dig into whether there was some change in the git
snapshot between what you tested and what I did.
Regardless, clearly your patch reduces the regression with the default
parameters to a mild one instead of the gigantic one we started with.
Considering how generally incompatible this benchmark is with this
scheduler, and that there are clear workarounds (feature disabling) I can
document in PostgreSQL land to "fix" the problem defined for me now, I'd
be happy if all that came from this investigation was this change. I'd
hope that being strengthened against this workload improves the
scheduler's robustness for other tasks of this type, which I'm sure there
are more of than just pgbench.
You get my vote for moving toward committing it+backport even if
the improvement is only what I saw in my tests. If I can figure out how
to get closer to the results you got, all the better.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-06 5:03 ` Greg Smith
@ 2008-06-06 6:13 ` Mike Galbraith
2008-06-07 11:38 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-06-06 6:13 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 2008-06-06 at 01:03 -0400, Greg Smith wrote:
> I think I might not be testing exactly the same thing you did, though,
> because the pattern doesn't match. I think that my Q6600 system runs a
> little bit faster than yours, which is the case for small numbers of
> clients here. But once we get above 8 clients your setup is way faster,
> with the difference at 15 clients being the largest. Were you perhaps
> using batch mode when you generated these results?
No, those were with stock settings.
> Regardless, clearly your patch reduces the regression with the default
> parameters to a mild one instead of the gigantic one we started with.
Unfortunately, after the recent reverts, we're right back to huge :-/
I'm trying to come up with a dirt simple solution that doesn't harm
other load types. I've found no clear reason why we regressed so badly,
it seems to be a luck of the draw run order thing. As soon as the load
starts jamming up a bit, it avalanches into a serialized mess again. I
know the why, just need to find that dirt simple and pure win fix.
> Considering how generally incompatible this benchmark is with this
> scheduler, and that there are clear workarounds (feature disabling) I can
> document in PostgreSQL land to "fix" the problem defined for me now, I'd
> be happy if all that came from this investigation was this change. I'd
> hope that being strengthened against this workload improves the
> scheduler's robustness for other tasks of this type, which I'm sure there
> are more of than just pgbench.
I consider pgbench to be a pretty excellent testcase. Getting this
fixed properly will certainly benefit similar loads, Xorg being one
that's just not as extreme as pgbench.
> You get my vote for moving toward committing it+backport even if
> the improvement is only what I saw in my tests. If I can figure out how
> to get closer to the results you got, all the better.
It's committed, but I don't think a back-port is justified. It does
what it's supposed to do, but there's a part 2. I suspect that your
results differ from mine due to that luck of the run order draw thing.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-06 6:13 ` Mike Galbraith
@ 2008-06-07 11:38 ` Mike Galbraith
2008-06-07 12:50 ` Mike Galbraith
2008-06-07 13:08 ` Peter Zijlstra
0 siblings, 2 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 11:38 UTC (permalink / raw)
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 2008-06-06 at 08:13 +0200, Mike Galbraith wrote:
> On Fri, 2008-06-06 at 01:03 -0400, Greg Smith wrote:
>
> > I think I might not be testing exactly the same thing you did, though,
> > because the pattern doesn't match. I think that my Q6600 system runs a
> > little bit faster than yours, which is the case for small numbers of
> > clients here. But once we get above 8 clients your setup is way faster,
> > with the difference at 15 clients being the largest. Were you perhaps
> > using batch mode when you generated these results?
>
> No, those were with stock settings.
>
> > Regardless, clearly your patch reduces the regression with the default
> > parameters to a mild one instead of the gigantic one we started with.
>
> Unfortunately, after the recent reverts, we're right back to huge :-/
>
> I'm trying to come up with a dirt simple solution that doesn't harm
> other load types.
The below doesn't hurt my volanomark numbers of the day, helps pgbench
considerably, and improves the higher client end of mysql+oltp a wee
bit. It may hurt the low end a wee bit, but the low end is always
pretty unstable, so it's hard to tell with only three runs.
pgbench
2.6.26-rc5 2.6.26-rc5+
1 10213.768037 10237.990274 10165.511814 10183.705908
2 15885.949053 15519.005195 14994.697875 15204.900479
3 15663.233356 16043.733087 16554.371722 17279.376443
4 14193.807355 15799.792612 18447.345925 18088.861169
5 17239.456219 17326.938538 20119.250823 18537.351094
6 15293.624093 14272.208159 21439.841579 22634.887824
8 12483.727461 13486.991527 25579.379337 25908.373483
10 11919.023584 12058.503518 23876.035623 22403.867804
15 10128.724654 11253.959398 23276.797649 23595.597093
20 9645.056147 9980.465235 23603.315133 23256.506240
30 9288.747962 8801.059613 23633.448266 23229.286697
40 8494.705123 8323.107702 22925.552706 23081.526954
50 8357.781935 8239.867147 19102.481374 19558.624434
volanomark
2.6.26-rc5
test-1.log:Average throughput = 101768 messages per second
test-2.log:Average throughput = 99124 messages per second
test-3.log:Average throughput = 99821 messages per second
test-1.log:Average throughput = 101362 messages per second
test-2.log:Average throughput = 98891 messages per second
test-3.log:Average throughput = 99164 messages per second
2.6.26-rc5+
test-1.log:Average throughput = 103275 messages per second
test-2.log:Average throughput = 100034 messages per second
test-3.log:Average throughput = 99434 messages per second
test-1.log:Average throughput = 100460 messages per second
test-2.log:Average throughput = 100188 messages per second
test-3.log:Average throughput = 99617 messages per second
Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
update_stats_dequeue(cfs_rq, se);
if (sleep) {
+ se->last_preempter = NULL;
update_avg_stats(cfs_rq, se);
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
- if (delta_exec > ideal_runtime)
+ if (delta_exec > ideal_runtime) {
+ curr->last_preempter = NULL;
resched_task(rq_of(cfs_rq)->curr);
+ }
}
static void
@@ -994,6 +997,7 @@ wake_affine(struct rq *rq, struct sched_
unsigned int imbalance)
{
struct task_struct *curr = this_rq->curr;
+ struct sched_entity *se = &curr->se, *pse = &p->se;
unsigned long tl = this_load;
unsigned long tl_per_task;
int balanced;
@@ -1002,14 +1006,26 @@ wake_affine(struct rq *rq, struct sched_
return 0;
/*
+ * If the current task is being wakeup preempted by multiple tasks
+ * that it awakened, such that it can't get significant work done
+ * between preemptions, try to spread these preemption sources.
+ */
+ if (sync && se->last_preempter && se->last_preempter != pse) {
+ u64 se_last_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (se_last_exec < sysctl_sched_migration_cost)
+ return 0;
+ }
+
+ /*
* If sync wakeup then subtract the (maximum possible)
* effect of the currently running task from the load
* of the current CPU:
*/
if (sync)
- tl -= current->se.load.weight;
+ tl -= se->load.weight;
- balanced = 100*(tl + p->se.load.weight) <= imbalance*load;
+ balanced = 100*(tl + pse->load.weight) <= imbalance*load;
/*
* If the currently running task will sleep within
@@ -1017,8 +1033,8 @@ wake_affine(struct rq *rq, struct sched_
* woken task:
*/
if (sync && balanced && curr->sched_class == &fair_sched_class) {
- if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
- p->se.avg_overlap < sysctl_sched_migration_cost)
+ if (se->avg_overlap < sysctl_sched_migration_cost &&
+ pse->avg_overlap < sysctl_sched_migration_cost)
return 1;
}
@@ -1219,8 +1235,27 @@ static void check_preempt_wakeup(struct
pse = parent_entity(pse);
}
- if (wakeup_preempt_entity(se, pse) == 1)
- resched_task(curr);
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ int preempt = 1;
+
+ /*
+ * If current task is being prempted by multiple wakees,
+ * tag it for 1:N affine wakeup preemption avoidance.
+ */
+ if (se->last_preempter && se->last_preempter != pse &&
+ se->load.weight >= pse->load.weight) {
+ u64 exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (exec < sysctl_sched_migration_cost)
+ preempt = 0;
+ }
+
+ if (se == ¤t->se)
+ se->last_preempter = pse;
+
+ if (preempt)
+ resched_task(curr);
+ }
}
static struct task_struct *pick_next_task_fair(struct rq *rq)
Index: linux-2.6.26.git/include/linux/sched.h
===================================================================
--- linux-2.6.26.git.orig/include/linux/sched.h
+++ linux-2.6.26.git/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *last_preempter;
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
Index: linux-2.6.26.git/kernel/sched.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched.c
+++ linux-2.6.26.git/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_str
p->se.prev_sum_exec_runtime = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;
+ p->se.last_preempter = NULL;
#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 11:38 ` Mike Galbraith
@ 2008-06-07 12:50 ` Mike Galbraith
2008-06-07 13:07 ` Peter Zijlstra
2008-06-07 13:08 ` Peter Zijlstra
1 sibling, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 12:50 UTC (permalink / raw)
To: Ingo Molnar, Greg Smith
Cc: Peter Zijlstra, Dhaval Giani, lkml, Srivatsa Vaddagiri
[-- Attachment #1: Type: text/plain, Size: 374 bytes --]
Since I tested mysql+oltp and made the dang pdf of the results, I may
as well actually attach the thing <does that before continuing...>.
BTW, I have a question wrt avg_overlap. When a wakeup cause the current
task to begin sharing CPU with a freshly awakened task, the current task
is tagged.. but the wakee isn't. How come? If one is sharing, so is
the other.
-Mike
[-- Attachment #2: zz.pdf --]
[-- Type: application/pdf, Size: 16583 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 12:50 ` Mike Galbraith
@ 2008-06-07 13:07 ` Peter Zijlstra
2008-06-07 14:16 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-06-07 13:07 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Greg Smith, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 14:50 +0200, Mike Galbraith wrote:
> Since I tested mysql+oltp and made the dang pdf of the results, I may
> as well actually attach the thing <does that before continuing...>.
>
> BTW, I have a question wrt avg_overlap. When a wakeup cause the current
> task to begin sharing CPU with a freshly awakened task, the current task
> is tagged.. but the wakee isn't. How come? If one is sharing, so is
> the other.
avg_overlap is about measuring how long we'll run after waking someone
else. The other measure, how long our waker shares the cpu with us,
hasn't proven to be relevant so far.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 13:07 ` Peter Zijlstra
@ 2008-06-07 14:16 ` Mike Galbraith
2008-06-07 16:16 ` Peter Zijlstra
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 14:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Greg Smith, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 15:07 +0200, Peter Zijlstra wrote:
> On Sat, 2008-06-07 at 14:50 +0200, Mike Galbraith wrote:
> > Since I tested mysql+oltp and made the dang pdf of the results, I may
> > as well actually attach the thing <does that before continuing...>.
> >
> > BTW, I have a question wrt avg_overlap. When a wakeup cause the current
> > task to begin sharing CPU with a freshly awakened task, the current task
> > is tagged.. but the wakee isn't. How come? If one is sharing, so is
> > the other.
>
> avg_overlap is about measuring how long we'll run after waking someone
> else. The other measure, how long our waker shares the cpu with us,
> hasn't proven to be relevant so far.
Yeah wrt relevance, I've been playing with making it mean this and that,
with approx 0 success ;-) If it's a measure of how long we run after
waking though, don't we need to make sure it's not a cross CPU wakeup?
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 14:16 ` Mike Galbraith
@ 2008-06-07 16:16 ` Peter Zijlstra
2008-06-07 17:56 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-06-07 16:16 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Greg Smith, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 16:16 +0200, Mike Galbraith wrote:
> On Sat, 2008-06-07 at 15:07 +0200, Peter Zijlstra wrote:
> > On Sat, 2008-06-07 at 14:50 +0200, Mike Galbraith wrote:
> > > Since I tested mysql+oltp and made the dang pdf of the results, I may
> > > as well actually attach the thing <does that before continuing...>.
> > >
> > > BTW, I have a question wrt avg_overlap. When a wakeup cause the current
> > > task to begin sharing CPU with a freshly awakened task, the current task
> > > is tagged.. but the wakee isn't. How come? If one is sharing, so is
> > > the other.
> >
> > avg_overlap is about measuring how long we'll run after waking someone
> > else. The other measure, how long our waker shares the cpu with us,
> > hasn't proven to be relevant so far.
>
> Yeah wrt relevance, I've been playing with making it mean this and that,
> with approx 0 success ;-) If it's a measure of how long we run after
> waking though, don't we need to make sure it's not a cross CPU wakeup?
The idea was to dynamically detect sync wakeups, who's defining property
is that the waker will sleep after waking the wakee. And who's effect is
pulling tasks together on wakeups - so that we might have the most
benefit of cache sharing.
So if we were to exclude cross cpu wakeups from this measurement we'd
handicap the whole scheme, because then we'd never measure that its
actually a sync wakeup and wants to run on the same cpu.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 16:16 ` Peter Zijlstra
@ 2008-06-07 17:56 ` Mike Galbraith
0 siblings, 0 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 17:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Greg Smith, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 18:16 +0200, Peter Zijlstra wrote:
> The idea was to dynamically detect sync wakeups, who's defining property
> is that the waker will sleep after waking the wakee. And who's effect is
> pulling tasks together on wakeups - so that we might have the most
> benefit of cache sharing.
>
> So if we were to exclude cross cpu wakeups from this measurement we'd
> handicap the whole scheme, because then we'd never measure that its
> actually a sync wakeup and wants to run on the same cpu.
Ah, I get it now, thanks.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 11:38 ` Mike Galbraith
2008-06-07 12:50 ` Mike Galbraith
@ 2008-06-07 13:08 ` Peter Zijlstra
2008-06-07 14:54 ` [patch part 2] " Mike Galbraith
1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-06-07 13:08 UTC (permalink / raw)
To: Mike Galbraith
Cc: Greg Smith, Ingo Molnar, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
Interesting.. Looks good.
> Index: linux-2.6.26.git/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.26.git.orig/kernel/sched_fair.c
> +++ linux-2.6.26.git/kernel/sched_fair.c
> @@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
> update_stats_dequeue(cfs_rq, se);
> if (sleep) {
> + se->last_preempter = NULL;
> update_avg_stats(cfs_rq, se);
> #ifdef CONFIG_SCHEDSTATS
> if (entity_is_task(se)) {
> @@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>
> ideal_runtime = sched_slice(cfs_rq, curr);
> delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> - if (delta_exec > ideal_runtime)
> + if (delta_exec > ideal_runtime) {
> + curr->last_preempter = NULL;
> resched_task(rq_of(cfs_rq)->curr);
> + }
> }
>
> static void
> @@ -994,6 +997,7 @@ wake_affine(struct rq *rq, struct sched_
> unsigned int imbalance)
> {
> struct task_struct *curr = this_rq->curr;
> + struct sched_entity *se = &curr->se, *pse = &p->se;
> unsigned long tl = this_load;
> unsigned long tl_per_task;
> int balanced;
> @@ -1002,14 +1006,26 @@ wake_affine(struct rq *rq, struct sched_
> return 0;
>
> /*
> + * If the current task is being wakeup preempted by multiple tasks
> + * that it awakened, such that it can't get significant work done
> + * between preemptions, try to spread these preemption sources.
> + */
> + if (sync && se->last_preempter && se->last_preempter != pse) {
> + u64 se_last_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +
> + if (se_last_exec < sysctl_sched_migration_cost)
> + return 0;
> + }
> +
> + /*
> * If sync wakeup then subtract the (maximum possible)
> * effect of the currently running task from the load
> * of the current CPU:
> */
> if (sync)
> - tl -= current->se.load.weight;
> + tl -= se->load.weight;
>
> - balanced = 100*(tl + p->se.load.weight) <= imbalance*load;
> + balanced = 100*(tl + pse->load.weight) <= imbalance*load;
>
> /*
> * If the currently running task will sleep within
> @@ -1017,8 +1033,8 @@ wake_affine(struct rq *rq, struct sched_
> * woken task:
> */
> if (sync && balanced && curr->sched_class == &fair_sched_class) {
> - if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
> - p->se.avg_overlap < sysctl_sched_migration_cost)
> + if (se->avg_overlap < sysctl_sched_migration_cost &&
> + pse->avg_overlap < sysctl_sched_migration_cost)
> return 1;
> }
>
> @@ -1219,8 +1235,27 @@ static void check_preempt_wakeup(struct
> pse = parent_entity(pse);
> }
>
> - if (wakeup_preempt_entity(se, pse) == 1)
> - resched_task(curr);
> + if (wakeup_preempt_entity(se, pse) == 1) {
> + int preempt = 1;
> +
> + /*
> + * If current task is being prempted by multiple wakees,
> + * tag it for 1:N affine wakeup preemption avoidance.
> + */
> + if (se->last_preempter && se->last_preempter != pse &&
> + se->load.weight >= pse->load.weight) {
> + u64 exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +
> + if (exec < sysctl_sched_migration_cost)
> + preempt = 0;
> + }
> +
> + if (se == ¤t->se)
> + se->last_preempter = pse;
> +
> + if (preempt)
> + resched_task(curr);
> + }
> }
>
> static struct task_struct *pick_next_task_fair(struct rq *rq)
> Index: linux-2.6.26.git/include/linux/sched.h
> ===================================================================
> --- linux-2.6.26.git.orig/include/linux/sched.h
> +++ linux-2.6.26.git/include/linux/sched.h
> @@ -963,6 +963,7 @@ struct sched_entity {
>
> u64 last_wakeup;
> u64 avg_overlap;
> + struct sched_entity *last_preempter;
>
> #ifdef CONFIG_SCHEDSTATS
> u64 wait_start;
> Index: linux-2.6.26.git/kernel/sched.c
> ===================================================================
> --- linux-2.6.26.git.orig/kernel/sched.c
> +++ linux-2.6.26.git/kernel/sched.c
> @@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_str
> p->se.prev_sum_exec_runtime = 0;
> p->se.last_wakeup = 0;
> p->se.avg_overlap = 0;
> + p->se.last_preempter = NULL;
>
> #ifdef CONFIG_SCHEDSTATS
> p->se.wait_start = 0;
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread* [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 13:08 ` Peter Zijlstra
@ 2008-06-07 14:54 ` Mike Galbraith
2008-06-07 16:12 ` Peter Zijlstra
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 14:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Greg Smith, Ingo Molnar, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 15:08 +0200, Peter Zijlstra wrote:
> On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
>
> Interesting.. Looks good.
In that case it _might_ fly, so needs changelog and blame line.
Tasks which awaken many clients can suffer terribly due to affine wakeup
preemption. This can (does for pgbench) lead to serialization of the entire
load on one CPU due to ever lowering throughput of the preempted waker and
constant affine wakeup of many preempters. Prevent this by noticing when
multi-task preemption is going on, ensure that the 1:N waker can always do
a reasonable batch of work, and temporarily restrict further affine wakeups.
Signed-off-by: Mike Galbraith <efault@gmx.de>
include/linux/sched.h | 1 +
kernel/sched.c | 1 +
kernel/sched_fair.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 44 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae0be3c..73b7d23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *last_preempter;
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index bfb8ad8..deb30e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;
+ p->se.last_preempter = NULL;
#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..4539a79 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
update_stats_dequeue(cfs_rq, se);
if (sleep) {
+ se->last_preempter = NULL;
update_avg_stats(cfs_rq, se);
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
- if (delta_exec > ideal_runtime)
+ if (delta_exec > ideal_runtime) {
+ curr->last_preempter = NULL;
resched_task(rq_of(cfs_rq)->curr);
+ }
}
static void
@@ -994,6 +997,7 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
unsigned int imbalance)
{
struct task_struct *curr = this_rq->curr;
+ struct sched_entity *se = &curr->se, *pse = &p->se;
unsigned long tl = this_load;
unsigned long tl_per_task;
int balanced;
@@ -1002,14 +1006,26 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
return 0;
/*
+ * If the current task is being wakeup preempted by multiple tasks
+ * that it awakened, such that it can't get significant work done
+ * between preemptions, try to spread these preemption sources.
+ */
+ if (sync && se->last_preempter && se->last_preempter != pse) {
+ u64 se_last_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (se_last_exec < sysctl_sched_migration_cost)
+ return 0;
+ }
+
+ /*
* If sync wakeup then subtract the (maximum possible)
* effect of the currently running task from the load
* of the current CPU:
*/
if (sync)
- tl -= current->se.load.weight;
+ tl -= se->load.weight;
- balanced = 100*(tl + p->se.load.weight) <= imbalance*load;
+ balanced = 100*(tl + pse->load.weight) <= imbalance*load;
/*
* If the currently running task will sleep within
@@ -1017,8 +1033,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
* woken task:
*/
if (sync && balanced && curr->sched_class == &fair_sched_class) {
- if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
- p->se.avg_overlap < sysctl_sched_migration_cost)
+ if (se->avg_overlap < sysctl_sched_migration_cost &&
+ pse->avg_overlap < sysctl_sched_migration_cost)
return 1;
}
@@ -1219,8 +1235,27 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
pse = parent_entity(pse);
}
- if (wakeup_preempt_entity(se, pse) == 1)
- resched_task(curr);
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ int preempt = 1;
+
+ /*
+ * If current task is being prempted by multiple wakees,
+ * tag it for 1:N affine wakeup preemption avoidance.
+ */
+ if (se->last_preempter && se->last_preempter != pse &&
+ se->load.weight >= pse->load.weight) {
+ u64 exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (exec < sysctl_sched_migration_cost)
+ preempt = 0;
+ }
+
+ if (se == ¤t->se)
+ se->last_preempter = pse;
+
+ if (preempt)
+ resched_task(curr);
+ }
}
static struct task_struct *pick_next_task_fair(struct rq *rq)
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 14:54 ` [patch part 2] " Mike Galbraith
@ 2008-06-07 16:12 ` Peter Zijlstra
2008-06-07 17:53 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-06-07 16:12 UTC (permalink / raw)
To: Mike Galbraith
Cc: Greg Smith, Ingo Molnar, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 16:54 +0200, Mike Galbraith wrote:
> On Sat, 2008-06-07 at 15:08 +0200, Peter Zijlstra wrote:
> > On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
> >
> > Interesting.. Looks good.
>
> In that case it _might_ fly, so needs changelog and blame line.
Just wondering, how much effect does the last_preempter stuff have?, it
seems to me the minimum runtime check ought to throttle these wakeups
quite a bit as well.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 16:12 ` Peter Zijlstra
@ 2008-06-07 17:53 ` Mike Galbraith
2008-06-07 18:19 ` Mike Galbraith
0 siblings, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 17:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Greg Smith, Ingo Molnar, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 18:12 +0200, Peter Zijlstra wrote:
> On Sat, 2008-06-07 at 16:54 +0200, Mike Galbraith wrote:
> > On Sat, 2008-06-07 at 15:08 +0200, Peter Zijlstra wrote:
> > > On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
> > >
> > > Interesting.. Looks good.
> >
> > In that case it _might_ fly, so needs changelog and blame line.
>
> Just wondering, how much effect does the last_preempter stuff have?, it
> seems to me the minimum runtime check ought to throttle these wakeups
> quite a bit as well.
Without last_preempter, you'd have all tasks having a minimum runtime.
That would harm the single cpu starve.c testcase for sure, and anything
like it. I wanted to target this pretty accurately to 1:N type loads.
If you mean no trying to disperse preempters, I can test without it.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-06-07 17:53 ` Mike Galbraith
@ 2008-06-07 18:19 ` Mike Galbraith
0 siblings, 0 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-06-07 18:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Greg Smith, Ingo Molnar, Dhaval Giani, lkml, Srivatsa Vaddagiri
On Sat, 2008-06-07 at 19:53 +0200, Mike Galbraith wrote:
> On Sat, 2008-06-07 at 18:12 +0200, Peter Zijlstra wrote:
> > Just wondering, how much effect does the last_preempter stuff have?, it
> > seems to me the minimum runtime check ought to throttle these wakeups
> > quite a bit as well.
>
> Without last_preempter, you'd have all tasks having a minimum runtime.
> That would harm the single cpu starve.c testcase for sure, and anything
> like it. I wanted to target this pretty accurately to 1:N type loads.
>
> If you mean no trying to disperse preempters, I can test without it.
pgbench
2.6.26-rc5+ 2.6.26-rc5+ with no disperse
1 10165.511814 10183.705908 10191.865953 10186.995546
2 14994.697875 15204.900479 15209.856474 15239.639522
3 16554.371722 17279.376443 16431.588533 15828.812843
4 18447.345925 18088.861169 15967.533533 16827.107528
5 20119.250823 18537.351094 17890.057368 18829.423686
6 21439.841579 22634.887824 18562.389387 18907.807327
8 25579.379337 25908.373483 19527.104304 19687.221241
10 23876.035623 22403.867804 22635.429472 20627.666899
15 23276.797649 23595.597093 22695.938882 22233.399329
20 23603.315133 23256.506240 22623.205980 22637.340746
30 23633.448266 23229.286697 22736.523283 22691.638135
40 22925.552706 23081.526954 20037.610595 22174.404351
50 19102.481374 19558.624434 21459.370223 21664.820102
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 10:10 ` Ingo Molnar
2008-05-23 10:15 ` Mike Galbraith
@ 2008-05-23 13:05 ` Mike Galbraith
2008-05-23 13:35 ` Mike Galbraith
1 sibling, 1 reply; 38+ messages in thread
From: Mike Galbraith @ 2008-05-23 13:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Greg Smith, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 2008-05-23 at 12:10 +0200, Ingo Molnar wrote:
> if it's other tweaks as well then could you perhaps try to make
> SCHED_BATCH batch more agressively?
Running SCHED_BATCH with only the below put a large dent in the problem.
You can have tl <= current->se.load.weight. Nothing good happens in
either case, at least with this load.
--- kernel/sched_fair.c.org 2008-05-23 14:59:39.000000000 +0200
+++ kernel/sched_fair.c 2008-05-23 14:49:05.000000000 +0200
@@ -1081,7 +1081,7 @@ wake_affine(struct rq *rq, struct sched_
* effect of the currently running task from the load
* of the current CPU:
*/
- if (sync)
+ if (sync && tl > current->se.load.weight)
tl -= current->se.load.weight;
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
2.6.26-smp x86_64
1 9209.503213
2 15792.406916
3 23369.199181
4 23140.108032
5 24556.515470
6 24926.457776
8 26896.607558
10 27350.988396
15 29005.426298
20 28558.267290
30 27002.328374
40 25809.202374
50 24589.478654
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 13:05 ` Mike Galbraith
@ 2008-05-23 13:35 ` Mike Galbraith
0 siblings, 0 replies; 38+ messages in thread
From: Mike Galbraith @ 2008-05-23 13:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Greg Smith, Peter Zijlstra, Dhaval Giani, lkml,
Srivatsa Vaddagiri
On Fri, 2008-05-23 at 15:05 +0200, Mike Galbraith wrote:
> On Fri, 2008-05-23 at 12:10 +0200, Ingo Molnar wrote:
>
> > if it's other tweaks as well then could you perhaps try to make
> > SCHED_BATCH batch more agressively?
>
> Running SCHED_BATCH with only the below put a large dent in the problem.
>
> You can have tl <= current->se.load.weight. Nothing good happens in
> either case, at least with this load.
>
> --- kernel/sched_fair.c.org 2008-05-23 14:59:39.000000000 +0200
> +++ kernel/sched_fair.c 2008-05-23 14:49:05.000000000 +0200
> @@ -1081,7 +1081,7 @@ wake_affine(struct rq *rq, struct sched_
> * effect of the currently running task from the load
> * of the current CPU:
> */
> - if (sync)
> + if (sync && tl > current->se.load.weight)
> tl -= current->se.load.weight;
>
> if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
>
>
>
> 2.6.26-smp x86_64
> 1 9209.503213
> 2 15792.406916
> 3 23369.199181
> 4 23140.108032
> 5 24556.515470
> 6 24926.457776
> 8 26896.607558
> 10 27350.988396
> 15 29005.426298
> 20 28558.267290
> 30 27002.328374
> 40 25809.202374
> 50 24589.478654
And without SCHED_BATCH
2.6.26-smp x86_64
1 8417.511252
2 15559.741472
3 23417.911087
4 21982.631084
5 24212.518114
6 21870.640050
8 25178.186022
10 27350.449792
15 27958.758943
20 28011.989131
30 26668.779045
40 24871.625107
50 23687.757456
So the primary low end problem is sync afine wakeups it seems.
-Mike
^ permalink raw reply [flat|nested] 38+ messages in thread
* PostgreSQL pgbench performance regression in 2.6.23+
@ 2008-05-23 11:26 Anton Petrusevich
2008-05-23 18:46 ` Greg Smith
0 siblings, 1 reply; 38+ messages in thread
From: Anton Petrusevich @ 2008-05-23 11:26 UTC (permalink / raw)
To: linux-kernel
Hi guys!
>> I.e. i think it's a perfectly fine answer to say "if your workload needs
>> batch scheduling, run it under SCHED_BATCH".
> Yes, and this appears to be such a case.
Excuse me for interrupting you, but am I getting this right: to run
effectively PostgreSQL with an active web app on the server a "system
administrator Joe" has to know about batch scheduling?
--
Anton Petrusevich
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: PostgreSQL pgbench performance regression in 2.6.23+
2008-05-23 11:26 Anton Petrusevich
@ 2008-05-23 18:46 ` Greg Smith
0 siblings, 0 replies; 38+ messages in thread
From: Greg Smith @ 2008-05-23 18:46 UTC (permalink / raw)
To: Anton Petrusevich; +Cc: lkml
On Fri, 23 May 2008, Anton Petrusevich wrote:
> Excuse me for interrupting you, but am I getting this right: to run
> effectively PostgreSQL with an active web app on the server a "system
> administrator Joe" has to know about batch scheduling?
That's not true. The problem here is specifically with the pgbench
program, which is a simple benchmarking tool included with PostgreSQL.
pgbench already had known scalability problems due to its inefficient
design and the new scheduler really doesn't play well with it by default.
It's the pgbench program that needs the batch scheduling, not the database
processes.
pgbench is certainly not representative of web application performance,
and it wouldn't be the first time there was some OS tweaking required to
get good results from such an artificial benchmark.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2008-06-07 18:19 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-21 17:34 PostgreSQL pgbench performance regression in 2.6.23+ Greg Smith
2008-05-22 7:10 ` Mike Galbraith
2008-05-22 8:28 ` Dhaval Giani
2008-05-22 9:05 ` Mike Galbraith
2008-05-22 10:34 ` Mike Galbraith
2008-05-22 11:25 ` Mike Galbraith
2008-05-22 11:44 ` Peter Zijlstra
2008-05-22 12:09 ` Mike Galbraith
2008-05-22 12:24 ` Peter Zijlstra
2008-05-22 13:16 ` Mike Galbraith
2008-05-23 7:13 ` Greg Smith
2008-05-23 10:00 ` Mike Galbraith
2008-05-23 10:10 ` Ingo Molnar
2008-05-23 10:15 ` Mike Galbraith
2008-05-23 23:18 ` Greg Smith
2008-05-23 23:46 ` Mike Galbraith
2008-05-24 8:08 ` Mike Galbraith
2008-05-27 0:28 ` Greg Smith
2008-05-27 5:59 ` [patch] " Mike Galbraith
2008-05-27 8:20 ` Mike Galbraith
2008-05-27 8:35 ` Mike Galbraith
2008-06-06 5:03 ` Greg Smith
2008-06-06 6:13 ` Mike Galbraith
2008-06-07 11:38 ` Mike Galbraith
2008-06-07 12:50 ` Mike Galbraith
2008-06-07 13:07 ` Peter Zijlstra
2008-06-07 14:16 ` Mike Galbraith
2008-06-07 16:16 ` Peter Zijlstra
2008-06-07 17:56 ` Mike Galbraith
2008-06-07 13:08 ` Peter Zijlstra
2008-06-07 14:54 ` [patch part 2] " Mike Galbraith
2008-06-07 16:12 ` Peter Zijlstra
2008-06-07 17:53 ` Mike Galbraith
2008-06-07 18:19 ` Mike Galbraith
2008-05-23 13:05 ` Mike Galbraith
2008-05-23 13:35 ` Mike Galbraith
-- strict thread matches above, loose matches on Subject: below --
2008-05-23 11:26 Anton Petrusevich
2008-05-23 18:46 ` Greg Smith
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox