Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

From: Ankur Arora <ankur.a.arora@oracle.com>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	tglx@linutronix.de, peterz@infradead.org,
	torvalds@linux-foundation.org, paulmck@kernel.org,
	rostedt@goodmis.org, mark.rutland@arm.com, juri.lelli@redhat.com,
	joel@joelfernandes.org, raghavendra.kt@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
Date: Sat, 01 Jun 2024 04:47:09 -0700
Message-ID: <8734pw51he.fsf@oracle.com>
In-Reply-To: <2d6ef6d8-6aef-4703-a9c7-90501537cdc5@linux.ibm.com>


Shrikanth Hegde <sshegde@linux.ibm.com> writes:

> On 5/28/24 6:04 AM, Ankur Arora wrote:
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. Unlike PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> v2 mostly reworks v1, with one of the main changes having less
>> noisy need-resched-lazy related interfaces.
>> More details in the changelog below.
>>
>
> Hi Ankur. Thanks for the series.
>
> nit: had to manually patch 11, 12 and 13 since they didn't apply cleanly on
> tip/master and tip/sched/core. Mostly due to some word differences in the change.
>
> tip/master was at:
> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
> Merge: 5d145493a139 47ff30cc1be7
> Author: Ingo Molnar <mingo@kernel.org>
> Date:   Tue May 28 12:44:26 2024 +0200
>
>     Merge branch into tip/master: 'x86/percpu'
>
>
>
>> The v1 of the series is at [4] and the RFC at [5].
>>
>> Design
>> ==
>>
>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>> PREEMPT_COUNT). This means that the scheduler can always safely
>> preempt. (This is identical to CONFIG_PREEMPT.)
>>
>> Having that, the next step is to make the rescheduling policy dependent
>> on the chosen scheduling model. Currently, the scheduler uses a single
>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>> reschedule is needed.
>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>> scheduler to express two kinds of rescheduling intent: schedule at
>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>> rescheduling while allowing the task on the runqueue to run to
>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>
>> The scheduler decides which need-resched bits are chosen based on
>> the preemption model in use:
>>
>> 	       TIF_NEED_RESCHED        TIF_NEED_RESCHED_LAZY
>>
>> none		never   		always [*]
>> voluntary       higher sched class	other tasks [*]
>> full 		always                  never
>>
>> [*] some details elided.
>>
>> The last part of the puzzle is, when does preemption happen, or
>> alternately stated, when are the need-resched bits checked:
>>
>>                  exit-to-user    ret-to-kernel    preempt_count()
>>
>> NEED_RESCHED_LAZY     Y               N                N
>> NEED_RESCHED          Y               Y                Y
>>
>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>> none/voluntary preemption policies are in effect. And eager semantics
>> under full preemption.
>>
>> In addition, since this is driven purely by the scheduler (not
>> depending on cond_resched() placement and the like), there is enough
>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>> simply upgrading to a full NEED_RESCHED which can use more coercive
>> instruments like resched IPI to induce a context-switch.
>>
>> Performance
>> ==
>> The performance in the basic tests (perf bench sched messaging, kernbench,
>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>> (See patches
>>   "sched: support preempt=none under PREEMPT_AUTO"
>>   "sched: support preempt=full under PREEMPT_AUTO"
>>   "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>
>> For a macro test, a colleague in Oracle's Exadata team tried two
>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>> backported.)
>>
>> In both tests the data was cached on remote nodes (cells), and the
>> database nodes (compute) served client queries, with clients being
>> local in the first test and remote in the second.
>>
>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>
>>
>> 				  PREEMPT_VOLUNTARY                        PREEMPT_AUTO
>> 				                                        (preempt=voluntary)
>>                               ==============================      =============================
>>                       clients  throughput    cpu-usage            throughput     cpu-usage         Gain
>>                                (tx/min)    (utime %/stime %)      (tx/min)    (utime %/stime %)
>> 		      -------  ----------  -----------------      ----------  -----------------   -------
>>
>>
>>   OLTP                  384     9,315,653     25/ 6                9,253,252       25/ 6            -0.7%
>>   benchmark	       1536    13,177,565     50/10               13,657,306       50/10            +3.6%
>>  (local clients)       3456    14,063,017     63/12               14,179,706       64/12            +0.8%
>>
>>
>>   OLTP                   96     8,973,985     17/ 2                8,924,926       17/ 2            -0.5%
>>   benchmark	        384    22,577,254     60/ 8               22,211,419       59/ 8            -1.6%
>>  (remote clients,      2304    25,882,857     82/11               25,536,100       82/11            -1.3%
>>   90/10 RW ratio)
>>
>>
>> (Both sets of tests have a fair amount of NW traffic since the query
>> tables etc are cached on the cells. Additionally, the first set,
>> given the local clients, stress the scheduler a bit more than the
>> second.)
>>
>> The comparative performance for both the tests is fairly close,
>> more or less within a margin of error.
>>
>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu,  512GB RAM):
>>
>> "
>>  a) Base kernel (6.7),
>>  b) v1, PREEMPT_AUTO, preempt=voluntary
>>  c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>  d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>
>>  Workloads I tested and their %gain,
>>                     case b           case c       case d
>>  NAS                +2.7%              +1.9%         +2.1%
>>  Hashjoin,          +0.0%              +0.0%         +0.0%
>>  Graph500,          -6.0%              +0.0%         +0.0%
>>  XSBench            +1.7%              +0.0%         +1.2%
>>
>>  (Note about the Graph500 numbers at [8].)
>>
>>  Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>  much difference.
>> "
>>
>> One case where there is a significant performance drop is on powerpc,
>> seen running hackbench on a 320 core system (a test on a smaller system is
>> fine.) In theory there's no reason for this to only happen on powerpc
>> since most of the code is common, but I haven't been able to reproduce
>> it on x86 so far.
>>
>> All in all, I think the tests above show that this scheduling model has legs.
>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>> different enough from the current none/voluntary models that there
>> likely are workloads where performance would be subpar. That needs more
>> extensive testing to figure out the weak points.
>>
>>
>>
> I tested it again on PowerPC. Unfortunately the numbers show there is still a
> regression compared to 6.10-rc1. This is with preempt=none. I tried again on the
> smaller system too to confirm. For now I have done the comparison only for hackbench,
> where the highest regression was seen in v1.
>
> perf stat collected over 20 iterations shows more context switches and more migrations.
> Could it be that the LAZY bit is causing more context switches? Or could it be something
> else? Could it be that more exit-to-user happens on PowerPC? Will continue to debug.

Thanks for trying it out.

As you point out, context-switches and migrations are significantly higher.

Definitely unexpected. I ran the same test on an x86 box
(Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.

  6.9.0/none.process.pipe.60:       170,719,761      context-switches          #    0.022 M/sec                    ( +-  0.19% )
  6.9.0/none.process.pipe.60:        16,871,449      cpu-migrations            #    0.002 M/sec                    ( +-  0.16% )
  6.9.0/none.process.pipe.60:      30.833112186 seconds time elapsed                                          ( +-  0.11% )

  6.9.0-00035-gc90017e055a6/none.process.pipe.60:       177,889,639      context-switches          #    0.023 M/sec                    ( +-  0.21% )
  6.9.0-00035-gc90017e055a6/none.process.pipe.60:        17,426,670      cpu-migrations            #    0.002 M/sec                    ( +-  0.41% )
  6.9.0-00035-gc90017e055a6/none.process.pipe.60:      30.731126312 seconds time elapsed                                          ( +-  0.07% )
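For reference, the relative differences between the two x86 runs can be
worked out directly from the counters above. The following is only a
minimal sketch: the dictionary layout and the helper names
(percent_change, compare) are made up for illustration, and the values
are copied by hand from the perf stat output.

```python
# Counter values copied from the x86 perf stat output above (preempt=none).
BASELINE = {  # 6.9.0
    "context-switches": 170_719_761,
    "cpu-migrations": 16_871_449,
    "seconds elapsed": 30.833112186,
}
PATCHED = {  # 6.9.0-00035-gc90017e055a6
    "context-switches": 177_889_639,
    "cpu-migrations": 17_426_670,
    "seconds elapsed": 30.731126312,
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(baseline, patched):
    """Percent change for every counter present in the baseline."""
    return {name: percent_change(baseline[name], patched[name])
            for name in baseline}


if __name__ == "__main__":
    for name, delta in compare(BASELINE, PATCHED).items():
        print(f"{name:18s} {delta:+.2f}%")
```

On these numbers the context-switch delta comes out a little above 4%,
the migration delta a little above 3%, and the elapsed time is
essentially unchanged.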

Clearly there's something different going on on powerpc. I'm travelling
right now, but will dig deeper into this once I get back.

Meanwhile can you check if the increased context-switches are voluntary or
involuntary (or what the division is)?
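
One possible way to get that split, sketched here under some assumptions:
the kernel exports per-task voluntary_ctxt_switches and
nonvoluntary_ctxt_switches counters in /proc/<pid>/status, so they can be
sampled while the benchmark is running. The helper names below
(parse_ctxt_switches, task_ctxt_switches) are hypothetical, and since
hackbench tasks are short-lived the sampling would have to happen during
the run and be summed over all of its tasks.

```python
from pathlib import Path

FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_ctxt_switches(status_text):
    """Extract (voluntary, involuntary) counts from /proc/<pid>/status text.

    Fields that are missing from the text are reported as 0.
    """
    counts = dict.fromkeys(FIELDS, 0)
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in counts:
            counts[key] = int(value)  # int() ignores the surrounding whitespace
    return counts[FIELDS[0]], counts[FIELDS[1]]


def task_ctxt_switches(pid):
    """Sum the counters over all threads of one process (Linux only)."""
    voluntary = involuntary = 0
    for status in Path(f"/proc/{pid}/task").glob("*/status"):
        v, n = parse_ctxt_switches(status.read_text())
        voluntary += v
        involuntary += n
    return voluntary, involuntary
```

Calling task_ctxt_switches() on each hackbench PID on both kernels and
comparing the two sums should show which kind of switch accounts for the
increase.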


Thanks
Ankur

> Meanwhile, will do more test with other micro-benchmarks and post the results.
>
>
> More details below.
> CONFIG_HZ = 100
> ./hackbench -pipe 60 process 100000 loops
>
> ====================================================================================
> On the larger system. (40 Cores, 320CPUS)
> ====================================================================================
> 				6.10-rc1		+preempt_auto
> 				preempt=none		preempt=none
> 20 iterations avg value
> hackbench pipe(60)		26.403			32.368 ( -31.1%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
>  Performance counter stats for 'system wide' (20 runs):
>     168,980,939.76 msec cpu-clock                        # 6400.026 CPUs utilized               ( +-  6.59% )
>      6,299,247,371      context-switches                 #   70.596 K/sec                       ( +-  6.60% )
>        246,646,236      cpu-migrations                   #    2.764 K/sec                       ( +-  6.57% )
>          1,759,232      page-faults                      #   19.716 /sec                        ( +-  6.61% )
> 577,719,907,794,874      cycles                           #    6.475 GHz                         ( +-  6.60% )
> 226,392,778,622,410      instructions                     #    0.74  insn per cycle              ( +-  6.61% )
> 37,280,192,946,445      branches                         #  417.801 M/sec                       ( +-  6.61% )
>    166,456,311,053      branch-misses                    #    0.85% of all branches             ( +-  6.60% )
>
>             26.403 +- 0.166 seconds time elapsed  ( +-  0.63% )
>
> ++++++++++++
> preempt auto
> ++++++++++++
>  Performance counter stats for 'system wide' (20 runs):
>     207,154,235.95 msec cpu-clock                        # 6400.009 CPUs utilized               ( +-  6.64% )
>      9,337,462,696      context-switches                 #   85.645 K/sec                       ( +-  6.68% )
>        631,276,554      cpu-migrations                   #    5.790 K/sec                       ( +-  6.79% )
>          1,756,583      page-faults                      #   16.112 /sec                        ( +-  6.59% )
> 700,281,729,230,103      cycles                           #    6.423 GHz                         ( +-  6.64% )
> 254,713,123,656,485      instructions                     #    0.69  insn per cycle              ( +-  6.63% )
> 42,275,061,484,512      branches                         #  387.756 M/sec                       ( +-  6.63% )
>    231,944,216,106      branch-misses                    #    1.04% of all branches             ( +-  6.64% )
>
>             32.368 +- 0.200 seconds time elapsed  ( +-  0.62% )
>
>
> ============================================================================================
> Smaller system ( 12Cores, 96CPUS)
> ============================================================================================
> 				6.10-rc1		+preempt_auto
> 				preempt=none		preempt=none
> 20 iterations avg value
> hackbench pipe(60)		55.930			65.75 ( -17.6%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
>  Performance counter stats for 'system wide' (20 runs):
>     107,386,299.19 msec cpu-clock                        # 1920.003 CPUs utilized               ( +-  6.55% )
>      1,388,830,542      context-switches                 #   24.536 K/sec                       ( +-  6.19% )
>         44,538,641      cpu-migrations                   #  786.840 /sec                        ( +-  6.23% )
>          1,698,710      page-faults                      #   30.010 /sec                        ( +-  6.58% )
> 412,401,110,929,055      cycles                           #    7.286 GHz                         ( +-  6.54% )
> 192,380,094,075,743      instructions                     #    0.88  insn per cycle              ( +-  6.59% )
> 30,328,724,557,878      branches                         #  535.801 M/sec                       ( +-  6.58% )
>     99,642,840,901      branch-misses                    #    0.63% of all branches             ( +-  6.57% )
>
>             55.930 +- 0.509 seconds time elapsed  ( +-  0.91% )
>
>
> +++++++++++++++++
> v2_preempt_auto
> +++++++++++++++++
>  Performance counter stats for 'system wide' (20 runs):
>     126,244,029.04 msec cpu-clock                        # 1920.005 CPUs utilized               ( +-  6.51% )
>      2,563,720,294      context-switches                 #   38.356 K/sec                       ( +-  6.10% )
>        147,445,392      cpu-migrations                   #    2.206 K/sec                       ( +-  6.37% )
>          1,710,637      page-faults                      #   25.593 /sec                        ( +-  6.55% )
> 483,419,889,144,017      cycles                           #    7.232 GHz                         ( +-  6.51% )
> 210,788,030,476,548      instructions                     #    0.82  insn per cycle              ( +-  6.57% )
> 33,851,562,301,187      branches                         #  506.454 M/sec                       ( +-  6.56% )
>    134,059,721,699      branch-misses                    #    0.75% of all branches             ( +-  6.45% )
>
>              65.75 +- 1.06 seconds time elapsed  ( +-  1.61% )

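So, the context-switches are meaningfully higher: about 1.48x on the larger
system and 1.85x on the smaller one, with cpu-migrations up about 2.56x and
3.31x respectively.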
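
A rough way to put the quoted powerpc counters side by side, as a sketch
only: the table layout and the helper names (growth, migration_share,
summarize) are made up for illustration, and the values are copied by hand
from the perf stat output quoted above. Besides the growth factor, it also
computes migrations per context switch, which is just one possible way of
looking at the data.

```python
# (context-switches, cpu-migrations) copied from the quoted perf stat
# output, 20 runs each, preempt=none on both kernels.
RUNS = {
    "large (320 CPUs)": {
        "6.10-rc1": (6_299_247_371, 246_646_236),
        "preempt_auto": (9_337_462_696, 631_276_554),
    },
    "small (96 CPUs)": {
        "6.10-rc1": (1_388_830_542, 44_538_641),
        "preempt_auto": (2_563_720_294, 147_445_392),
    },
}


def growth(old, new):
    """Factor by which a counter grew from old to new."""
    return new / old


def migration_share(switches, migrations):
    """Fraction of context switches that also migrated the task."""
    return migrations / switches


def summarize(runs):
    """Per-system growth factors and migration shares."""
    out = {}
    for system, kernels in runs.items():
        base_cs, base_mig = kernels["6.10-rc1"]
        auto_cs, auto_mig = kernels["preempt_auto"]
        out[system] = {
            "cs_growth": growth(base_cs, auto_cs),
            "mig_growth": growth(base_mig, auto_mig),
            "base_share": migration_share(base_cs, base_mig),
            "auto_share": migration_share(auto_cs, auto_mig),
        }
    return out


if __name__ == "__main__":
    for system, s in summarize(RUNS).items():
        print(f"{system}: context-switches x{s['cs_growth']:.2f}, "
              f"migrations x{s['mig_growth']:.2f}, "
              f"migrations/switch {s['base_share']:.1%} -> {s['auto_share']:.1%}")
```

On both systems the migrations grow faster than the context switches, so
the share of switches that also migrate the task goes up as well.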

--
ankur
