[BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
@ 2023-09-10 20:14 Joel Fernandes
  2023-09-10 21:16 ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-10 20:14 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Paul E. McKenney, rcu

On Fri, Sep 08, 2023 at 09:32:35AM -0400, Joel Fernandes wrote:
> On Fri, Sep 8, 2023 at 7:41 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Fri, Sep 08, 2023 at 01:27:06AM -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 07, 2023 at 08:51:43PM -0400, Joel Fernandes wrote:
> > > > On Thu, Sep 7, 2023 at 4:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Sep 7, 2023, at 12:23 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, Sep 07, 2023 at 09:17:15AM -0400, Joel Fernandes wrote:
> > > > > >> Hi,
> > > > > >> Just started seeing this on 6.5 stable. It is new and first occurrence:
> > > > > >>
> > > > > >> TREE04 no success message, 234 successful version messages
> > > > > >> [033mWARNING:  [mTREE04 GP HANG at 14 torture stat 2
> > > > > >> [   38.371120] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g1253
> > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > >> [   38.388342] Call Trace:
> > > > > >> [   53.741039] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g3637
> > > > > >> f0x2 ->state 0x2 cpu 6
> > > > > >> [   69.093462] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g5501
> > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > >> [   84.450028] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g10505
> > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > >> [   99.815871] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g13781
> > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > >> [  115.166476] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g16544
> > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > >> [  130.550116] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g18941
> > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > >> [..]
> > > > > >>
> > > > > >> All logs:
> > > > > >> http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-6.5.y/17/artifact/tools/testing/selftests/rcutorture/res/2023.09.07-04.10.25/TREE04/
> > > > > >
> > > > > > Huh.  Does this happen for you in v6.5 mainline?
> > > > > >
> > > > > > Both the code under test (full-state polled grace periods) and the
> > > > > > rcutorture test code are fairly new, so there is some reason for general
> > > > > > suspicion.  ;-)
> > > > >
> > > > > Ah. I never saw it on either 6.5 mainline or stable till today. Even on stable
> > > > > I only ever saw it this once. On mainline I have not seen it yet but I do test
> > > > > stable much more since I have been on stable maintenance duty ;-).
> > > >
> > > > I did a couple of long runs and I am not able to reproduce it anymore. :-/
> > >
> > > I know that feeling!
> >
> > Same here, this is after all the reason why we keep the tick dependency within
> > the hotplug process without really knowing why :o)
> 
> Heh. I have been running into another intermittent one as well which
> is the boost failure and that happens once in 10-15 runs or so.
> 
> I was thinking of running the following configuration on an automated
> regular basis to at least provide a better clue on the lucky run that
> catches an issue. But then the issue is it would change timing enough
> to maybe hide bugs. I could also make it submit logs automatically to
> the list on such occurrences, but one step at a time and all that.  I
> do need to add (hopefully less noisy) tick/timer related trace events.
> 
> # Define the bootargs array
> bootargs=(
>     "ftrace_dump_on_oops"
>     "panic_on_warn=1"
>     "sysctl.kernel.panic_on_rcu_stall=1"
>     "sysctl.kernel.max_rcu_stall_to_panic=1"
>     "trace_buf_size=10K"
>     "traceoff_on_warning=1"
>     "panic_print=0x1f"      # To dump held locks, mem and other info.
> )
> # Define the trace events array passed to bootargs.
> trace_events=(
>     "sched:sched_switch"
>     "sched:sched_waking"
>     "rcu:rcu_callback"
>     "rcu:rcu_fqs"
>     "rcu:rcu_quiescent_state_report"
>     "rcu:rcu_grace_period"
> )

So some insight on this boost failure. Just before the boost failures are
reported, I see the migration thread interferring with the rcu_preempt thread
(aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
runnable while context switching, which means its execution is interferred.
The rcu_preempt thread is at RT prio 2 as can be seen.

So some open-ended questions: what exactly does the migration thread want,
this is something related to CPU hotplug? And if the migration thread had to
run, why did the rcu_preempt thread not get pushed to another CPU by the
scheduler? We have 16 vCPUs for this test.

IMHO, this might be unrelated to boost but just that the boost test detects
a stall in RCU 'sooner'. I am planning to add more tracing to the migration
thread to trace down, but I'm afraid I am hitting some limits of what the
console is capable of in terms of accepting trace dumping. I am seeing a lot
of printk messages dropped.

Maybe if I reduce RCU stall detector threshold to 4 seconds or so, I might

see this even without boost testing, I feel.
migratio-66        8d..2. 1544321393us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
<idle>-0           8d..2. 1544321890us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        8d..4. 1544321892us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
rcu_pree-13        8d..2. 1544321895us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
migratio-66        8d..2. 1544321903us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
<idle>-0           8d..2. 1544322121us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        8d..4. 1544322123us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
rcu_pree-13        8d..2. 1544322126us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
migratio-66        8d..2. 1544322134us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
<idle>-0           8d..2. 1544322495us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        8d..4. 1544322497us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
rcu_pree-13        8d..2. 1544322500us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
migratio-66        8d..2. 1544322508us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
<idle>-0           8d..2. 1544322978us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        8d..4. 1544322980us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
rcu_pree-13        8d..2. 1544322984us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
migratio-66        8d..2. 1544322992us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
<idle>-0           8d..2. 1544323301us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        8d..4. 1544323303us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
rcu_pree-13        8d..2. 1544323306us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
migratio-66        8d..2. 1544323314us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
<idle>-0           8d..2. 1544323673us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        8d..4. 1544323675us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
rcu_pree-13        8d..2. 1544323678us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
migratio-66        8d..2. 1544323686us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
rcu_tort-1726     12..... 1544323807us : rcu_torture_boost_failed: Boost inversion thread ->rt_priority 1 gp_state 1242792 jiffies 2503
rcu_tort-1723      1..... 1544323840us : rcu_torture_boost_failed: Boost failed
rcu_tort-1743      4..... 1544323844us : rcu_torture_boost_failed: Boost failed
rcu_tort-1715      3..... 1544323848us : rcu_torture_boost_failed: Boost failed
rcu_tort-1713      7..... 1544323852us : rcu_torture_boost_failed: Boost failed

(At this point a message says tracing was disabled due to a boost failure
warning due to my use of traceoff_on_warning=1)

Side note: I was talking to Steve the other day that it'd be nice to extract
the trace buffer in a better way than console. Perhaps we can do a core dump
from rcutorture? I think 'crash' can already extract ftrace [1] from core
dumps and Qemu can generate core dumps via QMP.

[1] https://access.redhat.com/solutions/239433

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-10 20:14 [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..) Joel Fernandes
@ 2023-09-10 21:16 ` Paul E. McKenney
  2023-09-10 23:37   ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-10 21:16 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> On Fri, Sep 08, 2023 at 09:32:35AM -0400, Joel Fernandes wrote:
> > On Fri, Sep 8, 2023 at 7:41 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > On Fri, Sep 08, 2023 at 01:27:06AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 07, 2023 at 08:51:43PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Sep 7, 2023 at 4:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Sep 7, 2023, at 12:23 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu, Sep 07, 2023 at 09:17:15AM -0400, Joel Fernandes wrote:
> > > > > > >> Hi,
> > > > > > >> Just started seeing this on 6.5 stable. It is new and first occurrence:
> > > > > > >>
> > > > > > >> TREE04 no success message, 234 successful version messages
> > > > > > >> [033mWARNING:  [mTREE04 GP HANG at 14 torture stat 2
> > > > > > >> [   38.371120] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g1253
> > > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > > >> [   38.388342] Call Trace:
> > > > > > >> [   53.741039] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g3637
> > > > > > >> f0x2 ->state 0x2 cpu 6
> > > > > > >> [   69.093462] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g5501
> > > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > > >> [   84.450028] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g10505
> > > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > > >> [   99.815871] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g13781
> > > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > > >> [  115.166476] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g16544
> > > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > > >> [  130.550116] ??? Writer stall state RTWS_COND_SYNC_FULL(10) g18941
> > > > > > >> f0x0 ->state 0x2 cpu 6
> > > > > > >> [..]
> > > > > > >>
> > > > > > >> All logs:
> > > > > > >> http://box.joelfernandes.org:9080/job/rcutorture_stable/job/linux-6.5.y/17/artifact/tools/testing/selftests/rcutorture/res/2023.09.07-04.10.25/TREE04/
> > > > > > >
> > > > > > > Huh.  Does this happen for you in v6.5 mainline?
> > > > > > >
> > > > > > > Both the code under test (full-state polled grace periods) and the
> > > > > > > rcutorture test code are fairly new, so there is some reason for general
> > > > > > > suspicion.  ;-)
> > > > > >
> > > > > > Ah. I never saw it on either 6.5 mainline or stable till today. Even on stable
> > > > > > I only ever saw it this once. On mainline I have not seen it yet but I do test
> > > > > > stable much more since I have been on stable maintenance duty ;-).
> > > > >
> > > > > I did a couple of long runs and I am not able to reproduce it anymore. :-/
> > > >
> > > > I know that feeling!
> > >
> > > Same here, this is after all the reason why we keep the tick dependency within
> > > the hotplug process without really knowing why :o)
> > 
> > Heh. I have been running into another intermittent one as well which
> > is the boost failure and that happens once in 10-15 runs or so.
> > 
> > I was thinking of running the following configuration on an automated
> > regular basis to at least provide a better clue on the lucky run that
> > catches an issue. But then the issue is it would change timing enough
> > to maybe hide bugs. I could also make it submit logs automatically to
> > the list on such occurrences, but one step at a time and all that.  I
> > do need to add (hopefully less noisy) tick/timer related trace events.
> > 
> > # Define the bootargs array
> > bootargs=(
> >     "ftrace_dump_on_oops"
> >     "panic_on_warn=1"
> >     "sysctl.kernel.panic_on_rcu_stall=1"
> >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> >     "trace_buf_size=10K"
> >     "traceoff_on_warning=1"
> >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > )
> > # Define the trace events array passed to bootargs.
> > trace_events=(
> >     "sched:sched_switch"
> >     "sched:sched_waking"
> >     "rcu:rcu_callback"
> >     "rcu:rcu_fqs"
> >     "rcu:rcu_quiescent_state_report"
> >     "rcu:rcu_grace_period"
> > )
> 
> So some insight on this boost failure. Just before the boost failures are
> reported, I see the migration thread interferring with the rcu_preempt thread
> (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> runnable while context switching, which means its execution is interferred.
> The rcu_preempt thread is at RT prio 2 as can be seen.
> 
> So some open-ended questions: what exactly does the migration thread want,
> this is something related to CPU hotplug? And if the migration thread had to
> run, why did the rcu_preempt thread not get pushed to another CPU by the
> scheduler? We have 16 vCPUs for this test.

Maybe we need a cpus_read_lock() before doing a given boost-test interval
and a cpus_read_unlock() after finishing one?  But much depends on
exactly what is starting those migration threads.

Then again, TREE03 is pretty aggressive about doing CPU hotplug.

> IMHO, this might be unrelated to boost but just that the boost test detects
> a stall in RCU 'sooner'. I am planning to add more tracing to the migration
> thread to trace down, but I'm afraid I am hitting some limits of what the
> console is capable of in terms of accepting trace dumping. I am seeing a lot
> of printk messages dropped.

Quite possibly.

> Maybe if I reduce RCU stall detector threshold to 4 seconds or so, I might
> 
> see this even without boost testing, I feel.

Definitely something to try.

> migratio-66        8d..2. 1544321393us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> <idle>-0           8d..2. 1544321890us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        8d..4. 1544321892us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
> rcu_pree-13        8d..2. 1544321895us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
> migratio-66        8d..2. 1544321903us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> <idle>-0           8d..2. 1544322121us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        8d..4. 1544322123us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
> rcu_pree-13        8d..2. 1544322126us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
> migratio-66        8d..2. 1544322134us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> <idle>-0           8d..2. 1544322495us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        8d..4. 1544322497us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
> rcu_pree-13        8d..2. 1544322500us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
> migratio-66        8d..2. 1544322508us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> <idle>-0           8d..2. 1544322978us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        8d..4. 1544322980us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
> rcu_pree-13        8d..2. 1544322984us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
> migratio-66        8d..2. 1544322992us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> <idle>-0           8d..2. 1544323301us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        8d..4. 1544323303us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
> rcu_pree-13        8d..2. 1544323306us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
> migratio-66        8d..2. 1544323314us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> <idle>-0           8d..2. 1544323673us : sched_switch: prev_comm=swapper/8 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        8d..4. 1544323675us : sched_waking: comm=migration/8 pid=66 prio=0 target_cpu=008
> rcu_pree-13        8d..2. 1544323678us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/8 next_pid=66 next_prio=0
> migratio-66        8d..2. 1544323686us : sched_switch: prev_comm=migration/8 prev_pid=66 prev_prio=0 prev_state=S ==> next_comm=swapper/8 next_pid=0 next_prio=120
> rcu_tort-1726     12..... 1544323807us : rcu_torture_boost_failed: Boost inversion thread ->rt_priority 1 gp_state 1242792 jiffies 2503
> rcu_tort-1723      1..... 1544323840us : rcu_torture_boost_failed: Boost failed
> rcu_tort-1743      4..... 1544323844us : rcu_torture_boost_failed: Boost failed
> rcu_tort-1715      3..... 1544323848us : rcu_torture_boost_failed: Boost failed
> rcu_tort-1713      7..... 1544323852us : rcu_torture_boost_failed: Boost failed
> 
> (At this point a message says tracing was disabled due to a boost failure
> warning due to my use of traceoff_on_warning=1)
> 
> Side note: I was talking to Steve the other day that it'd be nice to extract
> the trace buffer in a better way than console. Perhaps we can do a core dump
> from rcutorture? I think 'crash' can already extract ftrace [1] from core
> dumps and Qemu can generate core dumps via QMP.
> 
> [1] https://access.redhat.com/solutions/239433

That would be a good option to have.  I have never tried it, so I can't say
what the plusses and minuses might be.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-10 21:16 ` Paul E. McKenney
@ 2023-09-10 23:37   ` Joel Fernandes
  2023-09-11  2:27     ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-10 23:37 UTC (permalink / raw)
  To: paulmck; +Cc: Frederic Weisbecker, rcu

On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
[...]
> > >  I have been running into another intermittent one as well which
> > > is the boost failure and that happens once in 10-15 runs or so.
> > >
> > > I was thinking of running the following configuration on an automated
> > > regular basis to at least provide a better clue on the lucky run that
> > > catches an issue. But then the issue is it would change timing enough
> > > to maybe hide bugs. I could also make it submit logs automatically to
> > > the list on such occurrences, but one step at a time and all that.  I
> > > do need to add (hopefully less noisy) tick/timer related trace events.
> > >
> > > # Define the bootargs array
> > > bootargs=(
> > >     "ftrace_dump_on_oops"
> > >     "panic_on_warn=1"
> > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > >     "trace_buf_size=10K"
> > >     "traceoff_on_warning=1"
> > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > )
> > > # Define the trace events array passed to bootargs.
> > > trace_events=(
> > >     "sched:sched_switch"
> > >     "sched:sched_waking"
> > >     "rcu:rcu_callback"
> > >     "rcu:rcu_fqs"
> > >     "rcu:rcu_quiescent_state_report"
> > >     "rcu:rcu_grace_period"
> > > )
> >
> > So some insight on this boost failure. Just before the boost failures are
> > reported, I see the migration thread interferring with the rcu_preempt thread
> > (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> > runnable while context switching, which means its execution is interferred.
> > The rcu_preempt thread is at RT prio 2 as can be seen.
> >
> > So some open-ended questions: what exactly does the migration thread want,
> > this is something related to CPU hotplug? And if the migration thread had to
> > run, why did the rcu_preempt thread not get pushed to another CPU by the
> > scheduler? We have 16 vCPUs for this test.
>
> Maybe we need a cpus_read_lock() before doing a given boost-test interval
> and a cpus_read_unlock() after finishing one?  But much depends on
> exactly what is starting those migration threads.

But in the field, a real RT task can preempt a reader without doing
cpus_read_lock() and may run into a similar boost issue?

> Then again, TREE03 is pretty aggressive about doing CPU hotplug.

Ok. I put a trace_printk() in the stopper thread to see what the
->fn() is. I'm doing another run to see what falls out.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-10 23:37   ` Joel Fernandes
@ 2023-09-11  2:27     ` Joel Fernandes
  2023-09-11  8:16       ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-11  2:27 UTC (permalink / raw)
  To: paulmck; +Cc: Frederic Weisbecker, rcu

On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> [...]
> > > >  I have been running into another intermittent one as well which
> > > > is the boost failure and that happens once in 10-15 runs or so.
> > > >
> > > > I was thinking of running the following configuration on an automated
> > > > regular basis to at least provide a better clue on the lucky run that
> > > > catches an issue. But then the issue is it would change timing enough
> > > > to maybe hide bugs. I could also make it submit logs automatically to
> > > > the list on such occurrences, but one step at a time and all that.  I
> > > > do need to add (hopefully less noisy) tick/timer related trace events.
> > > >
> > > > # Define the bootargs array
> > > > bootargs=(
> > > >     "ftrace_dump_on_oops"
> > > >     "panic_on_warn=1"
> > > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > > >     "trace_buf_size=10K"
> > > >     "traceoff_on_warning=1"
> > > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > > )
> > > > # Define the trace events array passed to bootargs.
> > > > trace_events=(
> > > >     "sched:sched_switch"
> > > >     "sched:sched_waking"
> > > >     "rcu:rcu_callback"
> > > >     "rcu:rcu_fqs"
> > > >     "rcu:rcu_quiescent_state_report"
> > > >     "rcu:rcu_grace_period"
> > > > )
> > >
> > > So some insight on this boost failure. Just before the boost failures are
> > > reported, I see the migration thread interferring with the rcu_preempt thread
> > > (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> > > runnable while context switching, which means its execution is interferred.
> > > The rcu_preempt thread is at RT prio 2 as can be seen.
> > >
> > > So some open-ended questions: what exactly does the migration thread want,
> > > this is something related to CPU hotplug? And if the migration thread had to
> > > run, why did the rcu_preempt thread not get pushed to another CPU by the
> > > scheduler? We have 16 vCPUs for this test.
> >
> > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > and a cpus_read_unlock() after finishing one?  But much depends on
> > exactly what is starting those migration threads.
> 
> But in the field, a real RT task can preempt a reader without doing
> cpus_read_lock() and may run into a similar boost issue?
> 
> > Then again, TREE03 is pretty aggressive about doing CPU hotplug.
> 
> Ok. I put a trace_printk() in the stopper thread to see what the
> ->fn() is. I'm doing another run to see what falls out.

I think I am getting somewhere, It looks like the migration thread tries to
push the task out due to __balance_push_cpu_stop. Clearly it fails because
the rcu_preempt task is still on CPU 3. And the whole thing is live locked
retrying the same.

I am not sure if the migration thread is trying to migrate rcu_preempt
because the CPU is going down or because a higher priority RT task is
running.

rcu_pree-13        3d..4. 2936723336us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
rcu_pree-13        3d..2. 2936723338us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
migratio-34        3..... 2936723341us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
migratio-34        3d..2. 2936723345us : sched_switch: prev_comm=migration/3 prev_pid=34 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
<idle>-0         3d..2. 2936723676us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
rcu_pree-13        3d..4. 2936723678us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
rcu_pree-13        3d..2. 2936723680us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
migratio-34        3..... 2936723683us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3

I am digging deeper to see why the rcu_preempt thread cannot be pushed out
and then I'll also look at why is it being pushed out in the first place.

At least I have a strong repro now running 5 instances of TREE03 in parallel
for several hours.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-11  2:27     ` Joel Fernandes
@ 2023-09-11  8:16       ` Paul E. McKenney
  2023-09-11 13:17         ` Joel Fernandes
  2023-09-13 20:30         ` Joel Fernandes
  0 siblings, 2 replies; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-11  8:16 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Mon, Sep 11, 2023 at 02:27:25AM +0000, Joel Fernandes wrote:
> On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> > On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> > [...]
> > > > >  I have been running into another intermittent one as well which
> > > > > is the boost failure and that happens once in 10-15 runs or so.
> > > > >
> > > > > I was thinking of running the following configuration on an automated
> > > > > regular basis to at least provide a better clue on the lucky run that
> > > > > catches an issue. But then the issue is it would change timing enough
> > > > > to maybe hide bugs. I could also make it submit logs automatically to
> > > > > the list on such occurrences, but one step at a time and all that.  I
> > > > > do need to add (hopefully less noisy) tick/timer related trace events.
> > > > >
> > > > > # Define the bootargs array
> > > > > bootargs=(
> > > > >     "ftrace_dump_on_oops"
> > > > >     "panic_on_warn=1"
> > > > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > > > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > > > >     "trace_buf_size=10K"
> > > > >     "traceoff_on_warning=1"
> > > > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > > > )
> > > > > # Define the trace events array passed to bootargs.
> > > > > trace_events=(
> > > > >     "sched:sched_switch"
> > > > >     "sched:sched_waking"
> > > > >     "rcu:rcu_callback"
> > > > >     "rcu:rcu_fqs"
> > > > >     "rcu:rcu_quiescent_state_report"
> > > > >     "rcu:rcu_grace_period"
> > > > > )
> > > >
> > > > So some insight on this boost failure. Just before the boost failures are
> > > > reported, I see the migration thread interferring with the rcu_preempt thread
> > > > (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> > > > runnable while context switching, which means its execution is interferred.
> > > > The rcu_preempt thread is at RT prio 2 as can be seen.
> > > >
> > > > So some open-ended questions: what exactly does the migration thread want,
> > > > this is something related to CPU hotplug? And if the migration thread had to
> > > > run, why did the rcu_preempt thread not get pushed to another CPU by the
> > > > scheduler? We have 16 vCPUs for this test.
> > >
> > > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > > and a cpus_read_unlock() after finishing one?  But much depends on
> > > exactly what is starting those migration threads.
> > 
> > But in the field, a real RT task can preempt a reader without doing
> > cpus_read_lock() and may run into a similar boost issue?

The sysctl_sched_rt_runtime should prevent a livelock in most
configurations.  Here, rcutorture explicitly disables this.

> > > Then again, TREE03 is pretty aggressive about doing CPU hotplug.
> > 
> > Ok. I put a trace_printk() in the stopper thread to see what the
> > ->fn() is. I'm doing another run to see what falls out.
> 
> I think I am getting somewhere, It looks like the migration thread tries to
> push the task out due to __balance_push_cpu_stop. Clearly it fails because
> the rcu_preempt task is still on CPU 3. And the whole thing is live locked
> retrying the same.
> 
> I am not sure if the migration thread is trying to migrate rcu_preempt
> because the CPU is going down or because a higher priority RT task is
> running.
> 
> rcu_pree-13        3d..4. 2936723336us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> rcu_pree-13        3d..2. 2936723338us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> migratio-34        3..... 2936723341us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> migratio-34        3d..2. 2936723345us : sched_switch: prev_comm=migration/3 prev_pid=34 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> <idle>-0         3d..2. 2936723676us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> rcu_pree-13        3d..4. 2936723678us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> rcu_pree-13        3d..2. 2936723680us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> migratio-34        3..... 2936723683us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> 
> I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> and then I'll also look at why is it being pushed out in the first place.
> 
> At least I have a strong repro now running 5 instances of TREE03 in parallel
> for several hours.

Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
the problem still occurs?  If yes, then there is definitely some reason
other than CPU hotplug that makes this happen.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-11  8:16       ` Paul E. McKenney
@ 2023-09-11 13:17         ` Joel Fernandes
  2023-09-11 13:49           ` Paul E. McKenney
  2023-09-13 20:30         ` Joel Fernandes
  1 sibling, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-11 13:17 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

On Mon, Sep 11, 2023 at 01:16:21AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 11, 2023 at 02:27:25AM +0000, Joel Fernandes wrote:
> > On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> > > On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> > > [...]
> > > > > >  I have been running into another intermittent one as well which
> > > > > > is the boost failure and that happens once in 10-15 runs or so.
> > > > > >
> > > > > > I was thinking of running the following configuration on an automated
> > > > > > regular basis to at least provide a better clue on the lucky run that
> > > > > > catches an issue. But then the issue is it would change timing enough
> > > > > > to maybe hide bugs. I could also make it submit logs automatically to
> > > > > > the list on such occurrences, but one step at a time and all that.  I
> > > > > > do need to add (hopefully less noisy) tick/timer related trace events.
> > > > > >
> > > > > > # Define the bootargs array
> > > > > > bootargs=(
> > > > > >     "ftrace_dump_on_oops"
> > > > > >     "panic_on_warn=1"
> > > > > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > > > > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > > > > >     "trace_buf_size=10K"
> > > > > >     "traceoff_on_warning=1"
> > > > > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > > > > )
> > > > > > # Define the trace events array passed to bootargs.
> > > > > > trace_events=(
> > > > > >     "sched:sched_switch"
> > > > > >     "sched:sched_waking"
> > > > > >     "rcu:rcu_callback"
> > > > > >     "rcu:rcu_fqs"
> > > > > >     "rcu:rcu_quiescent_state_report"
> > > > > >     "rcu:rcu_grace_period"
> > > > > > )
> > > > >
> > > > > So some insight on this boost failure. Just before the boost failures are
> > > > > reported, I see the migration thread interferring with the rcu_preempt thread
> > > > > (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> > > > > runnable while context switching, which means its execution is interferred.
> > > > > The rcu_preempt thread is at RT prio 2 as can be seen.
> > > > >
> > > > > So some open-ended questions: what exactly does the migration thread want,
> > > > > this is something related to CPU hotplug? And if the migration thread had to
> > > > > run, why did the rcu_preempt thread not get pushed to another CPU by the
> > > > > scheduler? We have 16 vCPUs for this test.
> > > >
> > > > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > > > and a cpus_read_unlock() after finishing one?  But much depends on
> > > > exactly what is starting those migration threads.
> > > 
> > > But in the field, a real RT task can preempt a reader without doing
> > > cpus_read_lock() and may run into a similar boost issue?
> 
> The sysctl_sched_rt_runtime should prevent a livelock in most
> configurations.  Here, rcutorture explicitly disables this.

I see. Though RT throttling will actually stall the rcu_preempt thread as
well in the real world. RT throttling is a bit broken and we're trying to fix
it in scheduler land. Even if there are idle CPUs, RT throttling will starve
not just the offending RT task, but all of them essentially causing a
priority inversion between running RT and CFS tasks.

> > > > Then again, TREE03 is pretty aggressive about doing CPU hotplug.
> > > 
> > > Ok. I put a trace_printk() in the stopper thread to see what the
> > > ->fn() is. I'm doing another run to see what falls out.
> > 
> > I think I am getting somewhere, It looks like the migration thread tries to
> > push the task out due to __balance_push_cpu_stop. Clearly it fails because
> > the rcu_preempt task is still on CPU 3. And the whole thing is live locked
> > retrying the same.
> > 
> > I am not sure if the migration thread is trying to migrate rcu_preempt
> > because the CPU is going down or because a higher priority RT task is
> > running.
> > 
> > rcu_pree-13        3d..4. 2936723336us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > rcu_pree-13        3d..2. 2936723338us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > migratio-34        3..... 2936723341us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > migratio-34        3d..2. 2936723345us : sched_switch: prev_comm=migration/3 prev_pid=34 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> > <idle>-0         3d..2. 2936723676us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> > rcu_pree-13        3d..4. 2936723678us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > rcu_pree-13        3d..2. 2936723680us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > migratio-34        3..... 2936723683us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > 
> > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > and then I'll also look at why is it being pushed out in the first place.
> > 
> > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > for several hours.
> 
> Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> the problem still occurs?  If yes, then there is definitely some reason
> other than CPU hotplug that makes this happen.

Sure, I'll do this now. If that disappears, I'll do the cpu_read_lock()
suggestion you mentioned. IMHO this is increasingly looking like a scheduling
issue with RT and too many RT tasks running at the same time where the
rcu_preempt thread migrates around looking for a CPU than it does any work.

Today morning's traces show that __balance_push_cpu_stop() actually
successfully moves the rcu_preempt task out of the CPU. But a few ms later,
it is back on the same CPU again only to have the migration thread try to
move it out again. Maybe the cpuhp threads are fighting with it. There's also
all the rcutorture boost kthreads that are all running at the same time
during the failure but those are lower priority than the rcu_preempt thread
in this test.

Also I am switching to some day job things while the onoff_interval=0 test
runs, thanks a lot Paul for your help!  I'll continue to look at this as I
can.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-11 13:17         ` Joel Fernandes
@ 2023-09-11 13:49           ` Paul E. McKenney
  2023-09-11 16:18             ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-11 13:49 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Mon, Sep 11, 2023 at 01:17:30PM +0000, Joel Fernandes wrote:
> On Mon, Sep 11, 2023 at 01:16:21AM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 11, 2023 at 02:27:25AM +0000, Joel Fernandes wrote:
> > > On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> > > > On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> > > > [...]
> > > > > > >  I have been running into another intermittent one as well which
> > > > > > > is the boost failure and that happens once in 10-15 runs or so.
> > > > > > >
> > > > > > > I was thinking of running the following configuration on an automated
> > > > > > > regular basis to at least provide a better clue on the lucky run that
> > > > > > > catches an issue. But then the issue is it would change timing enough
> > > > > > > to maybe hide bugs. I could also make it submit logs automatically to
> > > > > > > the list on such occurrences, but one step at a time and all that.  I
> > > > > > > do need to add (hopefully less noisy) tick/timer related trace events.
> > > > > > >
> > > > > > > # Define the bootargs array
> > > > > > > bootargs=(
> > > > > > >     "ftrace_dump_on_oops"
> > > > > > >     "panic_on_warn=1"
> > > > > > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > > > > > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > > > > > >     "trace_buf_size=10K"
> > > > > > >     "traceoff_on_warning=1"
> > > > > > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > > > > > )
> > > > > > > # Define the trace events array passed to bootargs.
> > > > > > > trace_events=(
> > > > > > >     "sched:sched_switch"
> > > > > > >     "sched:sched_waking"
> > > > > > >     "rcu:rcu_callback"
> > > > > > >     "rcu:rcu_fqs"
> > > > > > >     "rcu:rcu_quiescent_state_report"
> > > > > > >     "rcu:rcu_grace_period"
> > > > > > > )
> > > > > >
> > > > > > So some insight on this boost failure. Just before the boost failures are
> > > > > > reported, I see the migration thread interferring with the rcu_preempt thread
> > > > > > (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> > > > > > runnable while context switching, which means its execution is interferred.
> > > > > > The rcu_preempt thread is at RT prio 2 as can be seen.
> > > > > >
> > > > > > So some open-ended questions: what exactly does the migration thread want,
> > > > > > this is something related to CPU hotplug? And if the migration thread had to
> > > > > > run, why did the rcu_preempt thread not get pushed to another CPU by the
> > > > > > scheduler? We have 16 vCPUs for this test.
> > > > >
> > > > > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > > > > and a cpus_read_unlock() after finishing one?  But much depends on
> > > > > exactly what is starting those migration threads.
> > > > 
> > > > But in the field, a real RT task can preempt a reader without doing
> > > > cpus_read_lock() and may run into a similar boost issue?
> > 
> > The sysctl_sched_rt_runtime should prevent a livelock in most
> > configurations.  Here, rcutorture explicitly disables this.
> 
> I see. Though RT throttling will actually stall the rcu_preempt thread as
> well in the real world. RT throttling is a bit broken and we're trying to fix
> it in scheduler land. Even if there are idle CPUs, RT throttling will starve
> not just the offending RT task, but all of them essentially causing a
> priority inversion between running RT and CFS tasks.

Fair point.  But that requires that the offending runaway RT task hit both
a reader and the grace-period kthread.  Keeping in mind that rcutorture
is provisioning one runaway RT task per CPU, which in the real world is
hopefully quite rare.  Hopefully.  ;-)

> > > > > Then again, TREE03 is pretty aggressive about doing CPU hotplug.
> > > > 
> > > > Ok. I put a trace_printk() in the stopper thread to see what the
> > > > ->fn() is. I'm doing another run to see what falls out.
> > > 
> > > I think I am getting somewhere, It looks like the migration thread tries to
> > > push the task out due to __balance_push_cpu_stop. Clearly it fails because
> > > the rcu_preempt task is still on CPU 3. And the whole thing is live locked
> > > retrying the same.
> > > 
> > > I am not sure if the migration thread is trying to migrate rcu_preempt
> > > because the CPU is going down or because a higher priority RT task is
> > > running.
> > > 
> > > rcu_pree-13        3d..4. 2936723336us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > > rcu_pree-13        3d..2. 2936723338us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > > migratio-34        3..... 2936723341us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > > migratio-34        3d..2. 2936723345us : sched_switch: prev_comm=migration/3 prev_pid=34 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> > > <idle>-0         3d..2. 2936723676us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> > > rcu_pree-13        3d..4. 2936723678us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > > rcu_pree-13        3d..2. 2936723680us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > > migratio-34        3..... 2936723683us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > > 
> > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > and then I'll also look at why is it being pushed out in the first place.
> > > 
> > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > for several hours.
> > 
> > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > the problem still occurs?  If yes, then there is definitely some reason
> > other than CPU hotplug that makes this happen.
> 
> Sure, I'll do this now. If that disappears, I'll do the cpu_read_lock()
> suggestion you mentioned. IMHO this is increasingly looking like a scheduling
> issue with RT and too many RT tasks running at the same time where the
> rcu_preempt thread migrates around looking for a CPU than it does any work.
> 
> Today morning's traces show that __balance_push_cpu_stop() actually
> successfully moves the rcu_preempt task out of the CPU. But a few ms later,
> it is back on the same CPU again only to have the migration thread try to
> move it out again. Maybe the cpuhp threads are fighting with it. There's also
> all the rcutorture boost kthreads that are all running at the same time
> during the failure but those are lower priority than the rcu_preempt thread
> in this test.
> 
> Also I am switching to some day job things while the onoff_interval=0 test
> runs, thanks a lot Paul for your help!  I'll continue to look at this as I
> can.

Sounds like good progress!  Please let me know how it goes!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-11 13:49           ` Paul E. McKenney
@ 2023-09-11 16:18             ` Joel Fernandes
  0 siblings, 0 replies; 23+ messages in thread
From: Joel Fernandes @ 2023-09-11 16:18 UTC (permalink / raw)
  To: paulmck; +Cc: Frederic Weisbecker, rcu

On Mon, Sep 11, 2023 at 9:49 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Sep 11, 2023 at 01:17:30PM +0000, Joel Fernandes wrote:
> > On Mon, Sep 11, 2023 at 01:16:21AM -0700, Paul E. McKenney wrote:
> > > On Mon, Sep 11, 2023 at 02:27:25AM +0000, Joel Fernandes wrote:
> > > > On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> > > > > On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> > > > > [...]
> > > > > > > >  I have been running into another intermittent one as well which
> > > > > > > > is the boost failure and that happens once in 10-15 runs or so.
> > > > > > > >
> > > > > > > > I was thinking of running the following configuration on an automated
> > > > > > > > regular basis to at least provide a better clue on the lucky run that
> > > > > > > > catches an issue. But then the issue is it would change timing enough
> > > > > > > > to maybe hide bugs. I could also make it submit logs automatically to
> > > > > > > > the list on such occurrences, but one step at a time and all that.  I
> > > > > > > > do need to add (hopefully less noisy) tick/timer related trace events.
> > > > > > > >
> > > > > > > > # Define the bootargs array
> > > > > > > > bootargs=(
[...]
> > > > > > > So some insight on this boost failure. Just before the boost failures are
> > > > > > > reported, I see the migration thread interferring with the rcu_preempt thread
> > > > > > > (aka GP kthread). See trace below. Of note is that the rcu_preempt thread is
> > > > > > > runnable while context switching, which means its execution is interferred.
> > > > > > > The rcu_preempt thread is at RT prio 2 as can be seen.
> > > > > > >
> > > > > > > So some open-ended questions: what exactly does the migration thread want,
> > > > > > > this is something related to CPU hotplug? And if the migration thread had to
> > > > > > > run, why did the rcu_preempt thread not get pushed to another CPU by the
> > > > > > > scheduler? We have 16 vCPUs for this test.
> > > > > >
> > > > > > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > > > > > and a cpus_read_unlock() after finishing one?  But much depends on
> > > > > > exactly what is starting those migration threads.
> > > > >
> > > > > But in the field, a real RT task can preempt a reader without doing
> > > > > cpus_read_lock() and may run into a similar boost issue?
> > >
> > > The sysctl_sched_rt_runtime should prevent a livelock in most
> > > configurations.  Here, rcutorture explicitly disables this.
> >
> > I see. Though RT throttling will actually stall the rcu_preempt thread as
> > well in the real world. RT throttling is a bit broken and we're trying to fix
> > it in scheduler land. Even if there are idle CPUs, RT throttling will starve
> > not just the offending RT task, but all of them essentially causing a
> > priority inversion between running RT and CFS tasks.
>
> Fair point.  But that requires that the offending runaway RT task hit both
> a reader and the grace-period kthread.  Keeping in mind that rcutorture
> is provisioning one runaway RT task per CPU, which in the real world is
> hopefully quite rare.  Hopefully.  ;-)

You are right, I exaggerated a bit. Indeed in the real world, RT
throttling can cause a prio inversion with CFS only if all other CPUs
are also RT throttled. Otherwise it tries to migrate the RT task to
another CPU. That's a very great point.

> Sounds like good progress!  Please let me know how it goes!!!

Thanks! Will do,

 - Joel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-11  8:16       ` Paul E. McKenney
  2023-09-11 13:17         ` Joel Fernandes
@ 2023-09-13 20:30         ` Joel Fernandes
  2023-09-14 11:11           ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-13 20:30 UTC (permalink / raw)
  To: paulmck; +Cc: Frederic Weisbecker, rcu

On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
[..]
> > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > and then I'll also look at why is it being pushed out in the first place.
> >
> > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > for several hours.
>
> Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> the problem still occurs?  If yes, then there is definitely some reason
> other than CPU hotplug that makes this happen.

Hi Paul,
So looks so far like onoff_interval=0 makes the issue disappear. So
likely hotplug related. I am ok with doing the cpus_read_lock during
boost testing and seeing if that fixes it. If it does, I can move on
to the next thing in my backlog.

What do you think? Or should I spend more time root-causing it? It is
most like runaway RT threads combined with the CPU hotplug threads,
making scheduling of the rcu_preempt thread not happen. But I can't
say for sure without more/better tracing (Speaking of better tracing,
I am adding core-dump support to rcutorture, but it is not there yet).

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-13 20:30         ` Joel Fernandes
@ 2023-09-14 11:11           ` Paul E. McKenney
  2023-09-14 13:13             ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-14 11:11 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> [..]
> > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > and then I'll also look at why is it being pushed out in the first place.
> > >
> > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > for several hours.
> >
> > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > the problem still occurs?  If yes, then there is definitely some reason
> > other than CPU hotplug that makes this happen.
> 
> Hi Paul,
> So looks so far like onoff_interval=0 makes the issue disappear. So
> likely hotplug related. I am ok with doing the cpus_read_lock during
> boost testing and seeing if that fixes it. If it does, I can move on
> to the next thing in my backlog.
> 
> What do you think? Or should I spend more time root-causing it? It is
> most like runaway RT threads combined with the CPU hotplug threads,
> making scheduling of the rcu_preempt thread not happen. But I can't
> say for sure without more/better tracing (Speaking of better tracing,
> I am adding core-dump support to rcutorture, but it is not there yet).

This would not be the first time rcutorture has had trouble with those
threads, so I am for adding the cpus_read_lock().

Additional root-causing might be helpful, but then again, you might
have higher priority things to worry about.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-14 11:11           ` Paul E. McKenney
@ 2023-09-14 13:13             ` Joel Fernandes
  2023-09-14 15:23               ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-14 13:13 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [..]
> > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > and then I'll also look at why is it being pushed out in the first place.
> > > >
> > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > for several hours.
> > >
> > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > the problem still occurs?  If yes, then there is definitely some reason
> > > other than CPU hotplug that makes this happen.
> > 
> > Hi Paul,
> > So looks so far like onoff_interval=0 makes the issue disappear. So
> > likely hotplug related. I am ok with doing the cpus_read_lock during
> > boost testing and seeing if that fixes it. If it does, I can move on
> > to the next thing in my backlog.
> > 
> > What do you think? Or should I spend more time root-causing it? It is
> > most like runaway RT threads combined with the CPU hotplug threads,
> > making scheduling of the rcu_preempt thread not happen. But I can't
> > say for sure without more/better tracing (Speaking of better tracing,
> > I am adding core-dump support to rcutorture, but it is not there yet).
> 
> This would not be the first time rcutorture has had trouble with those
> threads, so I am for adding the cpus_read_lock().
> 
> Additional root-causing might be helpful, but then again, you might
> have higher priority things to worry about.  ;-)

No worries. Unfortunately putting cpus_read_lock() around the boost test
causes hangs. I tried something like the following [1]. If you have a diff, I can
quickly try something to see if the issue goes away as well.

I am also interested in tracing further the state of CPUs when the back and
forth migration happens. That code is really hairy (select_fallback_rq())
with some scary code comments about more audit being needed, so I would not
be surprised if something is off there. I left some trace prints in there [2]
to see if I can gather more information. I'll let you know if I find
anything there.

[ btw I'm officially OOO from Mon-Thurs next week, that's mostly Ok since the
last set stable releases just got done. Though I might cheat and attend to
low-hanging fruit type of tasks like documentation during the OOO :-D ]

[1]
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 9c028b118abc..4354f87f522a 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1115,6 +1115,7 @@ static int rcu_torture_boost(void *arg)
 		}
 
 		// Do one boost-test interval.
+		cpus_read_lock();
 		endtime = oldstarttime + test_boost_duration * HZ;
 		while (time_before(jiffies, endtime)) {
 			// Has current GP gone too long?
@@ -1164,7 +1165,7 @@ static int rcu_torture_boost(void *arg)
 			}
 			schedule_timeout_uninterruptible(1);
 		}
-
+		cpus_read_unlock();
 		/* Go do the stutter. */
 checkwait:	if (stutter_wait("rcu_torture_boost"))
 			sched_set_fifo_low(current);
-- 

[2]
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1569e9591654..1e79e059e578 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2552,6 +2552,12 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 	rq = cpu_rq(new_cpu);
 
 	rq_lock(rq, rf);
+	if (rq->curr) {
+		trace_printk("Curr on new_cpu (%d) is: %s[%d]\n",
+				new_cpu, rq->curr->comm, task_pid_nr(rq->curr));
+	} else {
+		trace_printk("Curr on new_cpu (%d) is NULL\n", new_cpu);
+	}
 	WARN_ON_ONCE(task_cpu(p) != new_cpu);
 	activate_task(rq, p, 0);
 	check_preempt_curr(rq, p, 0);
@@ -9490,6 +9496,11 @@ static int __balance_push_cpu_stop(void *arg)
 		trace_printk("attempt push task %d currently on cpu %d to...\n", task_pid_nr(p), rq->cpu);
 		cpu = select_fallback_rq(rq->cpu, p);
 		trace_printk("to new cpu %d\n", cpu);
+		trace_printk("is_cpu_allowed(%d), md(p)=%d, pf_kthread=%d, kthread_is_per_cpu=%d, cpu_online=%d, cpu_dying=%d\n",
+				cpu,
+				is_migration_disabled(p),
+				p->flags & PF_KTHREAD, kthread_is_per_cpu(p),
+				cpu_online(cpu), cpu_dying(cpu));
 		rq = __migrate_task(rq, &rf, p, cpu);
 		trace_printk("After __migrate_task, new cpu = %d\n", rq->cpu);
 	}
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-14 13:13             ` Joel Fernandes
@ 2023-09-14 15:23               ` Paul E. McKenney
  2023-09-14 18:56                 ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-14 15:23 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > [..]
> > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > >
> > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > for several hours.
> > > >
> > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > other than CPU hotplug that makes this happen.
> > > 
> > > Hi Paul,
> > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > boost testing and seeing if that fixes it. If it does, I can move on
> > > to the next thing in my backlog.
> > > 
> > > What do you think? Or should I spend more time root-causing it? It is
> > > most like runaway RT threads combined with the CPU hotplug threads,
> > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > say for sure without more/better tracing (Speaking of better tracing,
> > > I am adding core-dump support to rcutorture, but it is not there yet).
> > 
> > This would not be the first time rcutorture has had trouble with those
> > threads, so I am for adding the cpus_read_lock().
> > 
> > Additional root-causing might be helpful, but then again, you might
> > have higher priority things to worry about.  ;-)
> 
> No worries. Unfortunately putting cpus_read_lock() around the boost test
> causes hangs. I tried something like the following [1]. If you have a diff, I can
> quickly try something to see if the issue goes away as well.

The other approaches that occur to me are:

1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
	tricky as well.

2.	Rearrange the testing to convert one of the TREE0* scenarios that
	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
	with boosting but without CPU hotplug.	Then remove boosting
	from TREE04.

Of these, #2 seems most productive.  But is there a better way?

> I am also interested in tracing further the state of CPUs when the back and
> forth migration happens. That code is really hairy (select_fallback_rq())
> with some scary code comments about more audit being needed, so I would not
> be surprised if something is off there. I left some trace prints in there [2]
> to see if I can gather more information. I'll let you know if I find
> anything there.

Looking forward to seeing what you come up with!

> [ btw I'm officially OOO from Mon-Thurs next week, that's mostly Ok since the
> last set stable releases just got done. Though I might cheat and attend to
> low-hanging fruit type of tasks like documentation during the OOO :-D ]

Have a good time away!

							Thanx, Paul

> [1]
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 9c028b118abc..4354f87f522a 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -1115,6 +1115,7 @@ static int rcu_torture_boost(void *arg)
>  		}
>  
>  		// Do one boost-test interval.
> +		cpus_read_lock();
>  		endtime = oldstarttime + test_boost_duration * HZ;
>  		while (time_before(jiffies, endtime)) {
>  			// Has current GP gone too long?
> @@ -1164,7 +1165,7 @@ static int rcu_torture_boost(void *arg)
>  			}
>  			schedule_timeout_uninterruptible(1);
>  		}
> -
> +		cpus_read_unlock();
>  		/* Go do the stutter. */
>  checkwait:	if (stutter_wait("rcu_torture_boost"))
>  			sched_set_fifo_low(current);
> -- 
> 
> [2]
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1569e9591654..1e79e059e578 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2552,6 +2552,12 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
>  	rq = cpu_rq(new_cpu);
>  
>  	rq_lock(rq, rf);
> +	if (rq->curr) {
> +		trace_printk("Curr on new_cpu (%d) is: %s[%d]\n",
> +				new_cpu, rq->curr->comm, task_pid_nr(rq->curr));
> +	} else {
> +		trace_printk("Curr on new_cpu (%d) is NULL\n", new_cpu);
> +	}
>  	WARN_ON_ONCE(task_cpu(p) != new_cpu);
>  	activate_task(rq, p, 0);
>  	check_preempt_curr(rq, p, 0);
> @@ -9490,6 +9496,11 @@ static int __balance_push_cpu_stop(void *arg)
>  		trace_printk("attempt push task %d currently on cpu %d to...\n", task_pid_nr(p), rq->cpu);
>  		cpu = select_fallback_rq(rq->cpu, p);
>  		trace_printk("to new cpu %d\n", cpu);
> +		trace_printk("is_cpu_allowed(%d), md(p)=%d, pf_kthread=%d, kthread_is_per_cpu=%d, cpu_online=%d, cpu_dying=%d\n",
> +				cpu,
> +				is_migration_disabled(p),
> +				p->flags & PF_KTHREAD, kthread_is_per_cpu(p),
> +				cpu_online(cpu), cpu_dying(cpu));
>  		rq = __migrate_task(rq, &rf, p, cpu);
>  		trace_printk("After __migrate_task, new cpu = %d\n", rq->cpu);
>  	}
> -- 
> 2.42.0.459.ge4e396fd5e-goog
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-14 15:23               ` Paul E. McKenney
@ 2023-09-14 18:56                 ` Joel Fernandes
  2023-09-14 21:53                   ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-14 18:56 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > [..]
> > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > >
> > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > for several hours.
> > > > >
> > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > other than CPU hotplug that makes this happen.
> > > > 
> > > > Hi Paul,
> > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > to the next thing in my backlog.
> > > > 
> > > > What do you think? Or should I spend more time root-causing it? It is
> > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > 
> > > This would not be the first time rcutorture has had trouble with those
> > > threads, so I am for adding the cpus_read_lock().
> > > 
> > > Additional root-causing might be helpful, but then again, you might
> > > have higher priority things to worry about.  ;-)
> > 
> > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > quickly try something to see if the issue goes away as well.
> 
> The other approaches that occur to me are:
> 
> 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> 	tricky as well.
> 
> 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> 	with boosting but without CPU hotplug.	Then remove boosting
> 	from TREE04.
> 
> Of these, #2 seems most productive.  But is there a better way?

We could have the gp thread at higher priority for TREE03. What I see
consistently is that the GP thread gets migrated from CPU M to CPU N only to
be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
give less of a run-around to rcu_preempt maybe enough to prevent the grace
period from stalling. I am not sure if this will fix it, but I am running a
test to see how it goes, will let you know.

thanks,

 - Joel

> 
> > I am also interested in tracing further the state of CPUs when the back and
> > forth migration happens. That code is really hairy (select_fallback_rq())
> > with some scary code comments about more audit being needed, so I would not
> > be surprised if something is off there. I left some trace prints in there [2]
> > to see if I can gather more information. I'll let you know if I find
> > anything there.
> 
> Looking forward to seeing what you come up with!
> 
> > [ btw I'm officially OOO from Mon-Thurs next week, that's mostly Ok since the
> > last set stable releases just got done. Though I might cheat and attend to
> > low-hanging fruit type of tasks like documentation during the OOO :-D ]
> 
> Have a good time away!
> 
> 							Thanx, Paul
> 
> > [1]
> > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > index 9c028b118abc..4354f87f522a 100644
> > --- a/kernel/rcu/rcutorture.c
> > +++ b/kernel/rcu/rcutorture.c
> > @@ -1115,6 +1115,7 @@ static int rcu_torture_boost(void *arg)
> >  		}
> >  
> >  		// Do one boost-test interval.
> > +		cpus_read_lock();
> >  		endtime = oldstarttime + test_boost_duration * HZ;
> >  		while (time_before(jiffies, endtime)) {
> >  			// Has current GP gone too long?
> > @@ -1164,7 +1165,7 @@ static int rcu_torture_boost(void *arg)
> >  			}
> >  			schedule_timeout_uninterruptible(1);
> >  		}
> > -
> > +		cpus_read_unlock();
> >  		/* Go do the stutter. */
> >  checkwait:	if (stutter_wait("rcu_torture_boost"))
> >  			sched_set_fifo_low(current);
> > -- 
> > 
> > [2]
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1569e9591654..1e79e059e578 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2552,6 +2552,12 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
> >  	rq = cpu_rq(new_cpu);
> >  
> >  	rq_lock(rq, rf);
> > +	if (rq->curr) {
> > +		trace_printk("Curr on new_cpu (%d) is: %s[%d]\n",
> > +				new_cpu, rq->curr->comm, task_pid_nr(rq->curr));
> > +	} else {
> > +		trace_printk("Curr on new_cpu (%d) is NULL\n", new_cpu);
> > +	}
> >  	WARN_ON_ONCE(task_cpu(p) != new_cpu);
> >  	activate_task(rq, p, 0);
> >  	check_preempt_curr(rq, p, 0);
> > @@ -9490,6 +9496,11 @@ static int __balance_push_cpu_stop(void *arg)
> >  		trace_printk("attempt push task %d currently on cpu %d to...\n", task_pid_nr(p), rq->cpu);
> >  		cpu = select_fallback_rq(rq->cpu, p);
> >  		trace_printk("to new cpu %d\n", cpu);
> > +		trace_printk("is_cpu_allowed(%d), md(p)=%d, pf_kthread=%d, kthread_is_per_cpu=%d, cpu_online=%d, cpu_dying=%d\n",
> > +				cpu,
> > +				is_migration_disabled(p),
> > +				p->flags & PF_KTHREAD, kthread_is_per_cpu(p),
> > +				cpu_online(cpu), cpu_dying(cpu));
> >  		rq = __migrate_task(rq, &rf, p, cpu);
> >  		trace_printk("After __migrate_task, new cpu = %d\n", rq->cpu);
> >  	}
> > -- 
> > 2.42.0.459.ge4e396fd5e-goog
> > 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-14 18:56                 ` Joel Fernandes
@ 2023-09-14 21:53                   ` Joel Fernandes
  2023-09-15  0:13                     ` Joel Fernandes
  2023-09-15 14:45                     ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Joel Fernandes @ 2023-09-14 21:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

On Thu, Sep 14, 2023 at 06:56:27PM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > [..]
> > > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > > >
> > > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > > for several hours.
> > > > > >
> > > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > > other than CPU hotplug that makes this happen.
> > > > > 
> > > > > Hi Paul,
> > > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > > to the next thing in my backlog.
> > > > > 
> > > > > What do you think? Or should I spend more time root-causing it? It is
> > > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > > 
> > > > This would not be the first time rcutorture has had trouble with those
> > > > threads, so I am for adding the cpus_read_lock().
> > > > 
> > > > Additional root-causing might be helpful, but then again, you might
> > > > have higher priority things to worry about.  ;-)
> > > 
> > > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > > quickly try something to see if the issue goes away as well.
> > 
> > The other approaches that occur to me are:
> > 
> > 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> > 	tricky as well.
> > 
> > 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> > 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> > 	with boosting but without CPU hotplug.	Then remove boosting
> > 	from TREE04.
> > 
> > Of these, #2 seems most productive.  But is there a better way?
> 
> We could have the gp thread at higher priority for TREE03. What I see
> consistently is that the GP thread gets migrated from CPU M to CPU N only to
> be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
> which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
> give less of a run-around to rcu_preempt maybe enough to prevent the grace
> period from stalling. I am not sure if this will fix it, but I am running a
> test to see how it goes, will let you know.

That led to a lot of fireworks. :-) I am thinking though, do we really need
to run a boost kthread on all CPUs? I think that might be the root cause
because the boost threads run on all CPUs except perhaps the one dying.

We could run them on just the odd, or even ones and still be able to get
sufficient boost testing. This may be especially important without RT
throttling. I'll go ahead and queue a test like that.

Thoughts?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-14 21:53                   ` Joel Fernandes
@ 2023-09-15  0:13                     ` Joel Fernandes
  2023-09-15 11:33                       ` Joel Fernandes
  2023-09-15 14:48                       ` Paul E. McKenney
  2023-09-15 14:45                     ` Paul E. McKenney
  1 sibling, 2 replies; 23+ messages in thread
From: Joel Fernandes @ 2023-09-15  0:13 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

On Thu, Sep 14, 2023 at 09:53:24PM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 06:56:27PM +0000, Joel Fernandes wrote:
> > On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > > > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > [..]
> > > > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > > > >
> > > > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > > > for several hours.
> > > > > > >
> > > > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > > > other than CPU hotplug that makes this happen.
> > > > > > 
> > > > > > Hi Paul,
> > > > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > > > to the next thing in my backlog.
> > > > > > 
> > > > > > What do you think? Or should I spend more time root-causing it? It is
> > > > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > > > 
> > > > > This would not be the first time rcutorture has had trouble with those
> > > > > threads, so I am for adding the cpus_read_lock().
> > > > > 
> > > > > Additional root-causing might be helpful, but then again, you might
> > > > > have higher priority things to worry about.  ;-)
> > > > 
> > > > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > > > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > > > quickly try something to see if the issue goes away as well.
> > > 
> > > The other approaches that occur to me are:
> > > 
> > > 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> > > 	tricky as well.
> > > 
> > > 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> > > 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> > > 	with boosting but without CPU hotplug.	Then remove boosting
> > > 	from TREE04.
> > > 
> > > Of these, #2 seems most productive.  But is there a better way?
> > 
> > We could have the gp thread at higher priority for TREE03. What I see
> > consistently is that the GP thread gets migrated from CPU M to CPU N only to
> > be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
> > which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
> > give less of a run-around to rcu_preempt maybe enough to prevent the grace
> > period from stalling. I am not sure if this will fix it, but I am running a
> > test to see how it goes, will let you know.
> 
> That led to a lot of fireworks. :-) I am thinking though, do we really need
> to run a boost kthread on all CPUs? I think that might be the root cause
> because the boost threads run on all CPUs except perhaps the one dying.
> 
> We could run them on just the odd, or even ones and still be able to get
> sufficient boost testing. This may be especially important without RT
> throttling. I'll go ahead and queue a test like that.

Sorry if I am too noisy. So far only letting the rcutorture boost threads
exist on odd CPUs, I am seeing the issue go away (but I'm running an extended
test to confirm).

On the other hand, I came up with a real fix [1] and I am currently testing it.
This is to fix a live lock between RT push and CPU hotplug's
select_fallback_rq()-induced push. I am not sure if the fix works but I have
some faith based on what I'm seeing in traces. Fingers crossed. I also feel
the real fix is needed to prevent these issues even if we're able to hide it
by halving the total rcutorture boost threads.

[1]
---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] Fix livelock between RT and select_fallback_rq

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/sched/rt.c | 4 ++--
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e5074115..b92aab35d7ec 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1945,7 +1945,7 @@ static int find_lowest_rq(struct task_struct *task)
 
 			best_cpu = cpumask_any_and_distribute(lowest_mask,
 							      sched_domain_span(sd));
-			if (best_cpu < nr_cpu_ids) {
+			if (best_cpu < nr_cpu_ids && !cpu_dying(best_cpu)) {
 				rcu_read_unlock();
 				return best_cpu;
 			}
@@ -1962,7 +1962,7 @@ static int find_lowest_rq(struct task_struct *task)
 		return this_cpu;
 
 	cpu = cpumask_any_distribute(lowest_mask);
-	if (cpu < nr_cpu_ids)
+	if (cpu < nr_cpu_ids && !cpu_dying(cpu))
 		return cpu;
 
 	return -1;
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15  0:13                     ` Joel Fernandes
@ 2023-09-15 11:33                       ` Joel Fernandes
  2023-09-15 14:53                         ` Paul E. McKenney
  2023-09-15 14:48                       ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2023-09-15 11:33 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 09:53:24PM +0000, Joel Fernandes wrote:
> > On Thu, Sep 14, 2023 at 06:56:27PM +0000, Joel Fernandes wrote:
> > > On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > > > > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > [..]
> > > > > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > > > > >
> > > > > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > > > > for several hours.
> > > > > > > >
> > > > > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > > > > other than CPU hotplug that makes this happen.
> > > > > > > 
> > > > > > > Hi Paul,
> > > > > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > > > > to the next thing in my backlog.
> > > > > > > 
> > > > > > > What do you think? Or should I spend more time root-causing it? It is
> > > > > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > > > > 
> > > > > > This would not be the first time rcutorture has had trouble with those
> > > > > > threads, so I am for adding the cpus_read_lock().
> > > > > > 
> > > > > > Additional root-causing might be helpful, but then again, you might
> > > > > > have higher priority things to worry about.  ;-)
> > > > > 
> > > > > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > > > > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > > > > quickly try something to see if the issue goes away as well.
> > > > 
> > > > The other approaches that occur to me are:
> > > > 
> > > > 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> > > > 	tricky as well.
> > > > 
> > > > 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> > > > 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> > > > 	with boosting but without CPU hotplug.	Then remove boosting
> > > > 	from TREE04.
> > > > 
> > > > Of these, #2 seems most productive.  But is there a better way?
> > > 
> > > We could have the gp thread at higher priority for TREE03. What I see
> > > consistently is that the GP thread gets migrated from CPU M to CPU N only to
> > > be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
> > > which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
> > > give less of a run-around to rcu_preempt maybe enough to prevent the grace
> > > period from stalling. I am not sure if this will fix it, but I am running a
> > > test to see how it goes, will let you know.
> > 
> > That led to a lot of fireworks. :-) I am thinking though, do we really need
> > to run a boost kthread on all CPUs? I think that might be the root cause
> > because the boost threads run on all CPUs except perhaps the one dying.
> > 
> > We could run them on just the odd, or even ones and still be able to get
> > sufficient boost testing. This may be especially important without RT
> > throttling. I'll go ahead and queue a test like that.
> 
> Sorry if I am too noisy. So far only letting the rcutorture boost threads
> exist on odd CPUs, I am seeing the issue go away (but I'm running an extended
> test to confirm).
> 
> On the other hand, I came up with a real fix [1] and I am currently testing it.
> This is to fix a live lock between RT push and CPU hotplug's
> select_fallback_rq()-induced push. I am not sure if the fix works but I have
> some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> the real fix is needed to prevent these issues even if we're able to hide it
> by halving the total rcutorture boost threads.

So that fixed it without any changes to RCU. Below is the updated patch also
for the archives. Though I'm rewriting it slightly differently and testing
that more. The main thing I am doing in the new patch is I find that RT
should not select !cpu_active() CPUs since those have the scheduler turned
off. Though checking for cpu_dying() also works. I could not find any
instance where cpu_dying() != cpu_active() but there could be a tiny window
where that is true. Anyway, I'll make some noise with scheduler folks once I
have the new version of the patch tested.

Also halving the number of RT boost threads makes it less likely to occur but
does not work. Not too surprising since the issue actually may not be related
to too many RT threads but rather a lockup between hotplug and RT..

---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] Fix livelock between RT and select_fallback_rq

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/sched/rt.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e5074115..a089d6f24e5b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -526,6 +526,11 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
 }
 #endif
 
+static inline bool rt_task_fits_in_cpu(struct task_struct *p, int cpu)
+{
+	return rt_task_fits_capacity(p, cpu) && !cpu_dying(cpu);
+}
+
 #ifdef CONFIG_RT_GROUP_SCHED
 
 static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
@@ -1641,14 +1646,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	       unlikely(rt_task(curr)) &&
 	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
 
-	if (test || !rt_task_fits_capacity(p, cpu)) {
+	if (test || !rt_task_fits_in_cpu(p, cpu)) {
 		int target = find_lowest_rq(p);
 
 		/*
 		 * Bail out if we were forcing a migration to find a better
 		 * fitting CPU but our search failed.
 		 */
-		if (!test && target != -1 && !rt_task_fits_capacity(p, target))
+		if (!test && target != -1 && !rt_task_fits_in_cpu(p, target))
 			goto out_unlock;
 
 		/*
@@ -1892,21 +1897,9 @@ static int find_lowest_rq(struct task_struct *task)
 	if (task->nr_cpus_allowed == 1)
 		return -1; /* No other targets possible */
 
-	/*
-	 * If we're on asym system ensure we consider the different capacities
-	 * of the CPUs when searching for the lowest_mask.
-	 */
-	if (sched_asym_cpucap_active()) {
-
-		ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
+	ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
 					  task, lowest_mask,
-					  rt_task_fits_capacity);
-	} else {
-
-		ret = cpupri_find(&task_rq(task)->rd->cpupri,
-				  task, lowest_mask);
-	}
-
+					  rt_task_fits_in_cpu);
 	if (!ret)
 		return -1; /* No targets found */
 
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-14 21:53                   ` Joel Fernandes
  2023-09-15  0:13                     ` Joel Fernandes
@ 2023-09-15 14:45                     ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-15 14:45 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Thu, Sep 14, 2023 at 09:53:24PM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 06:56:27PM +0000, Joel Fernandes wrote:
> > On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > > > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > [..]
> > > > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > > > >
> > > > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > > > for several hours.
> > > > > > >
> > > > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > > > other than CPU hotplug that makes this happen.
> > > > > > 
> > > > > > Hi Paul,
> > > > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > > > to the next thing in my backlog.
> > > > > > 
> > > > > > What do you think? Or should I spend more time root-causing it? It is
> > > > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > > > 
> > > > > This would not be the first time rcutorture has had trouble with those
> > > > > threads, so I am for adding the cpus_read_lock().
> > > > > 
> > > > > Additional root-causing might be helpful, but then again, you might
> > > > > have higher priority things to worry about.  ;-)
> > > > 
> > > > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > > > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > > > quickly try something to see if the issue goes away as well.
> > > 
> > > The other approaches that occur to me are:
> > > 
> > > 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> > > 	tricky as well.
> > > 
> > > 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> > > 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> > > 	with boosting but without CPU hotplug.	Then remove boosting
> > > 	from TREE04.
> > > 
> > > Of these, #2 seems most productive.  But is there a better way?
> > 
> > We could have the gp thread at higher priority for TREE03. What I see
> > consistently is that the GP thread gets migrated from CPU M to CPU N only to
> > be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
> > which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
> > give less of a run-around to rcu_preempt maybe enough to prevent the grace
> > period from stalling. I am not sure if this will fix it, but I am running a
> > test to see how it goes, will let you know.
> 
> That led to a lot of fireworks. :-) I am thinking though, do we really need
> to run a boost kthread on all CPUs? I think that might be the root cause
> because the boost threads run on all CPUs except perhaps the one dying.
> 
> We could run them on just the odd, or even ones and still be able to get
> sufficient boost testing. This may be especially important without RT
> throttling. I'll go ahead and queue a test like that.
> 
> Thoughts?

The problem with this is that it will often render RCU priority boosting
unnecessary.  Any kthread preempted within an RCU read-side critical
section will with high probability quickly be resumed on one of the
even-numbered CPUs.

Or were you also planning to bind the rcu_torture_reader() kthreads to
a specific CPU, preventing such migration?  Or am I missing something
here?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15  0:13                     ` Joel Fernandes
  2023-09-15 11:33                       ` Joel Fernandes
@ 2023-09-15 14:48                       ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-15 14:48 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 09:53:24PM +0000, Joel Fernandes wrote:
> > On Thu, Sep 14, 2023 at 06:56:27PM +0000, Joel Fernandes wrote:
> > > On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > > > > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > [..]
> > > > > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > > > > >
> > > > > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > > > > for several hours.
> > > > > > > >
> > > > > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > > > > other than CPU hotplug that makes this happen.
> > > > > > > 
> > > > > > > Hi Paul,
> > > > > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > > > > to the next thing in my backlog.
> > > > > > > 
> > > > > > > What do you think? Or should I spend more time root-causing it? It is
> > > > > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > > > > 
> > > > > > This would not be the first time rcutorture has had trouble with those
> > > > > > threads, so I am for adding the cpus_read_lock().
> > > > > > 
> > > > > > Additional root-causing might be helpful, but then again, you might
> > > > > > have higher priority things to worry about.  ;-)
> > > > > 
> > > > > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > > > > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > > > > quickly try something to see if the issue goes away as well.
> > > > 
> > > > The other approaches that occur to me are:
> > > > 
> > > > 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> > > > 	tricky as well.
> > > > 
> > > > 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> > > > 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> > > > 	with boosting but without CPU hotplug.	Then remove boosting
> > > > 	from TREE04.
> > > > 
> > > > Of these, #2 seems most productive.  But is there a better way?
> > > 
> > > We could have the gp thread at higher priority for TREE03. What I see
> > > consistently is that the GP thread gets migrated from CPU M to CPU N only to
> > > be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
> > > which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
> > > give less of a run-around to rcu_preempt maybe enough to prevent the grace
> > > period from stalling. I am not sure if this will fix it, but I am running a
> > > test to see how it goes, will let you know.
> > 
> > That led to a lot of fireworks. :-) I am thinking though, do we really need
> > to run a boost kthread on all CPUs? I think that might be the root cause
> > because the boost threads run on all CPUs except perhaps the one dying.
> > 
> > We could run them on just the odd, or even ones and still be able to get
> > sufficient boost testing. This may be especially important without RT
> > throttling. I'll go ahead and queue a test like that.
> 
> Sorry if I am too noisy. So far only letting the rcutorture boost threads
> exist on odd CPUs, I am seeing the issue go away (but I'm running an extended
> test to confirm).
> 
> On the other hand, I came up with a real fix [1] and I am currently testing it.
> This is to fix a live lock between RT push and CPU hotplug's
> select_fallback_rq()-induced push. I am not sure if the fix works but I have
> some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> the real fix is needed to prevent these issues even if we're able to hide it
> by halving the total rcutorture boost threads.

This don't-schedule-on-dying CPUs approach does quite look promising
to me!

Then again, I cannot claim to be a scheduler expert.  And I am a bit
surprised that this does not already happen.  Which makes me wonder
(admittedly without evidence either way) whether there is some CPU-hotplug
race that it might induce.  But then again, figuring this sort of thing
out is what part of the scheduler guys are there for, right?  ;-)

							Thanx, Paul

> [1]
> ---8<-----------------------
> 
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] Fix livelock between RT and select_fallback_rq
> 
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  kernel/sched/rt.c | 4 ++--
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 00e0e5074115..b92aab35d7ec 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1945,7 +1945,7 @@ static int find_lowest_rq(struct task_struct *task)
>  
>  			best_cpu = cpumask_any_and_distribute(lowest_mask,
>  							      sched_domain_span(sd));
> -			if (best_cpu < nr_cpu_ids) {
> +			if (best_cpu < nr_cpu_ids && !cpu_dying(best_cpu)) {
>  				rcu_read_unlock();
>  				return best_cpu;
>  			}
> @@ -1962,7 +1962,7 @@ static int find_lowest_rq(struct task_struct *task)
>  		return this_cpu;
>  
>  	cpu = cpumask_any_distribute(lowest_mask);
> -	if (cpu < nr_cpu_ids)
> +	if (cpu < nr_cpu_ids && !cpu_dying(cpu))
>  		return cpu;
>  
>  	return -1;
> -- 
> 2.42.0.459.ge4e396fd5e-goog
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15 11:33                       ` Joel Fernandes
@ 2023-09-15 14:53                         ` Paul E. McKenney
  2023-09-15 16:37                           ` Joel Fernandes
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-15 14:53 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> > On Thu, Sep 14, 2023 at 09:53:24PM +0000, Joel Fernandes wrote:
> > > On Thu, Sep 14, 2023 at 06:56:27PM +0000, Joel Fernandes wrote:
> > > > On Thu, Sep 14, 2023 at 08:23:38AM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> > > > > > On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > > > > > > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > > > > > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > > [..]
> > > > > > > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > > > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > > > > > > >
> > > > > > > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > > > > > > for several hours.
> > > > > > > > >
> > > > > > > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > > > > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > > > > > > other than CPU hotplug that makes this happen.
> > > > > > > > 
> > > > > > > > Hi Paul,
> > > > > > > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > > > > > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > > > > > > boost testing and seeing if that fixes it. If it does, I can move on
> > > > > > > > to the next thing in my backlog.
> > > > > > > > 
> > > > > > > > What do you think? Or should I spend more time root-causing it? It is
> > > > > > > > most like runaway RT threads combined with the CPU hotplug threads,
> > > > > > > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > > > > > > say for sure without more/better tracing (Speaking of better tracing,
> > > > > > > > I am adding core-dump support to rcutorture, but it is not there yet).
> > > > > > > 
> > > > > > > This would not be the first time rcutorture has had trouble with those
> > > > > > > threads, so I am for adding the cpus_read_lock().
> > > > > > > 
> > > > > > > Additional root-causing might be helpful, but then again, you might
> > > > > > > have higher priority things to worry about.  ;-)
> > > > > > 
> > > > > > No worries. Unfortunately putting cpus_read_lock() around the boost test
> > > > > > causes hangs. I tried something like the following [1]. If you have a diff, I can
> > > > > > quickly try something to see if the issue goes away as well.
> > > > > 
> > > > > The other approaches that occur to me are:
> > > > > 
> > > > > 1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
> > > > > 	tricky as well.
> > > > > 
> > > > > 2.	Rearrange the testing to convert one of the TREE0* scenarios that
> > > > > 	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
> > > > > 	with boosting but without CPU hotplug.	Then remove boosting
> > > > > 	from TREE04.
> > > > > 
> > > > > Of these, #2 seems most productive.  But is there a better way?
> > > > 
> > > > We could have the gp thread at higher priority for TREE03. What I see
> > > > consistently is that the GP thread gets migrated from CPU M to CPU N only to
> > > > be immediately sent back. Dumping the state showed CPU N is running ksoftirqd
> > > > which is also a rt priority 2.  Making rcu_preempt 3 and ksoftirqd 2 might
> > > > give less of a run-around to rcu_preempt maybe enough to prevent the grace
> > > > period from stalling. I am not sure if this will fix it, but I am running a
> > > > test to see how it goes, will let you know.
> > > 
> > > That led to a lot of fireworks. :-) I am thinking though, do we really need
> > > to run a boost kthread on all CPUs? I think that might be the root cause
> > > because the boost threads run on all CPUs except perhaps the one dying.
> > > 
> > > We could run them on just the odd, or even ones and still be able to get
> > > sufficient boost testing. This may be especially important without RT
> > > throttling. I'll go ahead and queue a test like that.
> > 
> > Sorry if I am too noisy. So far only letting the rcutorture boost threads
> > exist on odd CPUs, I am seeing the issue go away (but I'm running an extended
> > test to confirm).
> > 
> > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > This is to fix a live lock between RT push and CPU hotplug's
> > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > the real fix is needed to prevent these issues even if we're able to hide it
> > by halving the total rcutorture boost threads.
> 
> So that fixed it without any changes to RCU. Below is the updated patch also
> for the archives. Though I'm rewriting it slightly differently and testing
> that more. The main thing I am doing in the new patch is I find that RT
> should not select !cpu_active() CPUs since those have the scheduler turned
> off. Though checking for cpu_dying() also works. I could not find any
> instance where cpu_dying() != cpu_active() but there could be a tiny window
> where that is true. Anyway, I'll make some noise with scheduler folks once I
> have the new version of the patch tested.
> 
> Also halving the number of RT boost threads makes it less likely to occur but
> does not work. Not too surprising since the issue actually may not be related
> to too many RT threads but rather a lockup between hotplug and RT..

Again, looks promising!  When I get the non-RCU -rcu stuff moved to
v6.6-rc1 and appropriately branched and tested, I will give it a go on
the test setup here.

							Thanx, Paul

> ---8<-----------------------
> 
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] Fix livelock between RT and select_fallback_rq
> 
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  kernel/sched/rt.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 00e0e5074115..a089d6f24e5b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -526,6 +526,11 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
>  }
>  #endif
>  
> +static inline bool rt_task_fits_in_cpu(struct task_struct *p, int cpu)
> +{
> +	return rt_task_fits_capacity(p, cpu) && !cpu_dying(cpu);
> +}
> +
>  #ifdef CONFIG_RT_GROUP_SCHED
>  
>  static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
> @@ -1641,14 +1646,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>  	       unlikely(rt_task(curr)) &&
>  	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
>  
> -	if (test || !rt_task_fits_capacity(p, cpu)) {
> +	if (test || !rt_task_fits_in_cpu(p, cpu)) {
>  		int target = find_lowest_rq(p);
>  
>  		/*
>  		 * Bail out if we were forcing a migration to find a better
>  		 * fitting CPU but our search failed.
>  		 */
> -		if (!test && target != -1 && !rt_task_fits_capacity(p, target))
> +		if (!test && target != -1 && !rt_task_fits_in_cpu(p, target))
>  			goto out_unlock;
>  
>  		/*
> @@ -1892,21 +1897,9 @@ static int find_lowest_rq(struct task_struct *task)
>  	if (task->nr_cpus_allowed == 1)
>  		return -1; /* No other targets possible */
>  
> -	/*
> -	 * If we're on asym system ensure we consider the different capacities
> -	 * of the CPUs when searching for the lowest_mask.
> -	 */
> -	if (sched_asym_cpucap_active()) {
> -
> -		ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> +	ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
>  					  task, lowest_mask,
> -					  rt_task_fits_capacity);
> -	} else {
> -
> -		ret = cpupri_find(&task_rq(task)->rd->cpupri,
> -				  task, lowest_mask);
> -	}
> -
> +					  rt_task_fits_in_cpu);
>  	if (!ret)
>  		return -1; /* No targets found */
>  
> -- 
> 2.42.0.459.ge4e396fd5e-goog
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15 14:53                         ` Paul E. McKenney
@ 2023-09-15 16:37                           ` Joel Fernandes
  2023-09-15 16:57                             ` Paul E. McKenney
  2023-09-18  6:05                             ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Joel Fernandes @ 2023-09-15 16:37 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, rcu

Hi Paul,
Thanks! I merged replies to 3 threads into this one to make it easier to follow.

On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
[...]
> > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > This is to fix a live lock between RT push and CPU hotplug's
> > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > the real fix is needed to prevent these issues even if we're able to hide it
> > by halving the total rcutorture boost threads.
> 
> This don't-schedule-on-dying CPUs approach does quite look promising
> to me!
> 
> Then again, I cannot claim to be a scheduler expert.  And I am a bit
> surprised that this does not already happen.  Which makes me wonder
> (admittedly without evidence either way) whether there is some CPU-hotplug
> race that it might induce.  But then again, figuring this sort of thing
> out is what part of the scheduler guys are there for, right?  ;-)

Yes it looks promising. Actually this sort of thing also seems to be already
done in CFS, it just not there in RT. So maybe it is OK. Testing so far
showed me pretty good results even with hotplug testing. Here is hoping!

> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a live lock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide it
> > > by halving the total rcutorture boost threads.
> > 
> > So that fixed it without any changes to RCU. Below is the updated patch also
> > for the archives. Though I'm rewriting it slightly differently and testing
> > that more. The main thing I am doing in the new patch is I find that RT
> > should not select !cpu_active() CPUs since those have the scheduler turned
> > off. Though checking for cpu_dying() also works. I could not find any
> > instance where cpu_dying() != cpu_active() but there could be a tiny window
> > where that is true. Anyway, I'll make some noise with scheduler folks once I
> > have the new version of the patch tested.
> > 
> > Also halving the number of RT boost threads makes it less likely to occur but
> > does not work. Not too surprising since the issue actually may not be related
> > to too many RT threads but rather a lockup between hotplug and RT..
> 
> Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> v6.6-rc1 and appropriately branched and tested, I will give it a go on
> the test setup here.

Thanks a lot, and I have enclosed a simpler updated patch below which also
similarly shows very good results. This is the one I would like to test
more and send to scheduler folks. I'll send it out once I have it tested more
and also possibly after seeing your results (I am on vacation next week so
there's time).

> > We could run them on just the odd, or even ones and still be able to get
> > sufficient boost testing. This may be especially important without RT
> > throttling. I'll go ahead and queue a test like that.
> > 
> > Thoughts?
> 
> The problem with this is that it will often render RCU priority boosting
> unnecessary.  Any kthread preempted within an RCU read-side critical
> section will with high probability quickly be resumed on one of the
> even-numbered CPUs.
> 
> Or were you also planning to bind the rcu_torture_reader() kthreads to
> a specific CPU, preventing such migration?  Or am I missing something
> here?

You are right, I see now why you were running them on all CPUs. One note
though, af the RCU reader threads are CFS, they will not be immediately
pushed out so maybe it may work (unlike if the RCU reader threads being
preempted were RT). However testing shows thread-halving does not fix the
issue anyway so we can scratch that. Instead, below is the updated patch for
don't schedule-on-dying/inactive CPUs approach which is showing really good
results!

And I'll most likely see you the week after, after entering into a 4-day
quiescent state. ;-)

thanks,

 - Joel

---8<-----------------------

From: Joel Fernandes (Google) <joel@joelfernandes.org>
Subject: [PATCH] RT: Alternative fix for livelock with hotplug

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/cpupri.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index a286e726eb4b..42c40cfdf836 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 
 	if (lowest_mask) {
 		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15 16:37                           ` Joel Fernandes
@ 2023-09-15 16:57                             ` Paul E. McKenney
  2023-09-15 21:14                               ` Joel Fernandes
  2023-09-18  6:05                             ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-15 16:57 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Fri, Sep 15, 2023 at 04:37:11PM +0000, Joel Fernandes wrote:
> Hi Paul,
> Thanks! I merged replies to 3 threads into this one to make it easier to follow.
> 
> On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> [...]
> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a live lock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide it
> > > by halving the total rcutorture boost threads.
> > 
> > This don't-schedule-on-dying CPUs approach does quite look promising
> > to me!
> > 
> > Then again, I cannot claim to be a scheduler expert.  And I am a bit
> > surprised that this does not already happen.  Which makes me wonder
> > (admittedly without evidence either way) whether there is some CPU-hotplug
> > race that it might induce.  But then again, figuring this sort of thing
> > out is what part of the scheduler guys are there for, right?  ;-)
> 
> Yes it looks promising. Actually this sort of thing also seems to be already
> done in CFS, it just not there in RT. So maybe it is OK. Testing so far
> showed me pretty good results even with hotplug testing. Here is hoping!
> 
> > > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > > the real fix is needed to prevent these issues even if we're able to hide it
> > > > by halving the total rcutorture boost threads.
> > > 
> > > So that fixed it without any changes to RCU. Below is the updated patch also
> > > for the archives. Though I'm rewriting it slightly differently and testing
> > > that more. The main thing I am doing in the new patch is I find that RT
> > > should not select !cpu_active() CPUs since those have the scheduler turned
> > > off. Though checking for cpu_dying() also works. I could not find any
> > > instance where cpu_dying() != cpu_active() but there could be a tiny window
> > > where that is true. Anyway, I'll make some noise with scheduler folks once I
> > > have the new version of the patch tested.
> > > 
> > > Also halving the number of RT boost threads makes it less likely to occur but
> > > does not work. Not too surprising since the issue actually may not be related
> > > to too many RT threads but rather a lockup between hotplug and RT..
> > 
> > Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > the test setup here.
> 
> Thanks a lot, and I have enclosed a simpler updated patch below which also
> similarly shows very good results. This is the one I would like to test
> more and send to scheduler folks. I'll send it out once I have it tested more
> and also possibly after seeing your results (I am on vacation next week so
> there's time).

Much nicer!  This is just on current mainline, correct?

> > > We could run them on just the odd, or even ones and still be able to get
> > > sufficient boost testing. This may be especially important without RT
> > > throttling. I'll go ahead and queue a test like that.
> > > 
> > > Thoughts?
> > 
> > The problem with this is that it will often render RCU priority boosting
> > unnecessary.  Any kthread preempted within an RCU read-side critical
> > section will with high probability quickly be resumed on one of the
> > even-numbered CPUs.
> > 
> > Or were you also planning to bind the rcu_torture_reader() kthreads to
> > a specific CPU, preventing such migration?  Or am I missing something
> > here?
> 
> You are right, I see now why you were running them on all CPUs. One note
> though, af the RCU reader threads are CFS, they will not be immediately
> pushed out so maybe it may work (unlike if the RCU reader threads being
> preempted were RT). However testing shows thread-halving does not fix the
> issue anyway so we can scratch that. Instead, below is the updated patch for
> don't schedule-on-dying/inactive CPUs approach which is showing really good
> results!

I will admit that you did have me going there for a minute or two.  ;-)

> And I'll most likely see you the week after, after entering into a 4-day
> quiescent state. ;-)

Have a great time away!!!

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> Subject: [PATCH] RT: Alternative fix for livelock with hotplug
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/cpupri.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index a286e726eb4b..42c40cfdf836 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
>  
>  	if (lowest_mask) {
>  		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> +		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
>  
>  		/*
>  		 * We have to ensure that we have at least one bit
> -- 
> 2.42.0.459.ge4e396fd5e-goog
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15 16:57                             ` Paul E. McKenney
@ 2023-09-15 21:14                               ` Joel Fernandes
  0 siblings, 0 replies; 23+ messages in thread
From: Joel Fernandes @ 2023-09-15 21:14 UTC (permalink / raw)
  To: paulmck; +Cc: Frederic Weisbecker, rcu

On Fri, Sep 15, 2023 at 12:57 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
[...]
> > > > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > > > the real fix is needed to prevent these issues even if we're able to hide it
> > > > > by halving the total rcutorture boost threads.
> > > >
> > > > So that fixed it without any changes to RCU. Below is the updated patch also
> > > > for the archives. Though I'm rewriting it slightly differently and testing
> > > > that more. The main thing I am doing in the new patch is I find that RT
> > > > should not select !cpu_active() CPUs since those have the scheduler turned
> > > > off. Though checking for cpu_dying() also works. I could not find any
> > > > instance where cpu_dying() != cpu_active() but there could be a tiny window
> > > > where that is true. Anyway, I'll make some noise with scheduler folks once I
> > > > have the new version of the patch tested.
> > > >
> > > > Also halving the number of RT boost threads makes it less likely to occur but
> > > > does not work. Not too surprising since the issue actually may not be related
> > > > to too many RT threads but rather a lockup between hotplug and RT..
> > >
> > > Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> > > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > > the test setup here.
> >
> > Thanks a lot, and I have enclosed a simpler updated patch below which also
> > similarly shows very good results. This is the one I would like to test
> > more and send to scheduler folks. I'll send it out once I have it tested more
> > and also possibly after seeing your results (I am on vacation next week so
> > there's time).
>
> Much nicer!  This is just on current mainline, correct?

Yes, correct. I also applied it cleanly to all stable kernels for my
test rigs. Only 5.10 had a little merge conflict but it was trivially
fixed.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
  2023-09-15 16:37                           ` Joel Fernandes
  2023-09-15 16:57                             ` Paul E. McKenney
@ 2023-09-18  6:05                             ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2023-09-18  6:05 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Frederic Weisbecker, rcu

On Fri, Sep 15, 2023 at 04:37:11PM +0000, Joel Fernandes wrote:
> Hi Paul,
> Thanks! I merged replies to 3 threads into this one to make it easier to follow.
> 
> On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> [...]
> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a live lock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide it
> > > by halving the total rcutorture boost threads.
> > 
> > This don't-schedule-on-dying CPUs approach does quite look promising
> > to me!
> > 
> > Then again, I cannot claim to be a scheduler expert.  And I am a bit
> > surprised that this does not already happen.  Which makes me wonder
> > (admittedly without evidence either way) whether there is some CPU-hotplug
> > race that it might induce.  But then again, figuring this sort of thing
> > out is what part of the scheduler guys are there for, right?  ;-)
> 
> Yes it looks promising. Actually this sort of thing also seems to be already
> done in CFS, it just not there in RT. So maybe it is OK. Testing so far
> showed me pretty good results even with hotplug testing. Here is hoping!
> 
> > > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > > the real fix is needed to prevent these issues even if we're able to hide it
> > > > by halving the total rcutorture boost threads.
> > > 
> > > So that fixed it without any changes to RCU. Below is the updated patch also
> > > for the archives. Though I'm rewriting it slightly differently and testing
> > > that more. The main thing I am doing in the new patch is I find that RT
> > > should not select !cpu_active() CPUs since those have the scheduler turned
> > > off. Though checking for cpu_dying() also works. I could not find any
> > > instance where cpu_dying() != cpu_active() but there could be a tiny window
> > > where that is true. Anyway, I'll make some noise with scheduler folks once I
> > > have the new version of the patch tested.
> > > 
> > > Also halving the number of RT boost threads makes it less likely to occur but
> > > does not work. Not too surprising since the issue actually may not be related
> > > to too many RT threads but rather a lockup between hotplug and RT..
> > 
> > Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > the test setup here.
> 
> Thanks a lot, and I have enclosed a simpler updated patch below which also
> similarly shows very good results. This is the one I would like to test
> more and send to scheduler folks. I'll send it out once I have it tested more
> and also possibly after seeing your results (I am on vacation next week so
> there's time).
> 
> > > We could run them on just the odd, or even ones and still be able to get
> > > sufficient boost testing. This may be especially important without RT
> > > throttling. I'll go ahead and queue a test like that.
> > > 
> > > Thoughts?
> > 
> > The problem with this is that it will often render RCU priority boosting
> > unnecessary.  Any kthread preempted within an RCU read-side critical
> > section will with high probability quickly be resumed on one of the
> > even-numbered CPUs.
> > 
> > Or were you also planning to bind the rcu_torture_reader() kthreads to
> > a specific CPU, preventing such migration?  Or am I missing something
> > here?
> 
> You are right, I see now why you were running them on all CPUs. One note
> though, af the RCU reader threads are CFS, they will not be immediately
> pushed out so maybe it may work (unlike if the RCU reader threads being
> preempted were RT). However testing shows thread-halving does not fix the
> issue anyway so we can scratch that. Instead, below is the updated patch for
> don't schedule-on-dying/inactive CPUs approach which is showing really good
> results!
> 
> And I'll most likely see you the week after, after entering into a 4-day
> quiescent state. ;-)

And I queued this as an experimental patch and am starting testing.
I am assuming that the path to mainline is through the scheduler tree.

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> Subject: [PATCH] RT: Alternative fix for livelock with hotplug
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/cpupri.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index a286e726eb4b..42c40cfdf836 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
>  
>  	if (lowest_mask) {
>  		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> +		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
>  
>  		/*
>  		 * We have to ensure that we have at least one bit
> -- 
> 2.42.0.459.ge4e396fd5e-goog
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-09-18  6:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-10 20:14 [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..) Joel Fernandes
2023-09-10 21:16 ` Paul E. McKenney
2023-09-10 23:37   ` Joel Fernandes
2023-09-11  2:27     ` Joel Fernandes
2023-09-11  8:16       ` Paul E. McKenney
2023-09-11 13:17         ` Joel Fernandes
2023-09-11 13:49           ` Paul E. McKenney
2023-09-11 16:18             ` Joel Fernandes
2023-09-13 20:30         ` Joel Fernandes
2023-09-14 11:11           ` Paul E. McKenney
2023-09-14 13:13             ` Joel Fernandes
2023-09-14 15:23               ` Paul E. McKenney
2023-09-14 18:56                 ` Joel Fernandes
2023-09-14 21:53                   ` Joel Fernandes
2023-09-15  0:13                     ` Joel Fernandes
2023-09-15 11:33                       ` Joel Fernandes
2023-09-15 14:53                         ` Paul E. McKenney
2023-09-15 16:37                           ` Joel Fernandes
2023-09-15 16:57                             ` Paul E. McKenney
2023-09-15 21:14                               ` Joel Fernandes
2023-09-18  6:05                             ` Paul E. McKenney
2023-09-15 14:48                       ` Paul E. McKenney
2023-09-15 14:45                     ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.