* sched/core warning triggers on rcu torture test
@ 2018-06-26 16:16 Anna-Maria Gleixner
2018-06-26 16:32 ` Peter Zijlstra
2018-06-27 11:29 ` Frederic Weisbecker
0 siblings, 2 replies; 9+ messages in thread
From: Anna-Maria Gleixner @ 2018-06-26 16:16 UTC (permalink / raw)
To: linux-kernel
Cc: Paul E. McKenney, Thomas Gleixner, Frederic Weisbecker,
Peter Zijlstra
Hi,
During rcutorture tests (TREE04 and TREE07) I noticed that a
WARN_ON_ONCE() in sched core triggers on a recent 4.18-rc2 based
kernel (6f0d349d922b ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")) as well as
on 4.17.3.
I'm running the tests on a machine with 144 cores:
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 144 --duration 120 --configs "9*TREE07"
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 144 --duration 120 --configs "18*TREE04"
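As a rough check of how these two invocations relate to the 144 host CPUs: each "N*SCENARIO" entry requests N copies of a scenario, so the CPU demand is N times the CPUs of one guest. The sketch below only does that arithmetic. The per-guest counts (8 for TREE04, 16 for TREE07) are the figures quoted later in this thread, not values read from the scenario files, and the helper name cpus_needed is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of rcutorture): CPUs requested by a
# "COUNT*SCENARIO" entry, given the CPUs used by one guest.
cpus_needed() {
    count=$1
    cpus_per_guest=$2
    echo $((count * cpus_per_guest))
}

# Per-guest CPU counts as quoted in this thread (assumed here).
tree04=$(cpus_needed 18 8)
tree07=$(cpus_needed 9 16)

echo "18*TREE04 -> $tree04 CPUs"
echo "9*TREE07 -> $tree07 CPUs"
```

Under those assumptions both runs come out at 144 CPUs, so every host CPU would be claimed by a guest vCPU.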
The warning was introduced by commit d84b31313ef8 ("sched/isolation:
Offload residual 1Hz scheduler tick").
Output looks similar for all tests I did (this one is the output of
the 4.18-rc2 based kernel):
WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
Modules linked in:
CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: events_unbound sched_tick_remote
RIP: 0010:sched_tick_remote+0xb6/0xc0
Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
Call Trace:
process_one_work+0x1df/0x3b0
worker_thread+0x44/0x3d0
kthread+0xf3/0x130
? set_worker_desc+0xb0/0xb0
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x35/0x40
---[ end trace 7c99b83eb0ec64e8 ]---
Do you need some more information?
Thanks,
Anna-Maria
* Re: sched/core warning triggers on rcu torture test
2018-06-26 16:16 sched/core warning triggers on rcu torture test Anna-Maria Gleixner
@ 2018-06-26 16:32 ` Peter Zijlstra
2018-06-26 17:48 ` Paul E. McKenney
2018-06-27 11:29 ` Frederic Weisbecker
1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-06-26 16:32 UTC (permalink / raw)
To: Anna-Maria Gleixner
Cc: linux-kernel, Paul E. McKenney, Thomas Gleixner, Frederic Weisbecker

On Tue, Jun 26, 2018 at 06:16:04PM +0200, Anna-Maria Gleixner wrote:
> [...]
> The warning was introduced by commit d84b31313ef8 ("sched/isolation:
> Offload residual 1Hz scheduler tick").
>
> Output looks similar for all tests I did (this one is the output of
> the 4.18-rc2 based kernel):
>
> WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0

That's nohz_full stuff, is that a normal part of rcutorture? In any
case, is the one housekeeping CPU getting seriously overloaded or
something?
* Re: sched/core warning triggers on rcu torture test
2018-06-26 16:32 ` Peter Zijlstra
@ 2018-06-26 17:48 ` Paul E. McKenney
2018-06-27 10:40 ` Frederic Weisbecker
0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Anna-Maria Gleixner, linux-kernel, Thomas Gleixner, Frederic Weisbecker

On Tue, Jun 26, 2018 at 06:32:55PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 06:16:04PM +0200, Anna-Maria Gleixner wrote:
> > [...] a WARN_ON_ONCE() in sched core triggers on a recent 4.18-rc2
> > based kernel [...]

First, I am very glad that I am not the only one running rcutorture! ;-)

> > [...]
> > WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
>
> That's nohz_full stuff, is that a normal part of rcutorture? In any
> case, is the one housekeeping CPU getting seriously overloaded or
> something?

Yes, nohz_full is a normal part of rcutorture because RCU has to deal
differently with userspace execution in the nohz_full case.

I do see this splat (at least when I don't comment it out), but I
do share my system with others, so I could easily be overloading the
housekeeping vCPUs due to hypervisor preemption. I was intending to
dig into this one once I got done consolidating RCU-bh, RCU-preempt,
and RCU-sched at Linus's behest.

On overloading the housekeeping CPU without outside load, let's look at
TREE04 and TREE07 separately.

TREE04 uses eight CPUs, and seven of them ("nohz_full=1-7") are nohz_full
CPUs, and rcutorture doesn't generate all that large of a callback load.
It looks like all 144 CPUs are used in this case (18*8), though RCU
enforces idle periods in order to test idle/non-idle transitions.
But was there anything else running on the machine at the time?

TREE07 uses 16 CPUs, and eight of them ("nohz_full=2-9") are nohz_full
CPUs. Again, it looks like all 144 CPUs are used (9*16).

I sometimes see this on TASKS03 as well, which uses two CPUs, and one of
them ("nohz_full=1") is a nohz_full CPU.

If your system is otherwise idle, would it make sense to trace context
switches on CPU 0 to see what it is up to? And to do an ftrace_dump()
and turn tracing off when the warning triggers as well?

Thanx, Paul
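One possible way to approximate the tracing suggestion above without patching the kernel is to pass tracing options on the guest kernel command line. The sketch below only assembles and prints such a kvm.sh invocation; it does not run anything. It assumes that the trace_event=, ftrace_dump_on_oops and panic_on_warn boot parameters and the kvm.sh --bootargs option are available in the tree under test. Note the differences from the suggestion: it traces all guest CPUs rather than only CPU 0, and it turns the first warning into a panic so that the trace buffer is dumped to the console, which ends that run. The function names are illustrative only.

```shell
#!/bin/sh
# Sketch only: build (and print, not run) a kvm.sh command line.
# Function names are hypothetical, not part of rcutorture.
KVM_SH=tools/testing/selftests/rcutorture/bin/kvm.sh

trace_bootargs() {
    # trace_event=        enable the sched_switch tracepoint at boot
    # ftrace_dump_on_oops dump the trace buffer to the console on oops/panic
    # panic_on_warn       escalate the first WARN*() to a panic
    echo "trace_event=sched:sched_switch ftrace_dump_on_oops panic_on_warn"
}

kvm_cmdline() {
    configs=$1
    printf '%s --cpus 144 --duration 120 --configs "%s" --bootargs "%s"\n' \
        "$KVM_SH" "$configs" "$(trace_bootargs)"
}

kvm_cmdline "18*TREE04"
```

Printing the command first makes it easy to review the boot arguments before committing 144 CPUs to a two-hour run.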
* Re: sched/core warning triggers on rcu torture test
2018-06-26 17:48 ` Paul E. McKenney
@ 2018-06-27 10:40 ` Frederic Weisbecker
2018-06-27 14:25 ` Paul E. McKenney
0 siblings, 1 reply; 9+ messages in thread
From: Frederic Weisbecker @ 2018-06-27 10:40 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Anna-Maria Gleixner, linux-kernel, Thomas Gleixner, Frederic Weisbecker

On Tue, Jun 26, 2018 at 10:48:26AM -0700, Paul E. McKenney wrote:
> [...]
> If your system is otherwise idle, would it make sense to trace context
> switches on CPU 0 to see what it is up to? And to do an ftrace_dump()
> and turn tracing off when the warning triggers as well?

Yeah, you guys have reported this warning to me a few times before. I
didn't manage to reproduce it because I fought and failed with a high
NR_CPUS machine. But apparently 8 CPUs are enough. Let me try that
with TREE04.

Thanks.
* Re: sched/core warning triggers on rcu torture test
2018-06-27 10:40 ` Frederic Weisbecker
@ 2018-06-27 14:25 ` Paul E. McKenney
2018-06-28 16:33 ` Frederic Weisbecker
0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2018-06-27 14:25 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, Anna-Maria Gleixner, linux-kernel, Thomas Gleixner, Frederic Weisbecker

On Wed, Jun 27, 2018 at 12:40:15PM +0200, Frederic Weisbecker wrote:
> [...] But apparently 8 CPUs
> are enough. Let me try that with TREE04.

Looking forward to hearing what you find!

Thanx, Paul
* Re: sched/core warning triggers on rcu torture test
2018-06-27 14:25 ` Paul E. McKenney
@ 2018-06-28 16:33 ` Frederic Weisbecker
2018-06-28 16:44 ` Paul E. McKenney
0 siblings, 1 reply; 9+ messages in thread
From: Frederic Weisbecker @ 2018-06-28 16:33 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Anna-Maria Gleixner, linux-kernel, Thomas Gleixner, Frederic Weisbecker

On Wed, Jun 27, 2018 at 07:25:29AM -0700, Paul E. McKenney wrote:
> [...]
> Looking forward to hearing what you find!

Please check "[PATCH] sched/nohz: Skip remote tick on idle task entirely" which I
just posted. Hopefully the warning didn't trigger for another reason in
your testing.

Thanks.
* Re: sched/core warning triggers on rcu torture test
2018-06-28 16:33 ` Frederic Weisbecker
@ 2018-06-28 16:44 ` Paul E. McKenney
2018-06-28 19:04 ` Paul E. McKenney
0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2018-06-28 16:44 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, Anna-Maria Gleixner, linux-kernel, Thomas Gleixner, Frederic Weisbecker

On Thu, Jun 28, 2018 at 06:33:24PM +0200, Frederic Weisbecker wrote:
> [...]
> Please check "[PATCH] sched/nohz: Skip remote tick on idle task entirely" which I
> just posted. [...]

Very cool, thank you! Firing up rcutorture with this now.

Thanx, Paul
* Re: sched/core warning triggers on rcu torture test
2018-06-28 16:44 ` Paul E. McKenney
@ 2018-06-28 19:04 ` Paul E. McKenney
0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2018-06-28 19:04 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, Anna-Maria Gleixner, linux-kernel, Thomas Gleixner, Frederic Weisbecker

On Thu, Jun 28, 2018 at 09:44:48AM -0700, Paul E. McKenney wrote:
> [...]
> Very cool, thank you! Firing up rcutorture with this now.

And the three scenarios (TASKS03, TREE04, and TREE07) that reliably give
me a splat without this patch are properly silent with it. So:

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

And thank you very much!

Thanx, Paul
* Re: sched/core warning triggers on rcu torture test
2018-06-26 16:16 sched/core warning triggers on rcu torture test Anna-Maria Gleixner
2018-06-26 16:32 ` Peter Zijlstra
@ 2018-06-27 11:29 ` Frederic Weisbecker
1 sibling, 0 replies; 9+ messages in thread
From: Frederic Weisbecker @ 2018-06-27 11:29 UTC (permalink / raw)
To: Anna-Maria Gleixner
Cc: linux-kernel, Paul E. McKenney, Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra

On Tue, Jun 26, 2018 at 06:16:04PM +0200, Anna-Maria Gleixner wrote:
> [...]
> WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
> [...]
> Do you need some more information?

OK, so now I can reproduce it immediately after boot; time for me to
debug :-)

Thanks.
end of thread, other threads:[~2018-06-28 19:02 UTC | newest]
Thread overview: 9+ messages
2018-06-26 16:16 sched/core warning triggers on rcu torture test Anna-Maria Gleixner
2018-06-26 16:32 ` Peter Zijlstra
2018-06-26 17:48   ` Paul E. McKenney
2018-06-27 10:40     ` Frederic Weisbecker
2018-06-27 14:25       ` Paul E. McKenney
2018-06-28 16:33         ` Frederic Weisbecker
2018-06-28 16:44           ` Paul E. McKenney
2018-06-28 19:04             ` Paul E. McKenney
2018-06-27 11:29 ` Frederic Weisbecker