linux-mm.kvack.org archive mirror
* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
       [not found]                   ` <20161128110449.GK14788@dhcp22.suse.cz>
@ 2016-11-28 12:26                     ` Paul Menzel
  2016-11-30 10:28                       ` Donald Buczek
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menzel @ 2016-11-28 12:26 UTC (permalink / raw)
  To: Michal Hocko, Donald Buczek; +Cc: dvteam, linux-mm, linux-kernel, Josh Triplett

+linux-mm@kvack.org
-linux-xfs@vger.kernel.org

Dear Michal,


Thank you for your reply, and for looking at the log files.

On 11/28/16 12:04, Michal Hocko wrote:
> On Sun 27-11-16 10:19:06, Donald Buczek wrote:
>> On 24.11.2016 11:15, Michal Hocko wrote:
>>> On Mon 21-11-16 16:35:53, Donald Buczek wrote:
>>> [...]
>>>> Hello,
>>>>
>>>> thanks a lot for looking into this!
>>>>
>>>> Let me add some information from the reporting site:
>>>>
>>>> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
>>>> 2016)  and it doesn't shut up the rcu stall warnings.
>>>>
>>>> * Log file from a boot with the patch applied ( grep kernel
>>>> /var/log/messages ) is here:
>>>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
>>>>
>>>> * This system is a backup server and walks over thousands of files sometimes
>>>> with multiple parallel rsync processes.
>>>>
>>>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
>>>> 4.8.8, and now 4.9.0-rc5 + Paul's patch.
>>> I assume you haven't tried the Linus 4.8 kernel without any further
>>> stable patches? Just to be sure we are not talking about some later
>>> regression which found its way to the stable tree.
>>
>> We've tried v4.8 and got the first rcu stall warnings with this, too. First
>> one after about 20 hours uptime.
>>
>>
>>>> * When the backups are actually happening there might be relevant memory
>>>> pressure from inode cache and the rsync processes. We saw the oom-killer
>>>> kick in on another machine with same hardware and similar (a bit higher)
>>>> workload. This other machine also shows a lot of rcu stall warnings since
>>>> 4.8.4.
>>>>
>>>> * We see "rcu_sched detected stalls" also on some other machines since we
>>>> switched to 4.8 but not as frequently as on the two backup servers. Usually
>>>> there's "shrink_node" and "kswapd" on the top of the stack. Often
>>>> "xfs_reclaim_inodes" variants on top of that.
>>> I would be interested to see some reclaim tracepoints enabled. Could you
>>> try that out? At least mm_shrink_slab_{start,end} and
>>> mm_vmscan_lru_shrink_inactive. This should tell us more about how the
>>> reclaim behaved.
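[Editorial note: for readers following along, tracepoints like these are usually switched on through tracefs. A hedged sketch, assuming the standard /sys/kernel/debug/tracing layout (root required on a real system); on a machine without tracefs access it only prints the paths it would touch:]

```shell
# Enable the vmscan tracepoints Michal asked for, via tracefs.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

for ev in mm_shrink_slab_start mm_shrink_slab_end mm_vmscan_lru_shrink_inactive; do
    f="$TRACEFS/events/vmscan/$ev/enable"
    if [ -w "$f" ]; then
        echo 1 > "$f"                  # turn the event on
    else
        echo "would enable: $f"        # no writable tracefs here; show the path
    fi
done
# Events can then be read from "$TRACEFS/trace" or "$TRACEFS/trace_pipe".
```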
>>
>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.dmesg.txt  (80K)
>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.trace.txt (50M)
>>
>> Traces wrapped, but the last event is covered. All vmscan events were
>> enabled.
>
> OK, so one of the stall is reported at
> [118077.988410] INFO: rcu_sched detected stalls on CPUs/tasks:
> [118077.988416] 	1-...: (181 ticks this GP) idle=6d5/140000000000000/0 softirq=46417663/46417663 fqs=10691
> [118077.988417] 	(detected by 4, t=60002 jiffies, g=11845915, c=11845914, q=46475)
> [118077.988421] Task dump for CPU 1:
> [118077.988421] kswapd1         R  running task        0    86      2 0x00000008
> [118077.988424]  ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
> [118077.988426]  0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
> [118077.988428]  ffffffff811345f5 ffff88080ad87da0 ffff88100c1e5200 ffff88080ad87dd0
> [118077.988430] Call Trace:
> [118077.988436]  [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
> [118077.988438]  [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
> [118077.988440]  [<ffffffff81135642>] ? kswapd+0x342/0x6b0
>
> the interesting part of the traces would be around the same time:
>         clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
>          kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
>          kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
>          kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
>          kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
>          kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
>          kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
>          kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
> [...]
>          kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
>
> Note the Need resched flag. The IRQ off part is expected because we are
> holding the LRU lock which is IRQ safe. That is not a problem because
> the lock is only held for SWAP_CLUSTER_MAX pages at maximum. It is also
> interesting to see that we have scanned only 1303 pages during that 1
> minute. That would be dead slow. None of them were good enough for the
> reclaim but that doesn't sound like a problem. The trace simply suggests
> that the reclaim was preempted by something else. Otherwise I cannot
> imagine such a slow scanning.
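[Editorial note: the arithmetic behind that observation can be checked directly; the timestamps and nr_scanned values below are taken verbatim from the first and last kswapd1 lines of the quoted trace:]

```python
# nr_scanned/timestamp pairs from the kswapd1 trace excerpt above
first_t, first_scanned = 118023.987475, 4239830
last_t,  last_scanned  = 118084.274403, 4241133

elapsed = last_t - first_t              # the whole stall window, ~60 s
scanned = last_scanned - first_scanned  # pages scanned in that window

print(f"{scanned} pages scanned in {elapsed:.1f} s")
# -> 1303 pages scanned in 60.3 s, i.e. roughly 22 pages/s
```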
>
> Is it possible that something else is hogging the CPU and the RCU just
> happens to blame kswapd which is running in the standard user process
> context?

From looking at the monitoring graphs, there were always enough CPU
resources available. The machine has 12x E5-2630 @ 2.30GHz. So that
shouldn't have been a problem.


Kind regards,

Paul Menzel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
       [not found]             ` <20161128143435.GC3924@linux.vnet.ibm.com>
@ 2016-11-28 14:40               ` Boris Zhmurov
  2016-11-28 15:05                 ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Boris Zhmurov @ 2016-11-28 14:40 UTC (permalink / raw)
  To: paulmck; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm

Paul E. McKenney 28/11/16 17:34:


>> So Paul, I've dropped "mm: Prevent shrink_node_memcg() RCU CPU stall
>> warnings" patch, and stalls got back (attached).
>>
>> With this patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
>> your tree stalls gone. Looks like that.
> 
> So with only this commit and no other commit or configuration adjustment,
> everything works?  Or is the solution this commit and some other stuff?
> 
> The reason I ask is that if just this commit does the trick, I should
> drop the others.


I'd like to ask for some more time to make sure this is it.
Approximately 2 or 3 days.

-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-28 14:40               ` INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node` Boris Zhmurov
@ 2016-11-28 15:05                 ` Paul E. McKenney
  2016-11-28 19:16                   ` Boris Zhmurov
  2016-11-30 17:41                   ` Boris Zhmurov
  0 siblings, 2 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-28 15:05 UTC (permalink / raw)
  To: Boris Zhmurov; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm

On Mon, Nov 28, 2016 at 05:40:48PM +0300, Boris Zhmurov wrote:
> Paul E. McKenney 28/11/16 17:34:
> 
> 
> >> So Paul, I've dropped "mm: Prevent shrink_node_memcg() RCU CPU stall
> >> warnings" patch, and stalls got back (attached).
> >>
> >> With this patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
> >> your tree stalls gone. Looks like that.
> > 
> > So with only this commit and no other commit or configuration adjustment,
> > everything works?  Or is the solution this commit and some other stuff?
> > 
> > The reason I ask is that if just this commit does the trick, I should
> > drop the others.
> 
> I'd like to ask for some more time to make sure this is it.
> Approximately 2 or 3 days.

Works for me!

							Thanx, Paul


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-28 15:05                 ` Paul E. McKenney
@ 2016-11-28 19:16                   ` Boris Zhmurov
  2016-11-29 18:59                     ` Paul E. McKenney
  2016-11-30 17:41                   ` Boris Zhmurov
  1 sibling, 1 reply; 44+ messages in thread
From: Boris Zhmurov @ 2016-11-28 19:16 UTC (permalink / raw)
  To: paulmck; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

Paul E. McKenney 28/11/16 18:05:
> On Mon, Nov 28, 2016 at 05:40:48PM +0300, Boris Zhmurov wrote:
>> Paul E. McKenney 28/11/16 17:34:
>>
>>
>>>> So Paul, I've dropped "mm: Prevent shrink_node_memcg() RCU CPU stall
>>>> warnings" patch, and stalls got back (attached).
>>>>
>>>> With this patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
>>>> your tree stalls gone. Looks like that.
>>>
>>> So with only this commit and no other commit or configuration adjustment,
>>> everything works?  Or is the solution this commit and some other stuff?
>>>
>>> The reason I ask is that if just this commit does the trick, I should
>>> drop the others.
>>
>> I'd like to ask for some more time to make sure this is it.
>> Approximately 2 or 3 days.
> 
> Works for me!
> 
> 							Thanx, Paul


FYI.
Some more stalls with mm-prevent-shrink_node-RCU-CPU-stall-warning.patch
and without mm-prevent-shrink_node_memcg-RCU-CPU-stall-warnings.patch.


-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"

[-- Attachment #2: rcustall-5.txt --]
[-- Type: text/plain, Size: 4540 bytes --]

[26327.859412] INFO: rcu_sched detected stalls on CPUs/tasks:
[26327.859466] 	18-...: (39 ticks this GP) idle=1ed/140000000000000/0 softirq=3790251/3790251 fqs=24 
[26327.859529] 	(detected by 2, t=6429 jiffies, g=1258488, c=1258487, q=6044)
[26327.859583] Task dump for CPU 18:
[26327.859584] kswapd1         R  running task        0   148      2 0x00000008
[26327.859588]  ffff9e779f411400 ffff9e779096fe68 ffff9e8fffffc000 0000000000000000
[26327.859591]  ffffffffa592404d 0000000000000000 0000000000000000 0000000000000000
[26327.859593]  0000000000000000 ffff9e779096fe58 000000000170bf2c ffff9e8fffffc000
[26327.859596] Call Trace:
[26327.859604]  [<ffffffffa592404d>] ? shrink_node+0xcd/0x2f0
[26327.859606]  [<ffffffffa5924cca>] ? kswapd+0x2ba/0x5e0
[26327.859609]  [<ffffffffa5924a10>] ? mem_cgroup_shrink_node+0x90/0x90
[26327.859612]  [<ffffffffa587bce8>] ? kthread+0xb8/0xd0
[26327.859616]  [<ffffffffa5d1311f>] ? ret_from_fork+0x1f/0x40
[26327.859618]  [<ffffffffa587bc30>] ? kthread_create_on_node+0x170/0x170
[26351.132731] INFO: rcu_sched detected stalls on CPUs/tasks:
[26351.132778] 	(detected by 2, t=6432 jiffies, g=1258490, c=1258489, q=7476)
[26351.132835] All QSes seen, last rcu_sched kthread activity 1405 (4302782902-4302781497), jiffies_till_next_fqs=2, root ->qsmask 0x0
[26351.132917] mc:writer_9     R  running task        0 28495   2101 0x00000008
[26351.132921]  ffffffffa623e600 ffffffffa58b5337 0000000000000000 0000000000000000
[26351.132923]  0000000000001d34 ffffffffa623e600 ffffffffa58be772 ffff9e8c0e54f300
[26351.132925]  0000000000000000 ffff9e78027ffb08 000017f78e9681f9 0000000000000001
[26351.132928] Call Trace:
[26351.132929]  <IRQ>  [<ffffffffa58b5337>] ? rcu_check_callbacks+0x727/0x730
[26351.132939]  [<ffffffffa58be772>] ? update_wall_time+0x382/0x710
[26351.132942]  [<ffffffffa58b8093>] ? update_process_times+0x23/0x50
[26351.132947]  [<ffffffffa58c5bad>] ? tick_sched_handle.isra.15+0x2d/0x40
[26351.132949]  [<ffffffffa58c5bf3>] ? tick_sched_timer+0x33/0x60
[26351.132950]  [<ffffffffa58b879d>] ? __hrtimer_run_queues+0x9d/0x110
[26351.132952]  [<ffffffffa58b8cb4>] ? hrtimer_interrupt+0x94/0x190
[26351.132957]  [<ffffffffa5842b74>] ? smp_apic_timer_interrupt+0x34/0x50
[26351.132961]  [<ffffffffa5d13a82>] ? apic_timer_interrupt+0x82/0x90
[26351.132961]  <EOI>  [<ffffffffa5d12b2c>] ? _raw_spin_unlock_irqrestore+0xc/0x20
[26351.132968]  [<ffffffffa591d8fb>] ? pagevec_lru_move_fn+0xab/0xe0
[26351.132969]  [<ffffffffa591cee0>] ? SyS_readahead+0x90/0x90
[26351.132971]  [<ffffffffa591d9bc>] ? __lru_cache_add+0x4c/0x60
[26351.132974]  [<ffffffffa590efa9>] ? add_to_page_cache_lru+0x59/0xc0
[26351.132976]  [<ffffffffa590f89b>] ? pagecache_get_page+0xcb/0x240
[26351.132979]  [<ffffffffa591096d>] ? grab_cache_page_write_begin+0x1d/0x40
[26351.132998]  [<ffffffffc028c3db>] ? ext4_da_write_begin+0x9b/0x330 [ext4]
[26351.133000]  [<ffffffffa5910afe>] ? generic_perform_write+0xbe/0x1a0
[26351.133003]  [<ffffffffa5998126>] ? file_update_time+0x36/0xe0
[26351.133005]  [<ffffffffa59116b0>] ? __generic_file_write_iter+0x170/0x1d0
[26351.133012]  [<ffffffffc0281d4b>] ? ext4_file_write_iter+0x11b/0x320 [ext4]
[26351.133015]  [<ffffffffa588e4ae>] ? set_next_entity+0x6e/0x770
[26351.133017]  [<ffffffffa588d9ab>] ? put_prev_entity+0x5b/0x6f0
[26351.133019]  [<ffffffffa597ea21>] ? __vfs_write+0xc1/0x120
[26351.133021]  [<ffffffffa597f5c8>] ? vfs_write+0xa8/0x1a0
[26351.133023]  [<ffffffffa598084d>] ? SyS_write+0x3d/0xa0
[26351.133025]  [<ffffffffa5d12ef6>] ? entry_SYSCALL_64_fastpath+0x1e/0xa8
[26351.133027] rcu_sched kthread starved for 1405 jiffies! g1258490 c1258489 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
[26351.133097] rcu_sched       R  running task        0     8      2 0x00000000
[26351.133099]  ffff9e7792d45080 0000000000000246 ffff9e7792da0000 ffff9e7792d9fe60
[26351.133102]  0000000100773c3b ffff9e7792d9fe00 ffff9e779fc0fa00 0000000100773c39
[26351.133104]  ffffffffa5d0fc8c ffff9e779fc0fa00 ffffffffa5d122b7 0000000ea58817f6
[26351.133107] Call Trace:
[26351.133112]  [<ffffffffa5d0fc8c>] ? schedule+0x2c/0x80
[26351.133114]  [<ffffffffa5d122b7>] ? schedule_timeout+0x127/0x240
[26351.133116]  [<ffffffffa58b7500>] ? del_timer_sync+0x50/0x50
[26351.133119]  [<ffffffffa58b448a>] ? rcu_gp_kthread+0x37a/0x860
[26351.133121]  [<ffffffffa58b4110>] ? force_qs_rnp+0x180/0x180
[26351.133124]  [<ffffffffa587bce8>] ? kthread+0xb8/0xd0
[26351.133126]  [<ffffffffa5d1311f>] ? ret_from_fork+0x1f/0x40
[26351.133128]  [<ffffffffa587bc30>] ? kthread_create_on_node+0x170/0x170
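[Editorial note: the starvation figure in this log is self-consistent, and its wall-clock meaning depends on the kernel's HZ setting, which the report does not state; a quick check under the usual HZ choices:]

```python
# From the log above: "last rcu_sched kthread activity 1405
# (4302782902-4302781497)"
delta_jiffies = 4302782902 - 4302781497
assert delta_jiffies == 1405  # matches the reported 1405 jiffies

# HZ is a kernel config choice; these are common values, not something
# stated in the report.
for hz in (100, 250, 300, 1000):
    print(f"HZ={hz:4}: {delta_jiffies / hz:.2f} s of rcu_sched starvation")
```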


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-28 19:16                   ` Boris Zhmurov
@ 2016-11-29 18:59                     ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-29 18:59 UTC (permalink / raw)
  To: Boris Zhmurov; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm

On Mon, Nov 28, 2016 at 10:16:33PM +0300, Boris Zhmurov wrote:
> Paul E. McKenney 28/11/16 18:05:
> > On Mon, Nov 28, 2016 at 05:40:48PM +0300, Boris Zhmurov wrote:
> >> Paul E. McKenney 28/11/16 17:34:
> >>
> >>
> >>>> So Paul, I've dropped "mm: Prevent shrink_node_memcg() RCU CPU stall
> >>>> warnings" patch, and stalls got back (attached).
> >>>>
> >>>> With this patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
> >>>> your tree stalls gone. Looks like that.
> >>>
> >>> So with only this commit and no other commit or configuration adjustment,
> >> everything works?  Or is the solution this commit and some other stuff?
> >>>
> >>> The reason I ask is that if just this commit does the trick, I should
> >>> drop the others.
> >>
> >> I'd like to ask for some more time to make sure this is it.
> >> Approximately 2 or 3 days.
> > 
> > Works for me!
> > 
> > 							Thanx, Paul
> 
> 
> FYI.
> Some more stalls with mm-prevent-shrink_node-RCU-CPU-stall-warning.patch
> and without mm-prevent-shrink_node_memcg-RCU-CPU-stall-warnings.patch.

Thank you for the info!  Is this one needed?  2d66cccd7343 ("mm: Prevent
__alloc_pages_nodemask() RCU CPU stall warnings")

It is causing trouble in other tests.  If it is needed, I must fix it,
if not, I can happily drop it.  ;-)

							Thanx, Paul


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-28 12:26                     ` Paul Menzel
@ 2016-11-30 10:28                       ` Donald Buczek
  2016-11-30 11:09                         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Donald Buczek @ 2016-11-30 10:28 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Paul Menzel, dvteam, linux-mm, linux-kernel, Josh Triplett

On 11/28/16 13:26, Paul Menzel wrote:
> [...]
>
> On 11/28/16 12:04, Michal Hocko wrote:
>> [...]
>>
>> OK, so one of the stall is reported at
>> [118077.988410] INFO: rcu_sched detected stalls on CPUs/tasks:
>> [118077.988416]     1-...: (181 ticks this GP) idle=6d5/140000000000000/0 softirq=46417663/46417663 fqs=10691
>> [118077.988417]     (detected by 4, t=60002 jiffies, g=11845915, c=11845914, q=46475)
>> [118077.988421] Task dump for CPU 1:
>> [118077.988421] kswapd1         R  running task        0    86      2 0x00000008
>> [118077.988424]  ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
>> [118077.988426]  0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
>> [118077.988428]  ffffffff811345f5 ffff88080ad87da0 ffff88100c1e5200 ffff88080ad87dd0
>> [118077.988430] Call Trace:
>> [118077.988436]  [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
>> [118077.988438]  [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
>> [118077.988440]  [<ffffffff81135642>] ? kswapd+0x342/0x6b0
>>
>> the interesting part of the traces would be around the same time:
>>         clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
>>          kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
>>          kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
>>          kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
>>          kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
>>          kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
>>          kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
>>          kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
>> [...]
>>          kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
>>
>> Note the Need resched flag. The IRQ off part is expected because we are
>> holding the LRU lock which is IRQ safe.

Hmmm. With the lock held, preemption is disabled. If we are in that
state for some time, I'd expect need_resched to be set simply because
the time quantum expired. But:

The call stack always has

 > [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870

which translates to

 > (gdb) list *0xffffffff811345f5
 > 0xffffffff811345f5 is in shrink_node_memcg (mm/vmscan.c:2065).
 > 2060    static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 > 2061                     struct lruvec *lruvec, struct scan_control *sc)
 > 2062    {
 > 2063        if (is_active_lru(lru)) {
 > 2064            if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
 > 2065                shrink_active_list(nr_to_scan, lruvec, sc, lru);
 > 2066            return 0;
 > 2067        }
 > 2068
 > 2069        return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);

So we are in shrink_active_list. I made a small change without keeping 
the old vmlinux and the addresses are off by 16 bytes, but it can be 
verified exactly on another machine:

 > buczek@void:/scratch/local/linux-4.8.10-121.x86_64/source$ grep shrink_node_memcg /var/log/messages
 > [...]
 > void kernel: [508779.136016]  [<ffffffff8114833a>] ? shrink_node_memcg+0x60a/0x870
 > (gdb) disas 0xffffffff8114833a
 > [...]
 >   0xffffffff81148330 <+1536>:    mov    %r10,0x38(%rsp)
 >   0xffffffff81148335 <+1541>:    callq  0xffffffff81147a00 <shrink_active_list>
 >   0xffffffff8114833a <+1546>:    mov    0x38(%rsp),%r10
 >   0xffffffff8114833f <+1551>:    jmpq   0xffffffff81147f80 <shrink_node_memcg+592>
 >   0xffffffff81148344 <+1556>:    mov    %r13,0x78(%r12)

shrink_active_list gets and releases the spinlock and calls 
cond_resched(). This should give other tasks a chance to run. Just as an 
experiment, I'm trying

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
         spin_unlock_irq(&pgdat->lru_lock);

         while (!list_empty(&l_hold)) {
-               cond_resched();
+               cond_resched_rcu_qs();
                 page = lru_to_page(&l_hold);
                 list_del(&page->lru);

and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see. 
Is preemption disabled for another reason?

Regards
   Donald

>> That is not a problem because
>> the lock is only held for SWAP_CLUSTER_MAX pages at maximum. It is also
>> interesting to see that we have scanned only 1303 pages during that 1
>> minute. That would be dead slow. None of them were good enough for the
>> reclaim but that doesn't sound like a problem. The trace simply suggests
>> that the reclaim was preempted by something else. Otherwise I cannot
>> imagine such a slow scanning.
>>
>> Is it possible that something else is hogging the CPU and the RCU just
>> happens to blame kswapd which is running in the standard user process
>> context?
>
> From looking at the monitoring graphs, there were always enough CPU
> resources available. The machine has 12x E5-2630 @ 2.30GHz. So that
> shouldn't have been a problem.
>
>
> Kind regards,
>
> Paul Menzel


-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 10:28                       ` Donald Buczek
@ 2016-11-30 11:09                         ` Michal Hocko
  2016-11-30 11:43                           ` Donald Buczek
  2016-11-30 11:53                           ` Paul E. McKenney
  0 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2016-11-30 11:09 UTC (permalink / raw)
  To: Donald Buczek
  Cc: Paul Menzel, dvteam, linux-mm, linux-kernel, Josh Triplett,
	Paul E. McKenney

[CCing Paul]

On Wed 30-11-16 11:28:34, Donald Buczek wrote:
[...]
> shrink_active_list gets and releases the spinlock and calls cond_resched().
> This should give other tasks a chance to run. Just as an experiment, I'm
> trying
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         spin_unlock_irq(&pgdat->lru_lock);
> 
>         while (!list_empty(&l_hold)) {
> -               cond_resched();
> +               cond_resched_rcu_qs();
>                 page = lru_to_page(&l_hold);
>                 list_del(&page->lru);
> 
> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.

This is really interesting! Is it possible that the RCU stall detector
is somehow confused?

> Is preemption disabled for another reason?

I do not think so. I will have to double check the code but this is a
standard sleepable context. Just wondering what is the PREEMPT
configuration here?
-- 
Michal Hocko
SUSE Labs


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 11:09                         ` Michal Hocko
@ 2016-11-30 11:43                           ` Donald Buczek
  2016-12-02  9:14                             ` Donald Buczek
  2016-11-30 11:53                           ` Paul E. McKenney
  1 sibling, 1 reply; 44+ messages in thread
From: Donald Buczek @ 2016-11-30 11:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul Menzel, dvteam, linux-mm, linux-kernel, Josh Triplett,
	Paul E. McKenney

On 11/30/16 12:09, Michal Hocko wrote:
> [CCing Paul]
>
> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> [...]
>> shrink_active_list gets and releases the spinlock and calls cond_resched().
>> This should give other tasks a chance to run. Just as an experiment, I'm
>> trying
>>
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>>          spin_unlock_irq(&pgdat->lru_lock);
>>
>>          while (!list_empty(&l_hold)) {
>> -               cond_resched();
>> +               cond_resched_rcu_qs();
>>                  page = lru_to_page(&l_hold);
>>                  list_del(&page->lru);
>>
>> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
> This is really interesting! Is it possible that the RCU stall detector
> is somehow confused?

Wait... 21 hours is not yet a test result.

>> Is preemption disabled for another reason?
> I do not think so. I will have to double check the code but this is a
> standard sleepable context. Just wondering what is the PREEMPT
> configuration here?

buczek@null:~$ zcat /proc/config.gz |grep PREE
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

Thanks
   Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 11:09                         ` Michal Hocko
  2016-11-30 11:43                           ` Donald Buczek
@ 2016-11-30 11:53                           ` Paul E. McKenney
  2016-11-30 11:54                             ` Paul E. McKenney
  2016-11-30 13:19                             ` Michal Hocko
  1 sibling, 2 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 11:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Donald Buczek, Paul Menzel, dvteam, linux-mm, linux-kernel,
	Josh Triplett

On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
> [CCing Paul]
> 
> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> [...]
> > shrink_active_list gets and releases the spinlock and calls cond_resched().
> > This should give other tasks a chance to run. Just as an experiment, I'm
> > trying
> > 
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> >         spin_unlock_irq(&pgdat->lru_lock);
> > 
> >         while (!list_empty(&l_hold)) {
> > -               cond_resched();
> > +               cond_resched_rcu_qs();
> >                 page = lru_to_page(&l_hold);
> >                 list_del(&page->lru);
> > 
> > and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
> 
> This is really interesting! Is it possible that the RCU stall detector
> is somehow confused?

No, it is not confused.  Again, cond_resched() is not a quiescent
state unless it does a context switch.  Therefore, if the task running
in that loop was the only runnable task on its CPU, cond_resched()
would -never- provide RCU with a quiescent state.

In contrast, cond_resched_rcu_qs() unconditionally provides RCU
with a quiescent state (hence the _rcu_qs in its name), regardless
of whether or not a context switch happens.

It is therefore expected behavior that this change might prevent
RCU CPU stall warnings.
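
[Editorial note: the distinction can be illustrated with a userspace toy model — not kernel code; the real quiescent-state accounting lives in the RCU core — of what Paul describes:]

```python
# Toy model: with CONFIG_PREEMPT=n, a grace period needs every CPU to
# report a quiescent state (QS).  cond_resched() only reports one
# indirectly, via an actual context switch; cond_resched_rcu_qs()
# reports one unconditionally.

def loop_reports_qs(other_runnable_tasks: int, use_rcu_qs: bool) -> bool:
    """Does one pass through a reclaim-style loop report a QS?"""
    if use_rcu_qs:
        return True                     # cond_resched_rcu_qs(): unconditional QS
    return other_runnable_tasks > 0     # cond_resched(): QS only if we switch

# kswapd is the sole runnable task on its CPU: cond_resched() never
# context-switches, so no QS is ever reported -> RCU stall warning.
assert not loop_reports_qs(other_runnable_tasks=0, use_rcu_qs=False)
# Same situation with cond_resched_rcu_qs(): QS reported, no stall.
assert loop_reports_qs(other_runnable_tasks=0, use_rcu_qs=True)
```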

							Thanx, Paul

> > Is preemption disabled for another reason?
> 
> I do not think so. I will have to double check the code but this is a
> standard sleepable context. Just wondering what is the PREEMPT
> configuration here?
> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 11:53                           ` Paul E. McKenney
@ 2016-11-30 11:54                             ` Paul E. McKenney
  2016-11-30 12:31                               ` Paul Menzel
  2016-11-30 13:19                             ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 11:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Donald Buczek, Paul Menzel, dvteam, linux-mm, linux-kernel,
	Josh Triplett

On Wed, Nov 30, 2016 at 03:53:20AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
> > [CCing Paul]
> > 
> > On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> > [...]
> > > shrink_active_list gets and releases the spinlock and calls cond_resched().
> > > This should give other tasks a chance to run. Just as an experiment, I'm
> > > trying
> > > 
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
> > > nr_to_scan,
> > >         spin_unlock_irq(&pgdat->lru_lock);
> > > 
> > >         while (!list_empty(&l_hold)) {
> > > -               cond_resched();
> > > +               cond_resched_rcu_qs();
> > >                 page = lru_to_page(&l_hold);
> > >                 list_del(&page->lru);
> > > 
> > > and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
> > 
> > This is really interesting! Is it possible that the RCU stall detector
> > is somehow confused?
> 
> No, it is not confused.  Again, cond_resched() is not a quiescent
> state unless it does a context switch.  Therefore, if the task running
> in that loop was the only runnable task on its CPU, cond_resched()
> would -never- provide RCU with a quiescent state.
> 
> In contrast, cond_resched_rcu_qs() unconditionally provides RCU
> with a quiescent state (hence the _rcu_qs in its name), regardless
> of whether or not a context switch happens.
> 
> It is therefore expected behavior that this change might prevent
> RCU CPU stall warnings.

I should add...  This assumes that CONFIG_PREEMPT=n.  So what is
CONFIG_PREEMPT?

							Thanx, Paul

> > > Is preemption disabled for another reason?
> > 
> > I do not think so. I will have to double check the code but this is a
> > standard sleepable context. Just wondering what is the PREEMPT
> > configuration here?
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 11:54                             ` Paul E. McKenney
@ 2016-11-30 12:31                               ` Paul Menzel
  2016-11-30 14:31                                 ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menzel @ 2016-11-30 12:31 UTC (permalink / raw)
  To: Paul E. McKenney, Michal Hocko
  Cc: Donald Buczek, dvteam, linux-mm, linux-kernel, Josh Triplett

On 11/30/16 12:54, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 03:53:20AM -0800, Paul E. McKenney wrote:
>> On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
>>> [CCing Paul]
>>>
>>> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
>>> [...]
>>>> shrink_active_list gets and releases the spinlock and calls cond_resched().
>>>> This should give other tasks a chance to run. Just as an experiment, I'm
>>>> trying
>>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
>>>> nr_to_scan,
>>>>         spin_unlock_irq(&pgdat->lru_lock);
>>>>
>>>>         while (!list_empty(&l_hold)) {
>>>> -               cond_resched();
>>>> +               cond_resched_rcu_qs();
>>>>                 page = lru_to_page(&l_hold);
>>>>                 list_del(&page->lru);
>>>>
>>>> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
>>>
>>> This is really interesting! Is it possible that the RCU stall detector
>>> is somehow confused?
>>
>> No, it is not confused.  Again, cond_resched() is not a quiescent
>> state unless it does a context switch.  Therefore, if the task running
>> in that loop was the only runnable task on its CPU, cond_resched()
>> would -never- provide RCU with a quiescent state.
>>
>> In contrast, cond_resched_rcu_qs() unconditionally provides RCU
>> with a quiescent state (hence the _rcu_qs in its name), regardless
>> of whether or not a context switch happens.
>>
>> It is therefore expected behavior that this change might prevent
>> RCU CPU stall warnings.
> 
> I should add...  This assumes that CONFIG_PREEMPT=n.  So what is
> CONFIG_PREEMPT?

It's not selected.

```
# CONFIG_PREEMPT is not set
```

>>>> Is preemption disabled for another reason?
>>>
>>> I do not think so. I will have to double check the code but this is a
>>> standard sleepable context. Just wondering what is the PREEMPT
>>> configuration here?


Kind regards,

Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 11:53                           ` Paul E. McKenney
  2016-11-30 11:54                             ` Paul E. McKenney
@ 2016-11-30 13:19                             ` Michal Hocko
  2016-11-30 14:29                               ` Paul E. McKenney
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-11-30 13:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Donald Buczek, Paul Menzel, dvteam, linux-mm, linux-kernel,
	Josh Triplett

On Wed 30-11-16 03:53:20, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
> > [CCing Paul]
> > 
> > On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> > [...]
> > > shrink_active_list gets and releases the spinlock and calls cond_resched().
> > > This should give other tasks a chance to run. Just as an experiment, I'm
> > > trying
> > > 
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
> > > nr_to_scan,
> > >         spin_unlock_irq(&pgdat->lru_lock);
> > > 
> > >         while (!list_empty(&l_hold)) {
> > > -               cond_resched();
> > > +               cond_resched_rcu_qs();
> > >                 page = lru_to_page(&l_hold);
> > >                 list_del(&page->lru);
> > > 
> > > and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
> > 
> > This is really interesting! Is it possible that the RCU stall detector
> > is somehow confused?
> 
> No, it is not confused.  Again, cond_resched() is not a quiescent
> state unless it does a context switch.  Therefore, if the task running
> in that loop was the only runnable task on its CPU, cond_resched()
> would -never- provide RCU with a quiescent state.

Sorry for being dense here. But why cannot we hide the QS handling into
cond_resched()? I mean doesn't every current usage of cond_resched
suffer from the same problem wrt RCU stalls?

> In contrast, cond_resched_rcu_qs() unconditionally provides RCU
> with a quiescent state (hence the _rcu_qs in its name), regardless
> of whether or not a context switch happens.
> 
> It is therefore expected behavior that this change might prevent
> RCU CPU stall warnings.
> 
> 							Thanx, Paul

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 13:19                             ` Michal Hocko
@ 2016-11-30 14:29                               ` Paul E. McKenney
  2016-11-30 16:38                                 ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 14:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Donald Buczek, Paul Menzel, dvteam, linux-mm, linux-kernel,
	Josh Triplett, peterz

On Wed, Nov 30, 2016 at 02:19:10PM +0100, Michal Hocko wrote:
> On Wed 30-11-16 03:53:20, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
> > > [CCing Paul]
> > > 
> > > On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> > > [...]
> > > > shrink_active_list gets and releases the spinlock and calls cond_resched().
> > > > This should give other tasks a chance to run. Just as an experiment, I'm
> > > > trying
> > > > 
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
> > > > nr_to_scan,
> > > >         spin_unlock_irq(&pgdat->lru_lock);
> > > > 
> > > >         while (!list_empty(&l_hold)) {
> > > > -               cond_resched();
> > > > +               cond_resched_rcu_qs();
> > > >                 page = lru_to_page(&l_hold);
> > > >                 list_del(&page->lru);
> > > > 
> > > > and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
> > > 
> > > This is really interesting! Is it possible that the RCU stall detector
> > > is somehow confused?
> > 
> > No, it is not confused.  Again, cond_resched() is not a quiescent
> > state unless it does a context switch.  Therefore, if the task running
> > in that loop was the only runnable task on its CPU, cond_resched()
> > would -never- provide RCU with a quiescent state.
> 
> Sorry for being dense here. But why cannot we hide the QS handling into
> cond_resched()? I mean doesn't every current usage of cond_resched
> suffer from the same problem wrt RCU stalls?

We can, and you are correct that cond_resched() does not unconditionally
supply RCU quiescent states, and never has.  Last time I tried to add
cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
but perhaps it is time to try again.

One of the challenges is that there are two different timeframes.
If we want CONFIG_PREEMPT=n kernels to have millisecond-level scheduling
latencies, we need a cond_resched() more than once per millisecond, and
the usual uncertainties will mean more like once per hundred microseconds
or so.  In contrast, the occasional 100-millisecond RCU grace period when
under heavy load is normally not considered to be a problem, which means
that a cond_resched_rcu_qs() every 10 milliseconds or so is just fine.

Which means that cond_resched() is much more sensitive to overhead
than is cond_resched_rcu_qs().

No reason not to give it another try, though!  (Adding Peter Zijlstra
to CC for his reactions.)

Right now, the added overhead is a function call, two tests of per-CPU
variables, one increment of a per-CPU variable, and a barrier() before
and after.  I could probably combine the tests, but I do need at least
one test.  I cannot see how I can eliminate either barrier().  I might
be able to pull the increment under the test.

The patch below is instead very straightforward, avoiding any
optimizations.  Untested, probably does not even build.

Failing this approach, the rule is as follows:

1.	Add cond_resched() to in-kernel loops that cause excessive
	scheduling latencies.

2.	Add cond_resched_rcu_qs() to in-kernel loops that cause
	RCU CPU stall warnings.

							Thanx, Paul

> > In contrast, cond_resched_rcu_qs() unconditionally provides RCU
> > with a quiescent state (hence the _rcu_qs in its name), regardless
> > of whether or not a context switch happens.
> > 
> > It is therefore expected behavior that this change might prevent
> > RCU CPU stall warnings.

------------------------------------------------------------------------

commit d7100358d066cd7d64301a2da161390e9f4aa63f
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Nov 30 06:24:30 2016 -0800

    sched,rcu: Make cond_resched() provide RCU quiescent state
    
    There is some confusion as to which of cond_resched() or
    cond_resched_rcu_qs() should be added to long in-kernel loops.
    This commit therefore eliminates the decision by adding RCU
    quiescent states to cond_resched().
    
    Warning: This is a prototype.  For example, it does not correctly
    handle Tasks RCU.  Which is OK for the moment, given that no one
    actually uses Tasks RCU yet.
    
    Reported-by: Michal Hocko <mhocko@kernel.org>
    Not-yet-signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec92..ccdb6064884e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3308,10 +3308,11 @@ static inline int signal_pending_state(long state, struct task_struct *p)
  * cond_resched_lock() will drop the spinlock before scheduling,
  * cond_resched_softirq() will enable bhs before scheduling.
  */
+void rcu_all_qs(void);
 #ifndef CONFIG_PREEMPT
 extern int _cond_resched(void);
 #else
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void) { rcu_all_qs(); return 0; }
 #endif
 
 #define cond_resched() ({			\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1ab00a..40b690813b80 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4906,6 +4906,7 @@ int __sched _cond_resched(void)
 		preempt_schedule_common();
 		return 1;
 	}
+	rcu_all_qs();
 	return 0;
 }
 EXPORT_SYMBOL(_cond_resched);


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 12:31                               ` Paul Menzel
@ 2016-11-30 14:31                                 ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 14:31 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Michal Hocko, Donald Buczek, dvteam, linux-mm, linux-kernel,
	Josh Triplett

On Wed, Nov 30, 2016 at 01:31:37PM +0100, Paul Menzel wrote:
> On 11/30/16 12:54, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 03:53:20AM -0800, Paul E. McKenney wrote:
> >> On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
> >>> [CCing Paul]
> >>>
> >>> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> >>> [...]
> >>>> shrink_active_list gets and releases the spinlock and calls cond_resched().
> >>>> This should give other tasks a chance to run. Just as an experiment, I'm
> >>>> trying
> >>>>
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
> >>>> nr_to_scan,
> >>>>         spin_unlock_irq(&pgdat->lru_lock);
> >>>>
> >>>>         while (!list_empty(&l_hold)) {
> >>>> -               cond_resched();
> >>>> +               cond_resched_rcu_qs();
> >>>>                 page = lru_to_page(&l_hold);
> >>>>                 list_del(&page->lru);
> >>>>
> >>>> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
> >>>
> >>> This is really interesting! Is it possible that the RCU stall detector
> >>> is somehow confused?
> >>
> >> No, it is not confused.  Again, cond_resched() is not a quiescent
> >> state unless it does a context switch.  Therefore, if the task running
> >> in that loop was the only runnable task on its CPU, cond_resched()
> >> would -never- provide RCU with a quiescent state.
> >>
> >> In contrast, cond_resched_rcu_qs() unconditionally provides RCU
> >> with a quiescent state (hence the _rcu_qs in its name), regardless
> >> of whether or not a context switch happens.
> >>
> >> It is therefore expected behavior that this change might prevent
> >> RCU CPU stall warnings.
> > 
> > I should add...  This assumes that CONFIG_PREEMPT=n.  So what is
> > CONFIG_PREEMPT?
> 
> It's not selected.
> 
> ```
> # CONFIG_PREEMPT is not set
> ```

Thank you for the info!

As noted elsewhere in this thread, there are other ways to get stalls,
including the long irq-disabled execution that Michal suspects.

							Thanx, Paul

> >>>> Is preemption disabled for another reason?
> >>>
> >>> I do not think so. I will have to double check the code but this is a
> >>> standard sleepable context. Just wondering what is the PREEMPT
> >>> configuration here?
> 
> 
> Kind regards,
> 
> Paul
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 14:29                               ` Paul E. McKenney
@ 2016-11-30 16:38                                 ` Peter Zijlstra
  2016-11-30 17:02                                   ` Paul E. McKenney
  2016-11-30 17:05                                   ` Michal Hocko
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2016-11-30 16:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> We can, and you are correct that cond_resched() does not unconditionally
> supply RCU quiescent states, and never has.  Last time I tried to add
> cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> but perhaps it is time to try again.

Well, you got told: "ARRGH my benchmark goes all regress", or something
along those lines. Didn't we recently dig out those commits for some
reason or other?

Finding out what benchmark that was and running it against this patch
would make sense.

Also, I seem to have missed, why are we going through this again?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 16:38                                 ` Peter Zijlstra
@ 2016-11-30 17:02                                   ` Paul E. McKenney
  2016-11-30 17:05                                   ` Michal Hocko
  1 sibling, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed, Nov 30, 2016 at 05:38:20PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> > We can, and you are correct that cond_resched() does not unconditionally
> > supply RCU quiescent states, and never has.  Last time I tried to add
> > cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> > but perhaps it is time to try again.
> 
> Well, you got told: "ARRGH my benchmark goes all regress", or something
> along those lines. Didn't we recently dig out those commits for some
> reason or other?

Were "those commits" the benchmark or putting cond_resched_rcu_qs()
functionality into cond_resched()?  Either way, no idea.

> Finding out what benchmark that was and running it against this patch
> would make sense.

Agreed, especially given that I believe cond_resched_rcu_qs() is lighter
weight than it used to be.  No idea what benchmarks they were, though.

> Also, I seem to have missed, why are we going through this again?

People are running workloads that force long-running loops in the kernel,
which get them RCU CPU stall warning messages.  My reaction has been
to insert cond_resched_rcu_qs() as needed, and Michal wondered why
cond_resched() couldn't just handle both scheduling latency and RCU
quiescent states.  I remembered trying it, but not what the issue was.

So I posted the patch assuming that I would eventually either find out
what the issue was or that the issue no longer applied.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 16:38                                 ` Peter Zijlstra
  2016-11-30 17:02                                   ` Paul E. McKenney
@ 2016-11-30 17:05                                   ` Michal Hocko
  2016-11-30 17:23                                     ` Paul E. McKenney
  2016-11-30 17:50                                     ` Peter Zijlstra
  1 sibling, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2016-11-30 17:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed 30-11-16 17:38:20, Peter Zijlstra wrote:
> On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> > We can, and you are correct that cond_resched() does not unconditionally
> > supply RCU quiescent states, and never has.  Last time I tried to add
> > cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> > but perhaps it is time to try again.
> 
> Well, you got told: "ARRGH my benchmark goes all regress", or something
> along those lines. Didn't we recently dig out those commits for some
> reason or other?
> 
> Finding out what benchmark that was and running it against this patch
> would make sense.
> 
> Also, I seem to have missed, why are we going through this again?

Well, the point I've brought that up is because having basically two
APIs for cond_resched is more than confusing. Basically all longer in
kernel loops do cond_resched() but it seems that this will not help
silence the RCU lockup detector in rare cases where nothing really wants to
schedule. I am really not sure whether we want to sprinkle
cond_resched_rcu_qs at random places just to silence RCU detector...

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 17:05                                   ` Michal Hocko
@ 2016-11-30 17:23                                     ` Paul E. McKenney
  2016-11-30 17:34                                       ` Michal Hocko
  2016-11-30 17:50                                     ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 17:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed, Nov 30, 2016 at 06:05:57PM +0100, Michal Hocko wrote:
> On Wed 30-11-16 17:38:20, Peter Zijlstra wrote:
> > On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> > > We can, and you are correct that cond_resched() does not unconditionally
> > > supply RCU quiescent states, and never has.  Last time I tried to add
> > > cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> > > but perhaps it is time to try again.
> > 
> > Well, you got told: "ARRGH my benchmark goes all regress", or something
> > along those lines. Didn't we recently dig out those commits for some
> > reason or other?
> > 
> > Finding out what benchmark that was and running it against this patch
> > would make sense.
> > 
> > Also, I seem to have missed, why are we going through this again?
> 
> Well, the point I've brought that up is because having basically two
> APIs for cond_resched is more than confusing. Basically all longer in
> kernel loops do cond_resched() but it seems that this will not help
> silence the RCU lockup detector in rare cases where nothing really wants to
> schedule. I am really not sure whether we want to sprinkle
> cond_resched_rcu_qs at random places just to silence RCU detector...

Just in case there is any doubt on this point, any patch of mine adding
cond_resched_rcu_qs() functionality to cond_resched() cannot go upstream
without Peter's Acked-by.

Or did you have some other solution in mind?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 17:23                                     ` Paul E. McKenney
@ 2016-11-30 17:34                                       ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-11-30 17:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed 30-11-16 09:23:55, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 06:05:57PM +0100, Michal Hocko wrote:
> > On Wed 30-11-16 17:38:20, Peter Zijlstra wrote:
> > > On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> > > > We can, and you are correct that cond_resched() does not unconditionally
> > > > supply RCU quiescent states, and never has.  Last time I tried to add
> > > > cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> > > > but perhaps it is time to try again.
> > > 
> > > Well, you got told: "ARRGH my benchmark goes all regress", or something
> > > along those lines. Didn't we recently dig out those commits for some
> > > reason or other?
> > > 
> > > Finding out what benchmark that was and running it against this patch
> > > would make sense.
> > > 
> > > Also, I seem to have missed, why are we going through this again?
> > 
> > Well, the point I've brought that up is because having basically two
> > APIs for cond_resched is more than confusing. Basically all longer in
> > kernel loops do cond_resched() but it seems that this will not help the
> > silence RCU lockup detector in rare cases where nothing really wants to
> > schedule. I am really not sure whether we want to sprinkle
> > cond_resched_rcu_qs at random places just to silence RCU detector...
> 
> Just in case there is any doubt on this point, any patch of mine adding
> cond_resched_rcu_qs() functionality to cond_resched() cannot go upstream
> without Peter's Acked-by.

Yeah, that is clear to me. I just wanted to clarify the "why are we
going through this again" part ;)
 
> Or did you have some other solution in mind?

Not really. The fact that cond_resched() cannot silence the RCU stall
detector under some circumstances is sad. I believe we shouldn't
have two different APIs to control scheduling and RCU latencies because
that just asks for whack-a-mole games and some level of confusion...
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-28 15:05                 ` Paul E. McKenney
  2016-11-28 19:16                   ` Boris Zhmurov
@ 2016-11-30 17:41                   ` Boris Zhmurov
  2016-11-30 17:48                     ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Boris Zhmurov @ 2016-11-30 17:41 UTC (permalink / raw)
  To: paulmck; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm

Paul E. McKenney 28/11/16 18:05:

>>>> So Paul, I've dropped "mm: Prevent shrink_node_memcg() RCU CPU stall
>>>> warnings" patch, and stalls got back (attached).
>>>>
>>>> With this patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
>>>> your tree stalls gone. Looks like that.
>>>
>>> So with only this commit and no other commit or configuration adjustment,
>>> everything works?  Or is the solution this commit and some other stuff?
>>>
>>> The reason I ask is that if just this commit does the trick, I should
>>> drop the others.
>>
>> I'd like to ask for some more time to make sure this is it.
>> Approximately 2 or 3 days.
> 
> Works for me!


Well, after some testing I may say, that your patch:
---------------------8<-----------------------------------
commit 7cebc6b63bf75db48cb19a94564c39294fd40959
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Fri Nov 25 12:48:10 2016 -0800

   mm: Prevent shrink_node_memcg() RCU CPU stall warnings
---------------------8<-----------------------------------

fixes stall warning and dmesg is clean now.

-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 17:41                   ` Boris Zhmurov
@ 2016-11-30 17:48                     ` Michal Hocko
  2016-11-30 18:12                       ` Boris Zhmurov
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-11-30 17:48 UTC (permalink / raw)
  To: Boris Zhmurov; +Cc: paulmck, Paul Menzel, Donald Buczek, linux-mm

On Wed 30-11-16 20:41:20, Boris Zhmurov wrote:
> Paul E. McKenney 28/11/16 18:05:
> 
> >>>> So Paul, I've dropped "mm: Prevent shrink_node_memcg() RCU CPU stall
> >>>> warnings" patch, and stalls got back (attached).
> >>>>
> >>>> With this patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
> >>>> your tree stalls gone. Looks like that.
> >>>
> >>> So with only this commit and no other commit or configuration adjustment,
> >>> everything works?  Or is the solution this commit and some other stuff?
> >>>
> >>> The reason I ask is that if just this commit does the trick, I should
> >>> drop the others.
> >>
> >> I'd like to ask for some more time to make sure this is it.
> >> Approximately 2 or 3 days.
> > 
> > Works for me!
> 
> 
> Well, after some testing I may say, that your patch:
> ---------------------8<-----------------------------------
> commit 7cebc6b63bf75db48cb19a94564c39294fd40959
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Fri Nov 25 12:48:10 2016 -0800
> 
>    mm: Prevent shrink_node_memcg() RCU CPU stall warnings
> ---------------------8<-----------------------------------
> 
> fixes stall warning and dmesg is clean now.

Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 17:05                                   ` Michal Hocko
  2016-11-30 17:23                                     ` Paul E. McKenney
@ 2016-11-30 17:50                                     ` Peter Zijlstra
  2016-11-30 19:40                                       ` Paul E. McKenney
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2016-11-30 17:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul E. McKenney, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed, Nov 30, 2016 at 06:05:57PM +0100, Michal Hocko wrote:
> On Wed 30-11-16 17:38:20, Peter Zijlstra wrote:
> > On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> > > We can, and you are correct that cond_resched() does not unconditionally
> > > supply RCU quiescent states, and never has.  Last time I tried to add
> > > cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> > > but perhaps it is time to try again.
> > 
> > Well, you got told: "ARRGH my benchmark goes all regress", or something
> > along those lines. Didn't we recently dig out those commits for some
> > reason or other?
> > 
> > Finding out what benchmark that was and running it against this patch
> > would make sense.

See commit:

  4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks for RCU")

Someone actually wrote down what the problem was.

> > Also, I seem to have missed, why are we going through this again?
> 
> Well, the reason I brought that up is that having basically two
> APIs for cond_resched() is more than confusing. Basically all longer
> in-kernel loops do cond_resched(), but it seems that this will not
> help silence the RCU lockup detector in rare cases where nothing
> really wants to schedule. I am really not sure whether we want to
> sprinkle cond_resched_rcu_qs() at random places just to silence the
> RCU detector...

Right.. now, this is obviously all PREEMPT=n code, which therefore also
implies this is rcu-sched.

Paul, now doesn't rcu-sched, when the grace-period has been long in
coming, try and force it? And doesn't that forcing include prodding CPUs
with resched_cpu() ?

I'm thinking not, because if it did, that would make cond_resched()
actually schedule, which would then call into rcu_note_context_switch()
which would then make RCU progress, no?



* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 17:48                     ` Michal Hocko
@ 2016-11-30 18:12                       ` Boris Zhmurov
  2016-11-30 18:25                         ` Michal Hocko
  2016-11-30 19:42                         ` Paul E. McKenney
  0 siblings, 2 replies; 44+ messages in thread
From: Boris Zhmurov @ 2016-11-30 18:12 UTC (permalink / raw)
  To: Michal Hocko; +Cc: paulmck, Paul Menzel, Donald Buczek, linux-mm

Michal Hocko 30/11/16 20:48:

>> Well, after some testing I can say that your patch:
>> ---------------------8<-----------------------------------
>> commit 7cebc6b63bf75db48cb19a94564c39294fd40959
>> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> Date:   Fri Nov 25 12:48:10 2016 -0800
>>
>>    mm: Prevent shrink_node_memcg() RCU CPU stall warnings
>> ---------------------8<-----------------------------------
>>
>> fixes the stall warnings, and dmesg is clean now.
> 
> Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?

I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
I can try another portion of patches, no problem :)

-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 18:12                       ` Boris Zhmurov
@ 2016-11-30 18:25                         ` Michal Hocko
  2016-11-30 18:26                           ` Boris Zhmurov
  2016-12-01 18:10                           ` Boris Zhmurov
  2016-11-30 19:42                         ` Paul E. McKenney
  1 sibling, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2016-11-30 18:25 UTC (permalink / raw)
  To: Boris Zhmurov; +Cc: paulmck, Paul Menzel, Donald Buczek, linux-mm

On Wed 30-11-16 21:12:52, Boris Zhmurov wrote:
> Michal Hocko 30/11/16 20:48:
> 
> >> Well, after some testing I can say that your patch:
> >> ---------------------8<-----------------------------------
> >> commit 7cebc6b63bf75db48cb19a94564c39294fd40959
> >> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> Date:   Fri Nov 25 12:48:10 2016 -0800
> >>
> >>    mm: Prevent shrink_node_memcg() RCU CPU stall warnings
> >> ---------------------8<-----------------------------------
> >>
> >> fixes the stall warnings, and dmesg is clean now.
> > 
> > Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
> 
> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
> I can try another portion of patches, no problem :)

Replacing cond_resched_rcu_qs() with cond_resched() in shrink_node_memcg()
would be really helpful for telling whether we are missing a real
scheduling point or whether something more serious is going on here.
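
Concretely, the requested experiment amounts to a one-line change on top of Paul's shrink_node_memcg() patch; the hunk below is only a sketch (the surrounding context depends on the tree with that patch applied):

```diff
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ shrink_node_memcg() -- context illustrative @@
-		cond_resched_rcu_qs();
+		cond_resched();
```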
-- 
Michal Hocko
SUSE Labs


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 18:25                         ` Michal Hocko
@ 2016-11-30 18:26                           ` Boris Zhmurov
  2016-12-01 18:10                           ` Boris Zhmurov
  1 sibling, 0 replies; 44+ messages in thread
From: Boris Zhmurov @ 2016-11-30 18:26 UTC (permalink / raw)
  To: Michal Hocko; +Cc: paulmck, Paul Menzel, Donald Buczek, linux-mm

Michal Hocko 30/11/16 21:25:
>> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
>> I can try another portion of patches, no problem :)
> 
> Replacing cond_resched_rcu_qs() with cond_resched() in shrink_node_memcg()
> would be really helpful for telling whether we are missing a real
> scheduling point or whether something more serious is going on here.

Ok, I'll try that.


-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 17:50                                     ` Peter Zijlstra
@ 2016-11-30 19:40                                       ` Paul E. McKenney
  2016-12-01  5:30                                         ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 19:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed, Nov 30, 2016 at 06:50:16PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 30, 2016 at 06:05:57PM +0100, Michal Hocko wrote:
> > On Wed 30-11-16 17:38:20, Peter Zijlstra wrote:
> > > On Wed, Nov 30, 2016 at 06:29:55AM -0800, Paul E. McKenney wrote:
> > > > We can, and you are correct that cond_resched() does not unconditionally
> > > > supply RCU quiescent states, and never has.  Last time I tried to add
> > > > cond_resched_rcu_qs() semantics to cond_resched(), I got told "no",
> > > > but perhaps it is time to try again.
> > > 
> > > Well, you got told: "ARRGH my benchmark goes all regress", or something
> > > along those lines. Didn't we recently dig out those commits for some
> > > reason or other?
> > > 
> > > Finding out what benchmark that was and running it against this patch
> > > would make sense.
> 
> See commit:
> 
>   4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks for RCU")
> 
> Someone actually wrote down what the problem was.

Don't worry, it won't happen again.  ;-)

OK, so the regressions were in the "open1" test of Anton Blanchard's
"will it scale" suite, and were due to faster (and thus more) grace
periods rather than path length.

I could likely counter the grace-period speedup by regulating the rate
at which the grace-period machinery pays attention to the rcu_qs_ctr
per-CPU variable.  Actually, this looks pretty straightforward (famous
last words).  But see patch below, which is untested and probably
completely bogus.

> > > Also, I seem to have missed, why are we going through this again?
> > 
> > Well, the reason I brought that up is that having basically two
> > APIs for cond_resched() is more than confusing. Basically all longer
> > in-kernel loops do cond_resched(), but it seems that this will not
> > help silence the RCU lockup detector in rare cases where nothing
> > really wants to schedule. I am really not sure whether we want to
> > sprinkle cond_resched_rcu_qs() at random places just to silence the
> > RCU detector...
> 
> Right.. now, this is obviously all PREEMPT=n code, which therefore also
> implies this is rcu-sched.
> 
> Paul, now doesn't rcu-sched, when the grace-period has been long in
> coming, try and force it? And doesn't that forcing include prodding CPUs
> with resched_cpu() ?

It does in the v4.8.4 kernel that Boris is running.  It still does in my
-rcu tree, but only after an RCU CPU stall (something about people not
liking IPIs).  I may need to do a resched_cpu() halfway to stall-warning
time or some such.

> I'm thinking not, because if it did, that would make cond_resched()
> actually schedule, which would then call into rcu_note_context_switch()
> which would then make RCU progress, no?

Sounds plausible, but from what I can see some of the loops pointed
out by Boris's stall-warning messages don't have cond_resched().
There was another workload that apparently worked better when moved from
cond_resched() to cond_resched_rcu_qs(), but I don't know what kernel
version was running.

							Thanx, Paul

------------------------------------------------------------------------

commit 42b4ae9cb79479d2f922620fd696a0532019799c
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Nov 30 11:21:21 2016 -0800

    rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead
    
    Commit 4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks
    for RCU") moved quiescent-state generation out of cond_resched()
    and commit bde6c3aa9930 ("rcu: Provide cond_resched_rcu_qs() to force
    quiescent states in long loops") introduced cond_resched_rcu_qs(), and
    commit 5cd37193ce85 ("rcu: Make cond_resched_rcu_qs() apply to normal RCU
    flavors") introduced the per-CPU rcu_qs_ctr variable, which is frequently
    polled by the RCU core state machine.
    
    This frequent polling can increase grace-period rate, which in turn
    increases grace-period overhead, which is visible in some benchmarks
    (for example, the "open1" benchmark in Anton Blanchard's "will it scale"
    suite).  This commit therefore reduces the rate at which rcu_qs_ctr
    is polled by moving that polling into the force-quiescent-state (FQS)
    machinery, and by further polling it only on the second and subsequent
    FQS passes of a given grace period.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 9d4f9b3a2b7b..e3facb356838 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -385,11 +385,11 @@ TRACE_EVENT(rcu_quiescent_state_report,
 
 /*
  * Tracepoint for quiescent states detected by force_quiescent_state().
- * These trace events include the type of RCU, the grace-period number
- * that was blocked by the CPU, the CPU itself, and the type of quiescent
- * state, which can be "dti" for dyntick-idle mode, "ofl" for CPU offline,
- * or "kick" when kicking a CPU that has been in dyntick-idle mode for
- * too long.
+ * These trace events include the type of RCU, the grace-period number that
+ * was blocked by the CPU, the CPU itself, and the type of quiescent state,
+ * which can be "dti" for dyntick-idle mode, "ofl" for CPU offline, "kick"
+ * when kicking a CPU that has been in dyntick-idle mode for too long, or
+ * "rqc" if the CPU got a quiescent state via its rcu_qs_ctr.
  */
 TRACE_EVENT(rcu_fqs,
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b546c959c854..6745f1899ad9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1275,6 +1275,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
 				    bool *isidle, unsigned long *maxj)
 {
 	int *rcrmp;
+	struct rcu_node *rnp;
 
 	/*
 	 * If the CPU passed through or entered a dynticks idle phase with
@@ -1291,6 +1292,19 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
 	}
 
 	/*
+	 * Has this CPU encountered a cond_resched_rcu_qs() since the
+	 * beginning of the grace period?  For this to be the case,
+	 * the CPU has to have noticed the current grace period.  This
+	 * might not be the case for nohz_full CPUs looping in the kernel.
+	 */
+	rnp = rdp->mynode;
+	if (READ_ONCE(rdp->rcu_qs_ctr_snap) != __this_cpu_read(rcu_qs_ctr) &&
+	    READ_ONCE(rdp->gpnum) == rnp->gpnum && !rdp->gpwrap) {
+		trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("rqc"));
+		return 1;
+	}
+
+	/*
 	 * Check for the CPU being offline, but only if the grace period
 	 * is old enough.  We don't need to worry about the CPU changing
 	 * state: If we see it offline even once, it has been through a
@@ -2588,10 +2602,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 
 	rnp = rdp->mynode;
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
-	if ((rdp->cpu_no_qs.b.norm &&
-	     rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr)) ||
-	    rdp->gpnum != rnp->gpnum || rnp->completed == rnp->gpnum ||
-	    rdp->gpwrap) {
+	if (rdp->cpu_no_qs.b.norm || rdp->gpnum != rnp->gpnum ||
+	    rnp->completed == rnp->gpnum || rdp->gpwrap) {
 
 		/*
 		 * The grace period in which this quiescent state was
@@ -2646,8 +2658,7 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
 	 * Was there a quiescent state since the beginning of the grace
 	 * period? If no, then exit and wait for the next call.
 	 */
-	if (rdp->cpu_no_qs.b.norm &&
-	    rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr))
+	if (rdp->cpu_no_qs.b.norm)
 		return;
 
 	/*
@@ -3625,9 +3636,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 	    rdp->core_needs_qs && rdp->cpu_no_qs.b.norm &&
 	    rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr)) {
 		rdp->n_rp_core_needs_qs++;
-	} else if (rdp->core_needs_qs &&
-		   (!rdp->cpu_no_qs.b.norm ||
-		    rdp->rcu_qs_ctr_snap != __this_cpu_read(rcu_qs_ctr))) {
+	} else if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm) {
 		rdp->n_rp_report_qs++;
 		return 1;
 	}


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 18:12                       ` Boris Zhmurov
  2016-11-30 18:25                         ` Michal Hocko
@ 2016-11-30 19:42                         ` Paul E. McKenney
  1 sibling, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-11-30 19:42 UTC (permalink / raw)
  To: Boris Zhmurov; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm

On Wed, Nov 30, 2016 at 09:12:52PM +0300, Boris Zhmurov wrote:
> Michal Hocko 30/11/16 20:48:
> 
> >> Well, after some testing I can say that your patch:
> >> ---------------------8<-----------------------------------
> >> commit 7cebc6b63bf75db48cb19a94564c39294fd40959
> >> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> Date:   Fri Nov 25 12:48:10 2016 -0800
> >>
> >>    mm: Prevent shrink_node_memcg() RCU CPU stall warnings
> >> ---------------------8<-----------------------------------
> >>
> >> fixes the stall warnings, and dmesg is clean now.
> > 
> > Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
> 
> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
> I can try another portion of patches, no problem :)

OK, I will keep the above patch and drop the shrink_node() patch.

							Thanx, Paul


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 19:40                                       ` Paul E. McKenney
@ 2016-12-01  5:30                                         ` Peter Zijlstra
  2016-12-01 12:40                                           ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2016-12-01  5:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Wed, Nov 30, 2016 at 11:40:19AM -0800, Paul E. McKenney wrote:

> > See commit:
> > 
> >   4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks for RCU")
> > 
> > Someone actually wrote down what the problem was.
> 
> Don't worry, it won't happen again.  ;-)
> 
> OK, so the regressions were in the "open1" test of Anton Blanchard's
> "will it scale" suite, and were due to faster (and thus more) grace
> periods rather than path length.
> 
> I could likely counter the grace-period speedup by regulating the rate
> at which the grace-period machinery pays attention to the rcu_qs_ctr
> per-CPU variable.  Actually, this looks pretty straightforward (famous
> last words).  But see patch below, which is untested and probably
> completely bogus.

Possible I suppose. Didn't look too hard at it.

> > > > Also, I seem to have missed, why are we going through this again?
> > > 
> > > > Well, the reason I brought that up is that having basically two
> > > > APIs for cond_resched() is more than confusing. Basically all longer
> > > > in-kernel loops do cond_resched(), but it seems that this will not
> > > > help silence the RCU lockup detector in rare cases where nothing
> > > > really wants to schedule. I am really not sure whether we want to
> > > > sprinkle cond_resched_rcu_qs() at random places just to silence the
> > > > RCU detector...
> > 
> > Right.. now, this is obviously all PREEMPT=n code, which therefore also
> > implies this is rcu-sched.
> > 
> > Paul, now doesn't rcu-sched, when the grace-period has been long in
> > coming, try and force it? And doesn't that forcing include prodding CPUs
> > with resched_cpu() ?
> 
> It does in the v4.8.4 kernel that Boris is running.  It still does in my
> -rcu tree, but only after an RCU CPU stall (something about people not
> liking IPIs).  I may need to do a resched_cpu() halfway to stall-warning
> time or some such.

Sure, we all dislike IPIs, but I'm thinking this half-way point is
sensible, no point in issuing user visible annoyance if indeed we can
prod things back to life, no?

Only if we utterly fail to make it respond should we bug the user with
our failure..

> > I'm thinking not, because if it did, that would make cond_resched()
> > actually schedule, which would then call into rcu_note_context_switch()
> > which would then make RCU progress, no?
> 
> Sounds plausible, but from what I can see some of the loops pointed
> out by Boris's stall-warning messages don't have cond_resched().
> There was another workload that apparently worked better when moved from
> cond_resched() to cond_resched_rcu_qs(), but I don't know what kernel
> version was running.

Egads.. cursed if you do, cursed if you don't, eh..


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01  5:30                                         ` Peter Zijlstra
@ 2016-12-01 12:40                                           ` Paul E. McKenney
  2016-12-01 16:36                                             ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-01 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Thu, Dec 01, 2016 at 06:30:35AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 30, 2016 at 11:40:19AM -0800, Paul E. McKenney wrote:
> 
> > > See commit:
> > > 
> > >   4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks for RCU")
> > > 
> > > Someone actually wrote down what the problem was.
> > 
> > Don't worry, it won't happen again.  ;-)
> > 
> > OK, so the regressions were in the "open1" test of Anton Blanchard's
> > "will it scale" suite, and were due to faster (and thus more) grace
> > periods rather than path length.
> > 
> > I could likely counter the grace-period speedup by regulating the rate
> > at which the grace-period machinery pays attention to the rcu_qs_ctr
> > per-CPU variable.  Actually, this looks pretty straightforward (famous
> > last words).  But see patch below, which is untested and probably
> > completely bogus.
> 
> Possible I suppose. Didn't look too hard at it.
> 
> > > > > Also, I seem to have missed, why are we going through this again?
> > > > 
> > > > Well, the reason I brought that up is that having basically two
> > > > APIs for cond_resched() is more than confusing. Basically all longer
> > > > in-kernel loops do cond_resched(), but it seems that this will not
> > > > help silence the RCU lockup detector in rare cases where nothing
> > > > really wants to schedule. I am really not sure whether we want to
> > > > sprinkle cond_resched_rcu_qs() at random places just to silence the
> > > > RCU detector...
> > > 
> > > Right.. now, this is obviously all PREEMPT=n code, which therefore also
> > > implies this is rcu-sched.
> > > 
> > > Paul, now doesn't rcu-sched, when the grace-period has been long in
> > > coming, try and force it? And doesn't that forcing include prodding CPUs
> > > with resched_cpu() ?
> > 
> > It does in the v4.8.4 kernel that Boris is running.  It still does in my
> > -rcu tree, but only after an RCU CPU stall (something about people not
> > liking IPIs).  I may need to do a resched_cpu() halfway to stall-warning
> > time or some such.
> 
> Sure, we all dislike IPIs, but I'm thinking this half-way point is
> sensible, no point in issuing user visible annoyance if indeed we can
> prod things back to life, no?
> 
> Only if we utterly fail to make it respond should we bug the user with
> our failure..

Sold!  ;-)

I will put together a patch later today.

My intent is to hold off on the "upgrade cond_resched()" patch, one
step at a time.  Longer term, I do very much like the idea of having
cond_resched() do both scheduling and RCU quiescent states, assuming
that this avoids performance pitfalls.

> > > I'm thinking not, because if it did, that would make cond_resched()
> > > actually schedule, which would then call into rcu_note_context_switch()
> > > which would then make RCU progress, no?
> > 
> > Sounds plausible, but from what I can see some of the loops pointed
> > out by Boris's stall-warning messages don't have cond_resched().
> > There was another workload that apparently worked better when moved from
> > cond_resched() to cond_resched_rcu_qs(), but I don't know what kernel
> > version was running.
> 
> Egads.. cursed if you do, cursed if you don't, eh..

Almost like this was real life!  ;-)

							Thanx, Paul


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 12:40                                           ` Paul E. McKenney
@ 2016-12-01 16:36                                             ` Peter Zijlstra
  2016-12-01 16:59                                               ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2016-12-01 16:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Thu, Dec 01, 2016 at 04:40:24AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 01, 2016 at 06:30:35AM +0100, Peter Zijlstra wrote:

> > Sure, we all dislike IPIs, but I'm thinking this half-way point is
> > sensible, no point in issuing user visible annoyance if indeed we can
> > prod things back to life, no?
> > 
> > Only if we utterly fail to make it respond should we bug the user with
> > our failure..
> 
> Sold!  ;-)
> 
> I will put together a patch later today.
> 
> My intent is to hold off on the "upgrade cond_resched()" patch, one
> step at a time.  Longer term, I do very much like the idea of having
> cond_resched() do both scheduling and RCU quiescent states, assuming
> that this avoids performance pitfalls.

Well, with the above change cond_resched() is already sufficient, no?

In fact, by doing the IPI thing we get the entire cond_resched*()
family, and we could add the should_resched() guard to
cond_resched_rcu().



* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 16:36                                             ` Peter Zijlstra
@ 2016-12-01 16:59                                               ` Paul E. McKenney
  2016-12-01 18:09                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-01 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Thu, Dec 01, 2016 at 05:36:14PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 01, 2016 at 04:40:24AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 01, 2016 at 06:30:35AM +0100, Peter Zijlstra wrote:
> 
> > > Sure, we all dislike IPIs, but I'm thinking this half-way point is
> > > sensible, no point in issuing user visible annoyance if indeed we can
> > > prod things back to life, no?
> > > 
> > > Only if we utterly fail to make it respond should we bug the user with
> > > our failure..
> > 
> > Sold!  ;-)
> > 
> > I will put together a patch later today.
> > 
> > My intent is to hold off on the "upgrade cond_resched()" patch, one
> > step at a time.  Longer term, I do very much like the idea of having
> > cond_resched() do both scheduling and RCU quiescent states, assuming
> > that this avoids performance pitfalls.
> 
> Well, with the above change cond_resched() is already sufficient, no?

Maybe.  Right now, cond_resched_rcu_qs() gets a quiescent state to
the RCU core in less than one jiffy; with my other change, this becomes
a handful of jiffies depending on HZ and NR_CPUS.  I expect this
increase to a handful of jiffies to be a non-event.

After my upcoming patch, cond_resched() will get a quiescent state to
the RCU core in about ten seconds.  While I am not all that nervous
about the increase from less than a jiffy to a handful of jiffies,
increasing to ten seconds via cond_resched() does make me quite nervous.
Past experience indicates that someone's kernel will likely be fatally
inconvenienced by this magnitude of change.

Or am I misunderstanding what you are proposing?

> In fact, by doing the IPI thing we get the entire cond_resched*()
> family, and we could add the should_resched() guard to
> cond_resched_rcu().

So that cond_resched_rcu_qs() looks something like this, in order
to avoid the function call in the case where the scheduler has nothing
to do?

#define cond_resched_rcu_qs() \
do { \
	if (!should_resched(current) || !cond_resched()) \
		rcu_note_voluntary_context_switch(current); \
} while (0)

							Thanx, Paul


* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 16:59                                               ` Paul E. McKenney
@ 2016-12-01 18:09                                                 ` Peter Zijlstra
  2016-12-01 18:42                                                   ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2016-12-01 18:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Thu, Dec 01, 2016 at 08:59:18AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 01, 2016 at 05:36:14PM +0100, Peter Zijlstra wrote:
> > Well, with the above change cond_resched() is already sufficient, no?
> 
> Maybe.  Right now, cond_resched_rcu_qs() gets a quiescent state to
> the RCU core in less than one jiffy; with my other change, this becomes
> a handful of jiffies depending on HZ and NR_CPUS.  I expect this
> increase to a handful of jiffies to be a non-event.
> 
> After my upcoming patch, cond_resched() will get a quiescent state to
> the RCU core in about ten seconds.  While I am not all that nervous
> about the increase from less than a jiffy to a handful of jiffies,
> increasing to ten seconds via cond_resched() does make me quite nervous.
> Past experience indicates that someone's kernel will likely be fatally
> inconvenienced by this magnitude of change.
> 
> Or am I misunderstanding what you are proposing?

No, that is indeed what I was proposing. Hurm.. OK let me ponder that a
bit. There might be a few games we can play with !PREEMPT to avoid IPIs.

Thing is, I'm slightly uncomfortable with de-coupling rcu-sched from
actual schedule() calls.

> > In fact, by doing the IPI thing we get the entire cond_resched*()
> > family, and we could add the should_resched() guard to
> > cond_resched_rcu().
> 
> So that cond_resched_rcu_qs() looks something like this, in order
> to avoid the function call in the case where the scheduler has nothing
> to do?

I was actually thinking of this:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2d0c82e1d348..2dc7d8056b2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3374,9 +3374,11 @@ static inline int signal_pending_state(long state, struct task_struct *p)
 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
-	rcu_read_unlock();
-	cond_resched();
-	rcu_read_lock();
+	if (should_resched(1)) {
+		rcu_read_unlock();
+		cond_resched();
+		rcu_read_lock();
+	}
 #endif
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 18:25                         ` Michal Hocko
  2016-11-30 18:26                           ` Boris Zhmurov
@ 2016-12-01 18:10                           ` Boris Zhmurov
  2016-12-01 19:39                             ` Paul E. McKenney
                                               ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Boris Zhmurov @ 2016-12-01 18:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: paulmck, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 921 bytes --]

Michal Hocko 30/11/16 21:25:

>>> Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
>>
>> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
>> I can try another portion of patches, no problem :)
> 
> Replacing cond_resched_rcu_qs in shrink_node_memcg by cond_resched would
> be really helpful to tell whether we are missing a real scheduling point
> or whether something more serious is going on here.

Well, I can confirm that replacing cond_resched_rcu_qs in
shrink_node_memcg with cond_resched also keeps dmesg clean of RCU CPU
stall warnings.

I've attached a patch (just a modification of Paul's patch) that fixes
the RCU stall messages in situations when all memory is used by
couchbase/memcached plus fs cache and Linux starts to use swap.


-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"

[-- Attachment #2: linux-4.8-mm-prevent-shrink_node_memcg-RCU-CPU-stall-warnings.patch --]
[-- Type: text/x-patch, Size: 297 bytes --]

--- a/mm/vmscan.c.orig	2016-11-30 21:52:58.314895320 +0300
+++ b/mm/vmscan.c	2016-11-30 21:53:29.502895320 +0300
@@ -2352,6 +2352,7 @@
 				nr_reclaimed += shrink_list(lru, nr_to_scan,
 							    lruvec, sc);
 			}
+			cond_resched();
 		}
 
 		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 18:09                                                 ` Peter Zijlstra
@ 2016-12-01 18:42                                                   ` Paul E. McKenney
  2016-12-01 18:49                                                     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-01 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Thu, Dec 01, 2016 at 07:09:53PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 01, 2016 at 08:59:18AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 01, 2016 at 05:36:14PM +0100, Peter Zijlstra wrote:
> > > Well, with the above change cond_resched() is already sufficient, no?
> > 
> > Maybe.  Right now, cond_resched_rcu_qs() gets a quiescent state to
> > the RCU core in less than one jiffy, with my other change, this becomes
> > a handful of jiffies depending on HZ and NR_CPUS.  I expect this
> > increase to a handful of jiffies to be a non-event.
> > 
> > After my upcoming patch, cond_resched() will get a quiescent state to
> > the RCU core in about ten seconds.  While I am not all that nervous
> > about the increase from less than a jiffy to a handful of jiffies,
> > increasing to ten seconds via cond_resched() does make me quite nervous.
> > Past experience indicates that someone's kernel will likely be fatally
> > inconvenienced by this magnitude of change.
> > 
> > Or am I misunderstanding what you are proposing?
> 
> No, that is indeed what I was proposing. Hurm.. OK let me ponder that a
> bit. There might be a few games we can play with !PREEMPT to avoid IPIs.
> 
> Thing is, I'm slightly uncomfortable with de-coupling rcu-sched from
> actual schedule() calls.

OK, what is the source of your discomfort?

There are several intermediate levels of evasive action:

0.	If there is another runnable task and certain other conditions
	are met, cond_resched() will invoke schedule(), which will
	provide an RCU quiescent state.

1.	All cond_resched_rcu_qs() invocations increment the CPU's
	rcu_qs_ctr per-CPU variable, which is treated by later
	invocations of RCU core as a quiescent state.  (I have
	a patch queued that causes RCU to ignore changes to this
	counter until the grace period is a few jiffies old.)

	In this case, the rcu_node locks plus smp_mb__after_unlock_lock()
	provide the needed ordering.

2.	If any cond_resched_rcu_qs() sees that an expedited grace
	period is waiting on the current CPU, it invokes rcu_sched_qs()
	to force RCU to see the quiescent state.  (To your point,
	rcu_sched_qs() is normally called from schedule(), but also
	from the scheduling-clock interrupt when it interrupts
	usermode or idle.)

	Again, the rcu_node locks plus smp_mb__after_unlock_lock()
	provide the needed ordering.

3.	If the grace period extends for more than 50 milliseconds
	(by default, tunable), all subsequent cond_resched_rcu_qs()
	invocations on that CPU turn into momentary periods of
	idleness from RCU's viewpoint.  (Atomically add 2 to the
	dyntick-idle counter.)

	Here, the atomic increment is surrounded by smp_mb__*_atomic()
	to provide the needed ordering, which should be a good substitute
	for actually passing through schedule().

4.	If the grace period extends for more than 21 seconds (by default),
	we emit an RCU CPU stall warning and then do a resched_cpu().
	I am proposing also doing a resched_cpu() halfway to RCU CPU
	stall-warning time.

5.	An RCU-sched expedited grace period does a local resched_cpu()
	from its IPI handler to force the CPU through a quiescent
	state.  (Yes, I could just invoke resched_cpu() from the
	task orchestrating the expedited grace period, but this approach
	allows more common code between RCU-preempt and RCU-sched
	expedited grace periods.)

> > > In fact, by doing the IPI thing we get the entire cond_resched*()
> > > family, and we could add the should_resched() guard to
> > > cond_resched_rcu().
> > 
> > So that cond_resched_rcu_qs() looks something like this, in order
> > to avoid the function call in the case where the scheduler has nothing
> > to do?
> 
> I was actually thinking of this:

Oh!  I had forgotten about cond_resched_rcu(), and thought you had made a typo.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2d0c82e1d348..2dc7d8056b2a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3374,9 +3374,11 @@ static inline int signal_pending_state(long state, struct task_struct *p)
>  static inline void cond_resched_rcu(void)
>  {
>  #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
> -	rcu_read_unlock();
> -	cond_resched();
> -	rcu_read_lock();
> +	if (should_resched(1)) {
> +		rcu_read_unlock();
> +		cond_resched();
> +		rcu_read_lock();
> +	}
>  #endif
>  }
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 18:42                                                   ` Paul E. McKenney
@ 2016-12-01 18:49                                                     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2016-12-01 18:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, Donald Buczek, Paul Menzel, dvteam, linux-mm,
	linux-kernel, Josh Triplett

On Thu, Dec 01, 2016 at 10:42:52AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 01, 2016 at 07:09:53PM +0100, Peter Zijlstra wrote:
> > Thing is, I'm slightly uncomfortable with de-coupling rcu-sched from
> > actual schedule() calls.
> 
> OK, what is the source of your discomfort?

Good question; after a little thought it's not much different from other
cases. So let me ponder this a bit more...


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 18:10                           ` Boris Zhmurov
@ 2016-12-01 19:39                             ` Paul E. McKenney
  2016-12-02  9:37                             ` Michal Hocko
  2016-12-02 16:39                             ` Boris Zhmurov
  2 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-01 19:39 UTC (permalink / raw)
  To: Boris Zhmurov
  Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

On Thu, Dec 01, 2016 at 09:10:01PM +0300, Boris Zhmurov wrote:
> Michal Hocko 30/11/16 21:25:
> 
> >>> Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
> >>
> >> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
> >> I can try another portion of patches, no problem :)
> > 
> > Replacing cond_resched_rcu_qs in shrink_node_memcg by cond_resched would
> > be really helpful to tell whether we are missing a real scheduling point
> > or whether something more serious is going on here.
> 
> Well, I can confirm, that replacing cond_resched_rcu_qs in
> shrink_node_memcg by cond_resched also makes dmesg clean from RCU CPU
> stall warnings.
> 
> I've attached patch (just modification of Paul's patch), that fixes RCU
> stall messages in situations, when all memory is used by
> couchbase/memcached + fs cache and linux starts to use swap.
> 
> 
> -- 
> Boris Zhmurov
> System/Network Administrator
> mailto: bb@kernelpanic.ru
> "wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"

> --- a/mm/vmscan.c.orig	2016-11-30 21:52:58.314895320 +0300
> +++ b/mm/vmscan.c	2016-11-30 21:53:29.502895320 +0300
> @@ -2352,6 +2352,7 @@
>  				nr_reclaimed += shrink_list(lru, nr_to_scan,
>  							    lruvec, sc);
>  			}
> +			cond_resched();
>  		}
> 
>  		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)

Nice!

Just to double-check, could you please also test your patch above with
these two commits from -rcu?

d2db185bfee8 ("rcu: Remove short-term CPU kicking")
f8f127e738e3 ("rcu: Add long-term CPU kicking")

							Thanx, Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-11-30 11:43                           ` Donald Buczek
@ 2016-12-02  9:14                             ` Donald Buczek
  2016-12-06  8:32                               ` Donald Buczek
  0 siblings, 1 reply; 44+ messages in thread
From: Donald Buczek @ 2016-12-02  9:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul Menzel, dvteam, linux-mm, linux-kernel, Josh Triplett,
	Paul E. McKenney, Peter Zijlstra

On 11/30/16 12:43, Donald Buczek wrote:
> On 11/30/16 12:09, Michal Hocko wrote:
>> [CCing Paul]
>>
>> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
>> [...]
>>> shrink_active_list gets and releases the spinlock and calls 
>>> cond_resched().
>>> This should give other tasks a chance to run. Just as an experiment, 
>>> I'm
>>> trying
>>>
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
>>> nr_to_scan,
>>>          spin_unlock_irq(&pgdat->lru_lock);
>>>
>>>          while (!list_empty(&l_hold)) {
>>> -               cond_resched();
>>> +               cond_resched_rcu_qs();
>>>                  page = lru_to_page(&l_hold);
>>>                  list_del(&page->lru);
>>>
>>> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see.
>> This is really interesting! Is it possible that the RCU stall detector
>> is somehow confused?
>
> Wait... 21 hours is not yet a test result.

For the record: we haven't had any stall warnings after 2 days and 20 
hours now, so I'm quite confident that my patch above fixed the 
problem for v4.8.0. On previous boots the rcu warnings started after 
37, 0.2, 1, 2, and 0.8 hours of uptime.

Now I've applied this patch to the latest stable kernel (v4.8.11) on 
another backup machine, which suffered even more rcu stalls.

Donald

> [...]

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 18:10                           ` Boris Zhmurov
  2016-12-01 19:39                             ` Paul E. McKenney
@ 2016-12-02  9:37                             ` Michal Hocko
  2016-12-02 13:52                               ` Paul E. McKenney
  2016-12-02 16:39                             ` Boris Zhmurov
  2 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-12-02  9:37 UTC (permalink / raw)
  To: Boris Zhmurov; +Cc: paulmck, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

On Thu 01-12-16 21:10:01, Boris Zhmurov wrote:
> Michal Hocko 30/11/16 21:25:
> 
> >>> Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
> >>
> >> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
> >> I can try another portion of patches, no problem :)
> > 
> > Replacing cond_resched_rcu_qs in shrink_node_memcg by cond_resched would
> > be really helpful to tell whether we are missing a real scheduling point
> > or whether something more serious is going on here.
> 
> Well, I can confirm, that replacing cond_resched_rcu_qs in
> shrink_node_memcg by cond_resched also makes dmesg clean from RCU CPU
> stall warnings.
> 
> I've attached patch (just modification of Paul's patch), that fixes RCU
> stall messages in situations, when all memory is used by
> couchbase/memcached + fs cache and linux starts to use swap.

OK, thanks for the confirmation! I will send a patch because it is true
that we do not have any scheduling point if no pages can be isolated
from the LRU. This might be what you are seeing.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-02  9:37                             ` Michal Hocko
@ 2016-12-02 13:52                               ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-02 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Boris Zhmurov, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

On Fri, Dec 02, 2016 at 10:37:35AM +0100, Michal Hocko wrote:
> On Thu 01-12-16 21:10:01, Boris Zhmurov wrote:
> > Michal Hocko 30/11/16 21:25:
> > 
> > >>> Do I get it right that s@cond_resched_rcu_qs@cond_resched@ didn't help?
> > >>
> > >> I didn't try that. I've tried 4 patches from Paul's linux-rcu tree.
> > >> I can try another portion of patches, no problem :)
> > > 
> > > Replacing cond_resched_rcu_qs in shrink_node_memcg by cond_resched would
> > > be really helpful to tell whether we are missing a real scheduling point
> > > or whether something more serious is going on here.
> > 
> > Well, I can confirm, that replacing cond_resched_rcu_qs in
> > shrink_node_memcg by cond_resched also makes dmesg clean from RCU CPU
> > stall warnings.
> > 
> > I've attached patch (just modification of Paul's patch), that fixes RCU
> > stall messages in situations, when all memory is used by
> > couchbase/memcached + fs cache and linux starts to use swap.
> 
> OK, thanks for the confirmation! I will send a patch because it is true
> that we do not have any scheduling point if no pages can be isolated
> > from the LRU. This might be what you are seeing.

Thank you both!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-01 18:10                           ` Boris Zhmurov
  2016-12-01 19:39                             ` Paul E. McKenney
  2016-12-02  9:37                             ` Michal Hocko
@ 2016-12-02 16:39                             ` Boris Zhmurov
  2016-12-02 16:44                               ` Paul E. McKenney
  2 siblings, 1 reply; 44+ messages in thread
From: Boris Zhmurov @ 2016-12-02 16:39 UTC (permalink / raw)
  To: paulmck; +Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

Paul E. McKenney Thu Dec 01 2016 - 14:39:21 EST:

>> Well, I can confirm, that replacing cond_resched_rcu_qs in 
>> shrink_node_memcg by cond_resched also makes dmesg clean from RCU 
>> CPU stall warnings.
>> 
>> I've attached patch (just modification of Paul's patch), that
>> fixes RCU stall messages in situations, when all memory is used by
>>  couchbase/memcached + fs cache and linux starts to use swap.

> Nice! Just to double-check, could you please also test your patch
> above with these two commits from -rcu?
> 
> d2db185bfee8 ("rcu: Remove short-term CPU kicking") f8f127e738e3
> ("rcu: Add long-term CPU kicking")
> 
> Thanx, Paul


It looks like patches d2db185bfee8 and f8f127e738e3 change nothing.

With cond_resched() in shrink_node_memcg and these two patches, dmesg
is clean: no RCU CPU stall messages at all.

Thanks.

-- 
Boris Zhmurov
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-02 16:39                             ` Boris Zhmurov
@ 2016-12-02 16:44                               ` Paul E. McKenney
  2016-12-02 17:02                                 ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-02 16:44 UTC (permalink / raw)
  To: Boris Zhmurov
  Cc: Michal Hocko, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

On Fri, Dec 02, 2016 at 07:39:24PM +0300, Boris Zhmurov wrote:
> Paul E. McKenney Thu Dec 01 2016 - 14:39:21 EST:
> 
> >> Well, I can confirm, that replacing cond_resched_rcu_qs in 
> >> shrink_node_memcg by cond_resched also makes dmesg clean from RCU 
> >> CPU stall warnings.
> >> 
> >> I've attached patch (just modification of Paul's patch), that
> >> fixes RCU stall messages in situations, when all memory is used by
> >>  couchbase/memcached + fs cache and linux starts to use swap.
> 
> > Nice! Just to double-check, could you please also test your patch
> > above with these two commits from -rcu?
> > 
> > d2db185bfee8 ("rcu: Remove short-term CPU kicking") f8f127e738e3
> > ("rcu: Add long-term CPU kicking")
> > 
> > Thanx, Paul
> 
> 
> Looks like patches d2db185bfee8 and f8f127e738e3 change nothing.
> 
> With cond_resched() in shrink_node_memcg and these two patches dmesg is
> clean. No any RCU CPU stall messages.

Very good!  I have these two patches queued for 4.11.

And thank you again for all the testing!!!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-02 16:44                               ` Paul E. McKenney
@ 2016-12-02 17:02                                 ` Michal Hocko
  2016-12-02 17:15                                   ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-12-02 17:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boris Zhmurov, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

On Fri 02-12-16 08:44:08, Paul E. McKenney wrote:
> On Fri, Dec 02, 2016 at 07:39:24PM +0300, Boris Zhmurov wrote:
> > Paul E. McKenney Thu Dec 01 2016 - 14:39:21 EST:
> > 
> > >> Well, I can confirm, that replacing cond_resched_rcu_qs in 
> > >> shrink_node_memcg by cond_resched also makes dmesg clean from RCU 
> > >> CPU stall warnings.
> > >> 
> > >> I've attached patch (just modification of Paul's patch), that
> > >> fixes RCU stall messages in situations, when all memory is used by
> > >>  couchbase/memcached + fs cache and linux starts to use swap.
> > 
> > > Nice! Just to double-check, could you please also test your patch
> > > above with these two commits from -rcu?
> > > 
> > > d2db185bfee8 ("rcu: Remove short-term CPU kicking") f8f127e738e3
> > > ("rcu: Add long-term CPU kicking")
> > > 
> > > Thanx, Paul
> > 
> > 
> > Looks like patches d2db185bfee8 and f8f127e738e3 change nothing.
> > 
> > With cond_resched() in shrink_node_memcg and these two patches dmesg is
> > clean. No any RCU CPU stall messages.
> 
> Very good!  I have these two patches queued for 4.11.

FWIW, I have posted the cond_resched patch to Andrew [1]. I didn't CC
you, Paul, to save you from emails, as this is more MM-related than
anything else ;)

[1] http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-02 17:02                                 ` Michal Hocko
@ 2016-12-02 17:15                                   ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2016-12-02 17:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Boris Zhmurov, Paul Menzel, Donald Buczek, linux-mm, linux-kernel

On Fri, Dec 02, 2016 at 06:02:49PM +0100, Michal Hocko wrote:
> On Fri 02-12-16 08:44:08, Paul E. McKenney wrote:
> > On Fri, Dec 02, 2016 at 07:39:24PM +0300, Boris Zhmurov wrote:
> > > Paul E. McKenney Thu Dec 01 2016 - 14:39:21 EST:
> > > 
> > > >> Well, I can confirm, that replacing cond_resched_rcu_qs in 
> > > >> shrink_node_memcg by cond_resched also makes dmesg clean from RCU 
> > > >> CPU stall warnings.
> > > >> 
> > > >> I've attached patch (just modification of Paul's patch), that
> > > >> fixes RCU stall messages in situations, when all memory is used by
> > > >>  couchbase/memcached + fs cache and linux starts to use swap.
> > > 
> > > > Nice! Just to double-check, could you please also test your patch
> > > > above with these two commits from -rcu?
> > > > 
> > > > d2db185bfee8 ("rcu: Remove short-term CPU kicking") f8f127e738e3
> > > > ("rcu: Add long-term CPU kicking")
> > > > 
> > > > Thanx, Paul
> > > 
> > > 
> > > Looks like patches d2db185bfee8 and f8f127e738e3 change nothing.
> > > 
> > > With cond_resched() in shrink_node_memcg and these two patches dmesg is
> > > clean. No any RCU CPU stall messages.
> > 
> > Very good!  I have these two patches queued for 4.11.
> 
> FWIW I have posted the cond_resched patch to Andrew [1]. I didn't CC you
> Paul to save you from emails as this is more MM than anything else
> related ;)
> 
> [1] http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org

Feel free to apply:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

							Thanx, Paul


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`
  2016-12-02  9:14                             ` Donald Buczek
@ 2016-12-06  8:32                               ` Donald Buczek
  0 siblings, 0 replies; 44+ messages in thread
From: Donald Buczek @ 2016-12-06  8:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul Menzel, dvteam, linux-mm, linux-kernel, Josh Triplett,
	Paul E. McKenney, Peter Zijlstra

On 12/02/16 10:14, Donald Buczek wrote:
> On 11/30/16 12:43, Donald Buczek wrote:
>> On 11/30/16 12:09, Michal Hocko wrote:
>>> [CCing Paul]
>>>
>>> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
>>> [...]
>>>> shrink_active_list gets and releases the spinlock and calls 
>>>> cond_resched().
>>>> This should give other tasks a chance to run. Just as an 
>>>> experiment, I'm
>>>> trying
>>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long
>>>> nr_to_scan,
>>>>          spin_unlock_irq(&pgdat->lru_lock);
>>>>
>>>>          while (!list_empty(&l_hold)) {
>>>> -               cond_resched();
>>>> +               cond_resched_rcu_qs();
>>>>                  page = lru_to_page(&l_hold);
>>>>                  list_del(&page->lru);
>>>>
>>>> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll 
>>>> see.
>>> This is really interesting! Is it possible that the RCU stall detector
>>> is somehow confused?
>>
>> Wait... 21 hours is not yet a test result.
>
> For the records: We didn't have any stall warnings after 2 days and 20 
> hours now and so I'm quite confident, that my above patch fixed the 
> problem for v4.8.0. On previous boots the rcu warnings started after 
> 37,0.2,1,2,0.8 hours uptime.
>
> Now I've applied this patch to stable latest (v4.8.11) on another 
> backup machine which suffered even more rcu stalls.
>
> Donald
>
>> [...]

For the record: after 3 days and 21 hours we got an rcu stall warning 
again [1], so my patch didn't fix it.

Now trying "[PATCH] mm, vmscan: add cond_resched into shrink_node_memcg" 
from Michal Hocko [2] on top of v4.8.12 on both servers.

[1] https://owww.molgen.mpg.de/~buczek/321322/2016-12-06.dmesg.txt
[2] https://marc.info/?i=20161202095841.16648-1-mhocko%40kernel.org

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2016-12-06  8:32 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <d6981bac-8e97-b482-98c0-40949db03ca3@kernelpanic.ru>
     [not found] ` <20161124133019.GE3612@linux.vnet.ibm.com>
     [not found]   ` <de88a72a-f861-b51f-9fb3-4265378702f1@kernelpanic.ru>
     [not found]     ` <20161125212000.GI31360@linux.vnet.ibm.com>
     [not found]       ` <20161128095825.GI14788@dhcp22.suse.cz>
     [not found]         ` <20161128105425.GY31360@linux.vnet.ibm.com>
     [not found]           ` <3a4242cb-0198-0a3b-97ae-536fb5ff83ec@kernelpanic.ru>
     [not found]             ` <20161128143435.GC3924@linux.vnet.ibm.com>
2016-11-28 14:40               ` INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node` Boris Zhmurov
2016-11-28 15:05                 ` Paul E. McKenney
2016-11-28 19:16                   ` Boris Zhmurov
2016-11-29 18:59                     ` Paul E. McKenney
2016-11-30 17:41                   ` Boris Zhmurov
2016-11-30 17:48                     ` Michal Hocko
2016-11-30 18:12                       ` Boris Zhmurov
2016-11-30 18:25                         ` Michal Hocko
2016-11-30 18:26                           ` Boris Zhmurov
2016-12-01 18:10                           ` Boris Zhmurov
2016-12-01 19:39                             ` Paul E. McKenney
2016-12-02  9:37                             ` Michal Hocko
2016-12-02 13:52                               ` Paul E. McKenney
2016-12-02 16:39                             ` Boris Zhmurov
2016-12-02 16:44                               ` Paul E. McKenney
2016-12-02 17:02                                 ` Michal Hocko
2016-12-02 17:15                                   ` Paul E. McKenney
2016-11-30 19:42                         ` Paul E. McKenney
     [not found] <20161108183938.GD4127@linux.vnet.ibm.com>
     [not found] ` <9f87f8f0-9d0f-f78f-8dca-993b09b19a69@molgen.mpg.de>
     [not found]   ` <20161116173036.GK3612@linux.vnet.ibm.com>
     [not found]     ` <20161121134130.GB18112@dhcp22.suse.cz>
     [not found]       ` <20161121140122.GU3612@linux.vnet.ibm.com>
     [not found]         ` <20161121141818.GD18112@dhcp22.suse.cz>
     [not found]           ` <20161121142901.GV3612@linux.vnet.ibm.com>
     [not found]             ` <68025f6c-6801-ab46-b0fc-a9407353d8ce@molgen.mpg.de>
     [not found]               ` <20161124101525.GB20668@dhcp22.suse.cz>
     [not found]                 ` <583AA50A.9010608@molgen.mpg.de>
     [not found]                   ` <20161128110449.GK14788@dhcp22.suse.cz>
2016-11-28 12:26                     ` Paul Menzel
2016-11-30 10:28                       ` Donald Buczek
2016-11-30 11:09                         ` Michal Hocko
2016-11-30 11:43                           ` Donald Buczek
2016-12-02  9:14                             ` Donald Buczek
2016-12-06  8:32                               ` Donald Buczek
2016-11-30 11:53                           ` Paul E. McKenney
2016-11-30 11:54                             ` Paul E. McKenney
2016-11-30 12:31                               ` Paul Menzel
2016-11-30 14:31                                 ` Paul E. McKenney
2016-11-30 13:19                             ` Michal Hocko
2016-11-30 14:29                               ` Paul E. McKenney
2016-11-30 16:38                                 ` Peter Zijlstra
2016-11-30 17:02                                   ` Paul E. McKenney
2016-11-30 17:05                                   ` Michal Hocko
2016-11-30 17:23                                     ` Paul E. McKenney
2016-11-30 17:34                                       ` Michal Hocko
2016-11-30 17:50                                     ` Peter Zijlstra
2016-11-30 19:40                                       ` Paul E. McKenney
2016-12-01  5:30                                         ` Peter Zijlstra
2016-12-01 12:40                                           ` Paul E. McKenney
2016-12-01 16:36                                             ` Peter Zijlstra
2016-12-01 16:59                                               ` Paul E. McKenney
2016-12-01 18:09                                                 ` Peter Zijlstra
2016-12-01 18:42                                                   ` Paul E. McKenney
2016-12-01 18:49                                                     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).