All of lore.kernel.org
 help / color / mirror / Atom feed
* Patch "mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress" has been added to the 4.3-stable tree
@ 2016-02-12 21:01 gregkh
  2016-02-15  9:35 ` Michal Hocko
  0 siblings, 1 reply; 3+ messages in thread
From: gregkh @ 2016-02-12 21:01 UTC (permalink / raw)
  To: mhocko, akpm, arekm, clameter, gregkh, js1304, penguin-kernel, tj,
	torvalds
  Cc: stable, stable-commits


This is a note to let you know that I've just added the patch titled

    mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress

to the 4.3-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     mm-vmstat-allow-wq-concurrency-to-discover-memory-reclaim-doesn-t-make-any-progress.patch
and it can be found in the queue-4.3 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.


>From 373ccbe5927034b55bdc80b0f8b54d6e13fe8d12 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 11 Dec 2015 13:40:32 -0800
Subject: mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress

From: Michal Hocko <mhocko@suse.com>

commit 373ccbe5927034b55bdc80b0f8b54d6e13fe8d12 upstream.

Tetsuo Handa has reported that the system might basically livelock in
OOM condition without triggering the OOM killer.

The issue is caused by internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable) which are performed from
the workqueue context.  If all the current workers get assigned to an
allocation request, though, they will be looping inside the allocator
trying to reclaim memory but zone_reclaimable can see stalled numbers so
it will consider a zone reclaimable even though it has been scanned way
too much.  WQ concurrency logic will not consider this situation as a
congested workqueue because it relies that worker would have to sleep in
such a situation.  This also means that it doesn't try to spawn new
workers or invoke the rescuer thread if the one is assigned to the
queue.

In order to fix this issue we need to do two things.  First we have to
let wq concurrency code know that we are in trouble so we have to do a
short sleep.  In order to prevent from issues handled by 0e093d99763e
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads which are
the ones of the interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/backing-dev.c |   19 ++++++++++++++++---
 mm/vmstat.c      |    6 ++++--
 2 files changed, 20 insertions(+), 5 deletions(-)

--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zon
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1357,6 +1357,7 @@ static const struct file_operations proc
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1369,7 +1370,7 @@ static void vmstat_update(struct work_st
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
 			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	} else {
@@ -1438,7 +1439,7 @@ static void vmstat_shepherd(struct work_
 		if (need_update(cpu) &&
 			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
 
-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
 				&per_cpu(vmstat_work, cpu), 0);
 
 	put_online_cpus();
@@ -1527,6 +1528,7 @@ static int __init setup_vmstat(void)
 
 	start_shepherd_timer();
 	cpu_notifier_register_done();
+	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);


Patches currently in stable-queue which might be from mhocko@suse.com are

queue-4.3/mm-oom_kill.c-reverse-the-order-of-setting-tif_memdie-and-sending-sigkill.patch
queue-4.3/mm-vmstat-allow-wq-concurrency-to-discover-memory-reclaim-doesn-t-make-any-progress.patch
queue-4.3/memcg-fix-thresholds-for-32b-architectures.patch

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Patch "mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress" has been added to the 4.3-stable tree
  2016-02-12 21:01 Patch "mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress" has been added to the 4.3-stable tree gregkh
@ 2016-02-15  9:35 ` Michal Hocko
  2016-02-15 18:09   ` Greg KH
  0 siblings, 1 reply; 3+ messages in thread
From: Michal Hocko @ 2016-02-15  9:35 UTC (permalink / raw)
  To: gregkh
  Cc: akpm, arekm, clameter, js1304, penguin-kernel, tj, torvalds,
	stable, stable-commits

Hi Greg,
it turned out that this patch is incomplete and 564e81a57f97 ("mm,
vmstat: fix wrong WQ sleep when memory reclaim doesn't make any
progress") is needed as a follow up fix. With just this patch backported
the described issue still persists.

On Fri 12-02-16 13:01:21, Greg KH wrote:
> >From 373ccbe5927034b55bdc80b0f8b54d6e13fe8d12 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 11 Dec 2015 13:40:32 -0800
> Subject: mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
> 
> From: Michal Hocko <mhocko@suse.com>
> 
> commit 373ccbe5927034b55bdc80b0f8b54d6e13fe8d12 upstream.
> 
> Tetsuo Handa has reported that the system might basically livelock in
> OOM condition without triggering the OOM killer.
> 
> The issue is caused by internal dependency of the direct reclaim on
> vmstat counter updates (via zone_reclaimable) which are performed from
> the workqueue context.  If all the current workers get assigned to an
> allocation request, though, they will be looping inside the allocator
> trying to reclaim memory but zone_reclaimable can see stalled numbers so
> it will consider a zone reclaimable even though it has been scanned way
> too much.  WQ concurrency logic will not consider this situation as a
> congested workqueue because it relies that worker would have to sleep in
> such a situation.  This also means that it doesn't try to spawn new
> workers or invoke the rescuer thread if the one is assigned to the
> queue.
> 
> In order to fix this issue we need to do two things.  First we have to
> let wq concurrency code know that we are in trouble so we have to do a
> short sleep.  In order to prevent from issues handled by 0e093d99763e
> ("writeback: do not sleep on the congestion queue if there are no
> congested BDIs or if significant congestion is not being encountered in
> the current zone") we limit the sleep only to worker threads which are
> the ones of the interest anyway.
> 
> The second thing to do is to create a dedicated workqueue for vmstat and
> mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
> have a spare worker thread for it.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Cristopher Lameter <clameter@sgi.com>
> Cc: Joonsoo Kim <js1304@gmail.com>
> Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> ---
>  mm/backing-dev.c |   19 ++++++++++++++++---
>  mm/vmstat.c      |    6 ++++--
>  2 files changed, 20 insertions(+), 5 deletions(-)
> 
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
>   * jiffies for either a BDI to exit congestion of the given @sync queue
>   * or a write to complete.
>   *
> - * In the absence of zone congestion, cond_resched() is called to yield
> - * the processor if necessary but otherwise does not sleep.
> + * In the absence of zone congestion, a short sleep or a cond_resched is
> + * performed to yield the processor and to allow other subsystems to make
> + * a forward progress.
>   *
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
> @@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zon
>  	 */
>  	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
>  	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
> -		cond_resched();
> +
> +		/*
> +		 * Memory allocation/reclaim might be called from a WQ
> +		 * context and the current implementation of the WQ
> +		 * concurrency control doesn't recognize that a particular
> +		 * WQ is congested if the worker thread is looping without
> +		 * ever sleeping. Therefore we have to do a short sleep
> +		 * here rather than calling cond_resched().
> +		 */
> +		if (current->flags & PF_WQ_WORKER)
> +			schedule_timeout(1);
> +		else
> +			cond_resched();
>  
>  		/* In case we scheduled, work out time remaining */
>  		ret = timeout - (jiffies - start);
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1357,6 +1357,7 @@ static const struct file_operations proc
>  #endif /* CONFIG_PROC_FS */
>  
>  #ifdef CONFIG_SMP
> +static struct workqueue_struct *vmstat_wq;
>  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
>  static cpumask_var_t cpu_stat_off;
> @@ -1369,7 +1370,7 @@ static void vmstat_update(struct work_st
>  		 * to occur in the future. Keep on running the
>  		 * update worker thread.
>  		 */
> -		schedule_delayed_work_on(smp_processor_id(),
> +		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
>  			this_cpu_ptr(&vmstat_work),
>  			round_jiffies_relative(sysctl_stat_interval));
>  	} else {
> @@ -1438,7 +1439,7 @@ static void vmstat_shepherd(struct work_
>  		if (need_update(cpu) &&
>  			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
>  
> -			schedule_delayed_work_on(cpu,
> +			queue_delayed_work_on(cpu, vmstat_wq,
>  				&per_cpu(vmstat_work, cpu), 0);
>  
>  	put_online_cpus();
> @@ -1527,6 +1528,7 @@ static int __init setup_vmstat(void)
>  
>  	start_shepherd_timer();
>  	cpu_notifier_register_done();
> +	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
>  #endif
>  #ifdef CONFIG_PROC_FS
>  	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
> 
> 
> Patches currently in stable-queue which might be from mhocko@suse.com are
> 
> queue-4.3/mm-oom_kill.c-reverse-the-order-of-setting-tif_memdie-and-sending-sigkill.patch
> queue-4.3/mm-vmstat-allow-wq-concurrency-to-discover-memory-reclaim-doesn-t-make-any-progress.patch
> queue-4.3/memcg-fix-thresholds-for-32b-architectures.patch

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Patch "mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress" has been added to the 4.3-stable tree
  2016-02-15  9:35 ` Michal Hocko
@ 2016-02-15 18:09   ` Greg KH
  0 siblings, 0 replies; 3+ messages in thread
From: Greg KH @ 2016-02-15 18:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, arekm, clameter, js1304, penguin-kernel, tj, torvalds,
	stable, stable-commits

On Mon, Feb 15, 2016 at 10:35:11AM +0100, Michal Hocko wrote:
> Hi Greg,
> it turned out that this patch is incomplete and 564e81a57f97 ("mm,
> vmstat: fix wrong WQ sleep when memory reclaim doesn't make any
> progress") is needed as a follow up fix. With just this patch backported
> the described issue still persists.

Thanks for letting me know, I'll go queue this one up as well.

greg k-h

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-02-15 18:09 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-12 21:01 Patch "mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress" has been added to the 4.3-stable tree gregkh
2016-02-15  9:35 ` Michal Hocko
2016-02-15 18:09   ` Greg KH

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.