[PATCH] mm: Fix kswapd livelock on single core, no preempt kernel

* [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
@ 2011-12-13 17:44 ` Mike Waychison
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-13 17:44 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner
  Cc: linux-mm, linux-kernel, Hugh Dickens, Greg Thelen, Mike Waychison

On a single core system with kernel preemption disabled, it is possible
for the memory system to be so taxed that kswapd cannot make any forward
progress.  This can happen when most of system memory is tied up as
anonymous memory without swap enabled, causing kswapd to consistently
fail to achieve its watermark goals.  In turn, sleeping_prematurely()
will consistently return true and kswapd_try_to_sleep() will never invoke
schedule().  This causes the kswapd thread to stay on the CPU in
perpetuity and keeps other threads from processing oom-kills to reclaim
memory.

The cond_resched() instance in balance_pgdat() is never called as the
loop that iterates from DEF_PRIORITY down to 0 will always set
all_zones_ok to true, and not set it to false once we've passed
DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
considered in the "all_zones_ok" evaluation.

This change modifies kswapd_try_to_sleep() to ensure that we enter the
scheduler at least once per invocation if needed.  This allows kswapd to
get off the CPU and allows other threads to die off from the OOM killer
(freeing memory that is otherwise unavailable in the process).

Signed-off-by: Mike Waychison <mikew@google.com>
---
 mm/vmscan.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f54a05b..aad70c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2794,6 +2794,7 @@ out:
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	long remaining = 0;
+	bool slept = false;
 	DEFINE_WAIT(wait);

 	if (freezing(current) || kthread_should_stop())
@@ -2806,6 +2807,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+		slept = true;
 	}

 	/*
@@ -2826,6 +2828,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 		schedule();
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+		slept = true;
 	} else {
 		if (remaining)
 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2833,6 +2836,14 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
 	finish_wait(&pgdat->kswapd_wait, &wait);
+	/*
+	 * If we did not sleep already, there is a chance that we will sit on
+	 * the CPU thrashing without making any forward progress.  This can
+	 * lead to a livelock on a single CPU system without kernel pre-emption,
+	 * so introduce a voluntary context switch.
+	 */
+	if (!slept)
+		cond_resched();
 }

 /*
-- 
1.7.3.1
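
For readers following the control flow, here is a minimal userspace sketch
of what the patch changes.  It is not kernel code: struct kswapd_model,
model_try_to_sleep(), model_run() and the sched_entries counter are
hypothetical stand-ins, and the two sleeping_prematurely() checks are
collapsed into one flag that never changes, which matches the hang scenario
described above.  The real cond_resched() only switches when a reschedule
is pending; the model simply counts it as a chance to leave the CPU.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct kswapd_model {
	bool premature;      /* what sleeping_prematurely() keeps returning */
	int  sched_entries;  /* how often kswapd offered the CPU to others */
};

/* Models one call to kswapd_try_to_sleep(), with or without the patch. */
static void model_try_to_sleep(struct kswapd_model *m, bool patched)
{
	bool slept = false;

	/* Short nap: stands in for schedule_timeout(HZ/10). */
	if (!m->premature) {
		m->sched_entries++;
		slept = true;
	}

	/* Full sleep: stands in for schedule(). */
	if (!m->premature) {
		m->sched_entries++;
		slept = true;
	}

	/* The patch: if neither sleep happened, yield voluntarily. */
	if (patched && !slept)
		m->sched_entries++;	/* stands in for cond_resched() */
}

/* Runs `calls` invocations and returns how often the CPU was offered. */
static int model_run(bool premature, bool patched, int calls)
{
	struct kswapd_model m = { .premature = premature, .sched_entries = 0 };

	for (int i = 0; i < calls; i++)
		model_try_to_sleep(&m, patched);
	return m.sched_entries;
}
```

With the flag stuck at true, model_run(true, false, 100) returns 0, meaning
kswapd never leaves the CPU, while model_run(true, true, 100) returns 100.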

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-13 17:44 ` Mike Waychison
@ 2011-12-14  2:24   ` Shaohua Li
  -1 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2011-12-14  2:24 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
> On a single core system with kernel preemption disabled, it is possible
> for the memory system to be so taxed that kswapd cannot make any forward
> progress.  This can happen when most of system memory is tied up as
> anonymous memory without swap enabled, causing kswapd to consistently
> fail to achieve its watermark goals.  In turn, sleeping_prematurely()
> will consistently return true and kswapd_try_to_sleep() will never invoke
> schedule().  This causes the kswapd thread to stay on the CPU in
> perpetuity and keeps other threads from processing oom-kills to reclaim
> memory.
> 
> The cond_resched() instance in balance_pgdat() is never called as the
> loop that iterates from DEF_PRIORITY down to 0 will always set
> all_zones_ok to true, and not set it to false once we've passed
> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
> considered in the "all_zones_ok" evaluation.
> 
> This change modifies kswapd_try_to_sleep to ensure that we enter
> scheduler at least once per invocation if needed.  This allows kswapd to
> get off the CPU and allows other threads to die off from the OOM killer
> (freeing memory that is otherwise unavailable in the process).

Your description suggests zones with all_unreclaimable set.  But in that
case sleeping_prematurely() will return false instead of true, and kswapd
will then sleep.  Is there anything I missed?

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-14  2:24   ` Shaohua Li
@ 2011-12-14  4:36     ` Mike Waychison
  -1 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14  4:36 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
>> On a single core system with kernel preemption disabled, it is possible
>> for the memory system to be so taxed that kswapd cannot make any forward
>> progress.  This can happen when most of system memory is tied up as
>> anonymous memory without swap enabled, causing kswapd to consistently
>> fail to achieve its watermark goals.  In turn, sleeping_prematurely()
>> will consistently return true and kswapd_try_to_sleep() will never invoke
>> schedule().  This causes the kswapd thread to stay on the CPU in
>> perpetuity and keeps other threads from processing oom-kills to reclaim
>> memory.
>>
>> The cond_resched() instance in balance_pgdat() is never called as the
>> loop that iterates from DEF_PRIORITY down to 0 will always set
>> all_zones_ok to true, and not set it to false once we've passed
>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
>> considered in the "all_zones_ok" evaluation.
>>
>> This change modifies kswapd_try_to_sleep to ensure that we enter
>> scheduler at least once per invocation if needed.  This allows kswapd to
>> get off the CPU and allows other threads to die off from the OOM killer
>> (freeing memory that is otherwise unavailable in the process).
> your description suggests zones with all_unreclaimable set. but in this
> case sleeping_prematurely() will return false instead of true, kswapd
> will do sleep then. is there anything I missed?

While debugging this, I didn't get a dump from the OOM killer, as it never
ran (until I binary-patched a cond_resched() into live hung machines;
this reproduced in a VM).

I was however able to capture the following data while it was hung:


/cloud/vmm/host/backend/perfmetric/node0/zone0/active_anon : long long = 773
/cloud/vmm/host/backend/perfmetric/node0/zone0/active_file : long long = 6
/cloud/vmm/host/backend/perfmetric/node0/zone0/anon_pages : long long = 1,329
/cloud/vmm/host/backend/perfmetric/node0/zone0/bounce : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/dirtied : long long = 4,425
/cloud/vmm/host/backend/perfmetric/node0/zone0/file_dirty : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/file_mapped : long long = 5
/cloud/vmm/host/backend/perfmetric/node0/zone0/file_pages : long long = 330
/cloud/vmm/host/backend/perfmetric/node0/zone0/free_pages : long long = 2,018
/cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_anon : long long = 865
/cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_file : long long = 13
/cloud/vmm/host/backend/perfmetric/node0/zone0/kernel_stack : long long = 10
/cloud/vmm/host/backend/perfmetric/node0/zone0/mlock : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/pagetable : long long = 74
/cloud/vmm/host/backend/perfmetric/node0/zone0/shmem : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/slab_reclaimable : long long = 54
/cloud/vmm/host/backend/perfmetric/node0/zone0/slab_unreclaimable : long long = 130
/cloud/vmm/host/backend/perfmetric/node0/zone0/unevictable : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/writeback : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/written : long long = 47,184

/cloud/vmm/host/backend/perfmetric/node0/zone1/active_anon : long long = 359,251
/cloud/vmm/host/backend/perfmetric/node0/zone1/active_file : long long = 67
/cloud/vmm/host/backend/perfmetric/node0/zone1/anon_pages : long long = 441,180
/cloud/vmm/host/backend/perfmetric/node0/zone1/bounce : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/dirtied : long long = 6,457,125
/cloud/vmm/host/backend/perfmetric/node0/zone1/file_dirty : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/file_mapped : long long = 134
/cloud/vmm/host/backend/perfmetric/node0/zone1/file_pages : long long = 38,090
/cloud/vmm/host/backend/perfmetric/node0/zone1/free_pages : long long = 1,630
/cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_anon : long long = 119,779
/cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_file : long long = 81
/cloud/vmm/host/backend/perfmetric/node0/zone1/kernel_stack : long long = 173
/cloud/vmm/host/backend/perfmetric/node0/zone1/mlock : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/pagetable : long long = 15,222
/cloud/vmm/host/backend/perfmetric/node0/zone1/shmem : long long = 1
/cloud/vmm/host/backend/perfmetric/node0/zone1/slab_reclaimable : long long = 1,677
/cloud/vmm/host/backend/perfmetric/node0/zone1/slab_unreclaimable : long long = 7,152
/cloud/vmm/host/backend/perfmetric/node0/zone1/unevictable : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/writeback : long long = 8
/cloud/vmm/host/backend/perfmetric/node0/zone1/written : long long = 16,639,708

These values were static while the machine was hung up in kswapd.  I
unfortunately don't have the low/min/max or lowmem watermarks handy.
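
As a rough sanity check on these numbers, assuming each counter is a page
count, the LRU lists in zone1 are almost entirely anonymous.  The helper
name anon_lru_share() is made up for this sketch:

```c
/* Share of LRU pages that are anonymous, given the four LRU list sizes. */
static double anon_lru_share(long active_anon, long inactive_anon,
			     long active_file, long inactive_file)
{
	long anon = active_anon + inactive_anon;
	long file = active_file + inactive_file;

	if (anon + file == 0)
		return 0.0;	/* empty LRU: avoid dividing by zero */
	return (double)anon / (double)(anon + file);
}
```

For zone1 above, anon_lru_share(359251, 119779, 67, 81) is 479030 / 479178,
a little over 99.9%, so there is next to nothing on the file lists for
kswapd to reclaim.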

From stepping through with gdb, I was able to determine that
ZONE_DMA32 would fail zone_watermark_ok_safe(), causing a scan up to
end_zone == 1.  If memory serves, it would not get the
->all_unreclaimable flag.  I didn't get the chance to root-cause this
internal inconsistency though.

FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
and swap-enabled.

If I get the chance, I can reproduce and look at this closer to try
and root cause why zone_reclaimable() would return true, but I won't
be able to do that until after the holidays -- sometime in January.

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-14  4:36     ` Mike Waychison
@ 2011-12-14  4:45       ` Mike Waychison
  -1 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14  4:45 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Tue, Dec 13, 2011 at 8:36 PM, Mike Waychison <mikew@google.com> wrote:
> On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
>> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
>>> On a single core system with kernel preemption disabled, it is possible
>>> for the memory system to be so taxed that kswapd cannot make any forward
>>> progress.  This can happen when most of system memory is tied up as
>>> anonymous memory without swap enabled, causing kswapd to consistently
>>> fail to achieve its watermark goals.  In turn, sleeping_prematurely()
>>> will consistently return true and kswapd_try_to_sleep() will never invoke
>>> schedule().  This causes the kswapd thread to stay on the CPU in
>>> perpetuity and keeps other threads from processing oom-kills to reclaim
>>> memory.
>>>
>>> The cond_resched() instance in balance_pgdat() is never called as the
>>> loop that iterates from DEF_PRIORITY down to 0 will always set
>>> all_zones_ok to true, and not set it to false once we've passed
>>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
>>> considered in the "all_zones_ok" evaluation.
>>>
>>> This change modifies kswapd_try_to_sleep to ensure that we enter
>>> scheduler at least once per invocation if needed.  This allows kswapd to
>>> get off the CPU and allows other threads to die off from the OOM killer
>>> (freeing memory that is otherwise unavailable in the process).
>> your description suggests zones with all_unreclaimable set. but in this
>> case sleeping_prematurely() will return false instead of true, kswapd
>> will do sleep then. is there anything I missed?

Actually, I don't see where sleeping_prematurely() would return false
if any zone has ->all_unreclaimable set.  In this case, the order was
0, so we return !all_zones_ok, and all_zones_ok is false because
!zone_watermark_ok_safe(ZONE_DMA32).
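
To make the order-0 case concrete, here is a small userspace model of the
check as described in this thread.  It is a sketch, not the kernel
function: struct zone_model and model_sleeping_prematurely() are
hypothetical names, watermark_ok stands in for the result of
zone_watermark_ok_safe(), and the assumption that all_unreclaimable zones
are skipped (treated as balanced) follows the discussion above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone state; the fields only mirror the thread. */
struct zone_model {
	bool populated;
	bool all_unreclaimable;
	bool watermark_ok;	/* stands in for zone_watermark_ok_safe() */
};

/* Order-0 model: returns true when sleeping now would be premature. */
static bool model_sleeping_prematurely(const struct zone_model *zones,
				       size_t nr_zones)
{
	bool all_zones_ok = true;

	for (size_t i = 0; i < nr_zones; i++) {
		if (!zones[i].populated)
			continue;
		/* Assumed: unreclaimable zones count as balanced. */
		if (zones[i].all_unreclaimable)
			continue;
		if (!zones[i].watermark_ok)
			all_zones_ok = false;
	}
	return !all_zones_ok;
}
```

With one zone marked all_unreclaimable and another failing its watermark
without the flag, the model returns true, so kswapd does not sleep.  If
every failing zone carries the flag, it returns false and kswapd sleeps,
which is the case Shaohua describes.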

>
> Debugging this, I didn't get a dump from oom-kill as it never ran
> (until I binary patched in a cond_resched() into live hung machines --
> this reproduced in a VM).
>
> I was however able to capture the following data while it was hung:
>
>
> /cloud/vmm/host/backend/perfmetric/node0/zone0/active_anon : long long = 773
> /cloud/vmm/host/backend/perfmetric/node0/zone0/active_file : long long = 6
> /cloud/vmm/host/backend/perfmetric/node0/zone0/anon_pages : long long = 1,329
> /cloud/vmm/host/backend/perfmetric/node0/zone0/bounce : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/dirtied : long long = 4,425
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_dirty : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_mapped : long long = 5
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_pages : long long = 330
> /cloud/vmm/host/backend/perfmetric/node0/zone0/free_pages : long long = 2,018
> /cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_anon : long long = 865
> /cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_file : long long = 13
> /cloud/vmm/host/backend/perfmetric/node0/zone0/kernel_stack : long long = 10
> /cloud/vmm/host/backend/perfmetric/node0/zone0/mlock : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/pagetable : long long = 74
> /cloud/vmm/host/backend/perfmetric/node0/zone0/shmem : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/slab_reclaimable : long long = 54
> /cloud/vmm/host/backend/perfmetric/node0/zone0/slab_unreclaimable : long long = 130
> /cloud/vmm/host/backend/perfmetric/node0/zone0/unevictable : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/writeback : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/written : long long = 47,184
>
> /cloud/vmm/host/backend/perfmetric/node0/zone1/active_anon : long long = 359,251
> /cloud/vmm/host/backend/perfmetric/node0/zone1/active_file : long long = 67
> /cloud/vmm/host/backend/perfmetric/node0/zone1/anon_pages : long long = 441,180
> /cloud/vmm/host/backend/perfmetric/node0/zone1/bounce : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/dirtied : long long = 6,457,125
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_dirty : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_mapped : long long = 134
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_pages : long long = 38,090
> /cloud/vmm/host/backend/perfmetric/node0/zone1/free_pages : long long = 1,630
> /cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_anon : long long = 119,779
> /cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_file : long long = 81
> /cloud/vmm/host/backend/perfmetric/node0/zone1/kernel_stack : long long = 173
> /cloud/vmm/host/backend/perfmetric/node0/zone1/mlock : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/pagetable : long long = 15,222
> /cloud/vmm/host/backend/perfmetric/node0/zone1/shmem : long long = 1
> /cloud/vmm/host/backend/perfmetric/node0/zone1/slab_reclaimable : long long = 1,677
> /cloud/vmm/host/backend/perfmetric/node0/zone1/slab_unreclaimable : long long = 7,152
> /cloud/vmm/host/backend/perfmetric/node0/zone1/unevictable : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/writeback : long long = 8
> /cloud/vmm/host/backend/perfmetric/node0/zone1/written : long long = 16,639,708
>
> These values were static while the machine was hung up in kswapd.  I
> unfortunately don't have the low/min/max or lowmem watermarks handy.
>
> From stepping through with gdb, I was able to determine that
> ZONE_DMA32 would fail zone_watermark_ok_safe(), causing a scan up to
> end_zone == 1.  If memory serves, it would not get the
> ->all_unreclaimable flag.  I didn't get the chance to root cause this
> internal inconsistency though.
>
> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
> and swap-enabled.
>
> If I get the chance, I can reproduce and look at this closer to try
> and root cause why zone_reclaimable() would return true, but I won't
> be able to do that until after the holidays -- sometime in January.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
@ 2011-12-14  4:45       ` Mike Waychison
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14  4:45 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Tue, Dec 13, 2011 at 8:36 PM, Mike Waychison <mikew@google.com> wrote:
> On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
>> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
>>> On a single core system with kernel preemption disabled, it is possible
>>> for the memory system to be so taxed that kswapd cannot make any forward
>>> progress.  This can happen when most of system memory is tied up as
>>> anonymous memory without swap enabled, causing kswapd to consistently
>>> fail to achieve its watermark goals.  In turn, sleeping_prematurely()
>>> will consistently return true and kswapd_try_to_sleep() to never invoke
>>> schedule().  This causes the kswapd thread to stay on the CPU in
>>> perpetuity and keeps other threads from processing oom-kills to reclaim
>>> memory.
>>>
>>> The cond_resched() instance in balance_pgdat() is never called as the
>>> loop that iterates from DEF_PRIORITY down to 0 will always set
>>> all_zones_ok to true, and not set it to false once we've passed
>>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
>>> considered in the "all_zones_ok" evaluation.
>>>
>>> This change modifies kswapd_try_to_sleep to ensure that we enter
>>> scheduler at least once per invocation if needed.  This allows kswapd to
>>> get off the CPU and allows other threads to die off from the OOM killer
>>> (freeing memory that is otherwise unavailable in the process).
>> your description suggests zones with all_unreclaimable set. but in this
>> case sleeping_prematurely() will return false instead of true, kswapd
>> will do sleep then. is there anything I missed?

Actually, I don't see where sleeping_prematurely() would return false
if any zone has ->all_unreclaimable set.  In this case, the order was
0, so we return !all_zones_ok.  all_zones_ok is false here because
!zone_watermark_ok_safe(ZONE_DMA32), so sleeping_prematurely() returns
true.
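
To make the order-0 path concrete, here is a minimal userspace sketch of
the decision being discussed.  It is not the kernel code: struct
zone_model and sleeping_prematurely_model() are hypothetical names, the
watermark check is reduced to a boolean, and the skip of
->all_unreclaimable zones is modelled the way Shaohua describes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct zone; not the kernel type. */
struct zone_model {
	bool populated;          /* zone has any pages at all */
	bool all_unreclaimable;  /* reclaim has given up on this zone */
	bool watermark_ok;       /* stands in for zone_watermark_ok_safe() */
};

/*
 * Order-0 model only: returns true when kswapd should NOT sleep yet,
 * i.e. when some zone that is still considered reclaimable is below
 * its watermark.
 */
static bool sleeping_prematurely_model(const struct zone_model *zones,
				       size_t nr_zones)
{
	bool all_zones_ok = true;
	size_t i;

	assert(zones != NULL || nr_zones == 0);

	for (i = 0; i < nr_zones; i++) {
		if (!zones[i].populated)
			continue;
		/* Assumption: unreclaimable zones count as balanced. */
		if (zones[i].all_unreclaimable)
			continue;
		if (!zones[i].watermark_ok)
			all_zones_ok = false;
	}

	return !all_zones_ok;
}
```

With one zone below its watermark and not flagged ->all_unreclaimable,
the model returns true (do not sleep), which matches what I saw.  Setting
the flag on that zone flips the result to false.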

>
> Debugging this, I didn't get a dump from oom-kill as it never ran
> (until I binary patched in a cond_resched() into live hung machines --
> this reproduced in a VM).
>
> I was however able to capture the following data while it was hung:
>
>
> /cloud/vmm/host/backend/perfmetric/node0/zone0/active_anon : long long = 773
> /cloud/vmm/host/backend/perfmetric/node0/zone0/active_file : long long = 6
> /cloud/vmm/host/backend/perfmetric/node0/zone0/anon_pages : long long = 1,329
> /cloud/vmm/host/backend/perfmetric/node0/zone0/bounce : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/dirtied : long long = 4,425
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_dirty : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_mapped : long long = 5
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_pages : long long = 330
> /cloud/vmm/host/backend/perfmetric/node0/zone0/free_pages : long long = 2,018
> /cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_anon : long long = 865
> /cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_file : long long = 13
> /cloud/vmm/host/backend/perfmetric/node0/zone0/kernel_stack : long long = 10
> /cloud/vmm/host/backend/perfmetric/node0/zone0/mlock : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/pagetable : long long = 74
> /cloud/vmm/host/backend/perfmetric/node0/zone0/shmem : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/slab_reclaimable : long long = 54
> /cloud/vmm/host/backend/perfmetric/node0/zone0/slab_unreclaimable : long long = 130
> /cloud/vmm/host/backend/perfmetric/node0/zone0/unevictable : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/writeback : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/written : long long = 47,184
>
> /cloud/vmm/host/backend/perfmetric/node0/zone1/active_anon : long long = 359,251
> /cloud/vmm/host/backend/perfmetric/node0/zone1/active_file : long long = 67
> /cloud/vmm/host/backend/perfmetric/node0/zone1/anon_pages : long long = 441,180
> /cloud/vmm/host/backend/perfmetric/node0/zone1/bounce : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/dirtied : long long = 6,457,125
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_dirty : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_mapped : long long = 134
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_pages : long long = 38,090
> /cloud/vmm/host/backend/perfmetric/node0/zone1/free_pages : long long = 1,630
> /cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_anon : long long = 119,779
> /cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_file : long long = 81
> /cloud/vmm/host/backend/perfmetric/node0/zone1/kernel_stack : long long = 173
> /cloud/vmm/host/backend/perfmetric/node0/zone1/mlock : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/pagetable : long long = 15,222
> /cloud/vmm/host/backend/perfmetric/node0/zone1/shmem : long long = 1
> /cloud/vmm/host/backend/perfmetric/node0/zone1/slab_reclaimable : long long = 1,677
> /cloud/vmm/host/backend/perfmetric/node0/zone1/slab_unreclaimable : long long = 7,152
> /cloud/vmm/host/backend/perfmetric/node0/zone1/unevictable : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/writeback : long long = 8
> /cloud/vmm/host/backend/perfmetric/node0/zone1/written : long long = 16,639,708
>
> These values were static while the machine was hung up in kswapd.  I
> unfortunately don't have the low/min/max or lowmem watermarks handy.
>
> From stepping through with gdb, I was able to determine that
> ZONE_DMA32 would fail zone_watermark_ok_safe(), causing a scan up to
> end_zone == 1.  If memory serves, it would not get the
> ->all_unreclaimable flag.  I didn't get the chance to root cause this
> internal inconsistency though.
>
> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
> and swap-enabled.
>
> If I get the chance, I can reproduce and look at this closer to try
> and root cause why zone_reclaimable() would return true, but I won't
> be able to do that until after the holidays -- sometime in January.
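
As a quick check on the zone1 counters quoted above, the sketch below
adds up the LRU lists and computes how small the file-backed share is.
The struct and helper names are hypothetical; the numbers are copied
from the dump.  The reading that only the file lists are reclaimable
assumes anonymous pages cannot be swapped out, as in the patch
description.

```c
#include <assert.h>

/* LRU page counts for one zone, as reported in the dump. */
struct lru_counts {
	long active_anon;
	long inactive_anon;
	long active_file;
	long inactive_file;
};

/* zone1 values copied from the counters quoted in this message. */
static const struct lru_counts zone1 = {
	.active_anon   = 359251,
	.inactive_anon = 119779,
	.active_file   = 67,
	.inactive_file = 81,
};

static long lru_anon(const struct lru_counts *c)
{
	return c->active_anon + c->inactive_anon;
}

static long lru_file(const struct lru_counts *c)
{
	return c->active_file + c->inactive_file;
}

/* File-backed share of all LRU pages, in parts per million (truncated). */
static long file_share_ppm(const struct lru_counts *c)
{
	long total = lru_anon(c) + lru_file(c);

	assert(total > 0);
	return lru_file(c) * 1000000L / total;
}
```

For zone1 this gives 479,030 anonymous LRU pages against 148 file LRU
pages, so roughly 0.03% of the LRU is file-backed.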

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-14  4:45       ` Mike Waychison
@ 2011-12-15  1:06         ` Shaohua Li
  -1 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2011-12-15  1:06 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Wed, 2011-12-14 at 12:45 +0800, Mike Waychison wrote:
> On Tue, Dec 13, 2011 at 8:36 PM, Mike Waychison <mikew@google.com> wrote:
> > On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
> >>> On a single core system with kernel preemption disabled, it is possible
> >>> for the memory system to be so taxed that kswapd cannot make any forward
> >>> progress.  This can happen when most of system memory is tied up as
> >>> anonymous memory without swap enabled, causing kswapd to consistently
> >>> fail to achieve its watermark goals.  In turn, sleeping_prematurely()
> >>> will consistently return true and kswapd_try_to_sleep() to never invoke
> >>> schedule().  This causes the kswapd thread to stay on the CPU in
> >>> perpetuity and keeps other threads from processing oom-kills to reclaim
> >>> memory.
> >>>
> >>> The cond_resched() instance in balance_pgdat() is never called as the
> >>> loop that iterates from DEF_PRIORITY down to 0 will always set
> >>> all_zones_ok to true, and not set it to false once we've passed
> >>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
> >>> considered in the "all_zones_ok" evaluation.
> >>>
> >>> This change modifies kswapd_try_to_sleep to ensure that we enter
> >>> scheduler at least once per invocation if needed.  This allows kswapd to
> >>> get off the CPU and allows other threads to die off from the OOM killer
> >>> (freeing memory that is otherwise unavailable in the process).
> >> your description suggests zones with all_unreclaimable set. but in this
> >> case sleeping_prematurely() will return false instead of true, kswapd
> >> will do sleep then. is there anything I missed?
> 
> Actually, I don't see where sleeping_prematurely() would return false
> if any zone has ->all_unreclaimable set.   In this case, the order was
> 0, so we return !all_zones_ok, which is false because
> !zone_watermark_ok_safe(ZONE_DMA32).
So ZONE_DMA32 doesn't have all_unreclaimable set, right? If all zones had
all_unreclaimable set, all_zones_ok would clearly be true. This means
kswapd can still reclaim some pages in the zone, which looks sane.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-14  4:36     ` Mike Waychison
@ 2011-12-14 12:20       ` Mel Gorman
  -1 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2011-12-14 12:20 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Shaohua Li, Andrew Morton, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Tue, Dec 13, 2011 at 08:36:43PM -0800, Mike Waychison wrote:
> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
> and swap-enabled.
> 

If this is 2.6.39, can you try applying the commit
[f06590bd: mm: vmscan: correctly check if reclaimer should schedule during shrink_slab]

There have been a few fixes around kswapd hogging the CPU since 2.6.39.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-14 12:20       ` Mel Gorman
@ 2011-12-14 15:37         ` Mike Waychison
  -1 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shaohua Li, Andrew Morton, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickens, Greg Thelen

On Wed, Dec 14, 2011 at 4:20 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Dec 13, 2011 at 08:36:43PM -0800, Mike Waychison wrote:
>> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
>> and swap-enabled.
>>
>
> If this is 2.6.39, can you try applying the commit
> [f06590bd: mm: vmscan: correctly check if reclaimer should schedule during shrink_slab]
>
> There have been a few fixes around kswapd hogging the CPU since 2.6.39.

In this particular case, I didn't see any problem acquiring
shrinker_rwsem (the shrinkers showed up in the cpu profile I
gathered).  I think this patch would fix my issue though, as it happens
to drop a cond_resched() into the path.  It isn't obvious that this
cond_resched() really belongs in shrink_slab() though.  Thanks :)
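
To illustrate why a single yield point is enough, here is a toy
userspace model of one non-preemptible CPU shared by kswapd and an
OOM-killed task.  It is only a sketch: struct sim, run_victim() and
run_kswapd() are hypothetical names, the page counts used with it are
made up, and calling run_victim() stands in for cond_resched() letting
the other task run.

```c
#include <assert.h>
#include <stdbool.h>

struct sim {
	long free_pages;      /* pages currently free */
	long high_watermark;  /* kswapd's goal (hypothetical value) */
	long victim_pages;    /* pages held by the OOM victim */
	bool victim_alive;    /* victim has not yet run its exit path */
};

/* The killed task only releases its memory once it gets CPU time. */
static void run_victim(struct sim *s)
{
	if (s->victim_alive) {
		s->free_pages += s->victim_pages;
		s->victim_pages = 0;
		s->victim_alive = false;
	}
}

/*
 * Model of the kswapd loop.  Returns the number of iterations needed to
 * reach the watermark, or -1 if max_spins passes without progress.
 */
static long run_kswapd(struct sim *s, bool yields, long max_spins)
{
	long spins;

	assert(max_spins > 0);
	for (spins = 0; spins < max_spins; spins++) {
		if (s->free_pages >= s->high_watermark)
			return spins;
		/* Reclaim itself frees nothing in this model. */
		if (yields)
			run_victim(s);  /* stand-in for cond_resched() */
	}
	return -1;
}
```

Without the yield, the victim never runs and the loop exhausts
max_spins.  With it, the victim exits on the first pass and kswapd
reaches its watermark on the next check.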

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
  2011-12-13 17:44 ` Mike Waychison
@ 2011-12-14 10:51   ` James Bottomley
  -1 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2011-12-14 10:51 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
	Johannes Weiner, linux-mm, linux-kernel, Hugh Dickens,
	Greg Thelen

On Tue, 2011-12-13 at 09:44 -0800, Mike Waychison wrote:
> On a single core system with kernel preemption disabled, it is possible
> for the memory system to be so taxed that kswapd cannot make any forward
> progress.  This can happen when most of system memory is tied up as
> anonymous memory without swap enabled, causing kswapd to consistently
> fail to achieve its watermark goals.  In turn, sleeping_prematurely()
> will consistently return true and kswapd_try_to_sleep() to never invoke
> schedule().  This causes the kswapd thread to stay on the CPU in
> perpetuity and keeps other threads from processing oom-kills to reclaim
> memory.
> 
> The cond_resched() instance in balance_pgdat() is never called as the
> loop that iterates from DEF_PRIORITY down to 0 will always set
> all_zones_ok to true, and not set it to false once we've passed
> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
> considered in the "all_zones_ok" evaluation.
> 
> This change modifies kswapd_try_to_sleep to ensure that we enter
> scheduler at least once per invocation if needed.  This allows kswapd to
> get off the CPU and allows other threads to die off from the OOM killer
> (freeing memory that is otherwise unavailable in the process).

This keeps cropping up.  I think it's not the same as the last time I
saw it (which was on a multi-core system) but it was definitely caused
by an issue with sleeping_prematurely().  For reference, this is the
thread:

http://marc.info/?t=130436700400001

And this was the eventual fix that worked for me:

http://marc.info/?t=130892304300003

James


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-12-15  0:53 UTC | newest]

Thread overview: 16+ messages
2011-12-13 17:44 [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel Mike Waychison
2011-12-13 17:44 ` Mike Waychison
2011-12-14  2:24 ` Shaohua Li
2011-12-14  2:24   ` Shaohua Li
2011-12-14  4:36   ` Mike Waychison
2011-12-14  4:36     ` Mike Waychison
2011-12-14  4:45     ` Mike Waychison
2011-12-14  4:45       ` Mike Waychison
2011-12-15  1:06       ` Shaohua Li
2011-12-15  1:06         ` Shaohua Li
2011-12-14 12:20     ` Mel Gorman
2011-12-14 12:20       ` Mel Gorman
2011-12-14 15:37       ` Mike Waychison
2011-12-14 15:37         ` Mike Waychison
2011-12-14 10:51 ` James Bottomley
2011-12-14 10:51   ` James Bottomley
