* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
@ 2011-12-14 4:36 ` Mike Waychison
0 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14 4:36 UTC (permalink / raw)
To: Shaohua Li
Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickens, Greg Thelen
On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
>> On a single core system with kernel preemption disabled, it is possible
>> for the memory system to be so taxed that kswapd cannot make any forward
>> progress. This can happen when most of system memory is tied up as
>> anonymous memory without swap enabled, causing kswapd to consistently
>> fail to achieve its watermark goals. In turn, sleeping_prematurely()
>> will consistently return true and kswapd_try_to_sleep() will never invoke
>> schedule(). This causes the kswapd thread to stay on the CPU in
>> perpetuity and keeps other threads from processing oom-kills to reclaim
>> memory.
>>
>> The cond_resched() instance in balance_pgdat() is never called as the
>> loop that iterates from DEF_PRIORITY down to 0 will always set
>> all_zones_ok to true, and not set it to false once we've passed
>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
>> considered in the "all_zones_ok" evaluation.
>>
>> This change modifies kswapd_try_to_sleep to ensure that we enter
>> scheduler at least once per invocation if needed. This allows kswapd to
>> get off the CPU and allows other threads to die off from the OOM killer
>> (freeing memory that is otherwise unavailable in the process).
> your description suggests zones with all_unreclaimable set. but in this
> case sleeping_prematurely() will return false instead of true, kswapd
> will do sleep then. is there anything I missed?
Debugging this, I didn't get a dump from oom-kill as it never ran
(until I binary patched in a cond_resched() into live hung machines --
this reproduced in a VM).
I was however able to capture the following data while it was hung:
/cloud/vmm/host/backend/perfmetric/node0/zone0/active_anon : long long = 773
/cloud/vmm/host/backend/perfmetric/node0/zone0/active_file : long long = 6
/cloud/vmm/host/backend/perfmetric/node0/zone0/anon_pages : long long = 1,329
/cloud/vmm/host/backend/perfmetric/node0/zone0/bounce : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/dirtied : long long = 4,425
/cloud/vmm/host/backend/perfmetric/node0/zone0/file_dirty : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/file_mapped : long long = 5
/cloud/vmm/host/backend/perfmetric/node0/zone0/file_pages : long long = 330
/cloud/vmm/host/backend/perfmetric/node0/zone0/free_pages : long long = 2,018
/cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_anon : long long = 865
/cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_file : long long = 13
/cloud/vmm/host/backend/perfmetric/node0/zone0/kernel_stack : long long = 10
/cloud/vmm/host/backend/perfmetric/node0/zone0/mlock : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/pagetable : long long = 74
/cloud/vmm/host/backend/perfmetric/node0/zone0/shmem : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/slab_reclaimable : long long = 54
/cloud/vmm/host/backend/perfmetric/node0/zone0/slab_unreclaimable : long long = 130
/cloud/vmm/host/backend/perfmetric/node0/zone0/unevictable : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/writeback : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone0/written : long long = 47,184
/cloud/vmm/host/backend/perfmetric/node0/zone1/active_anon : long long = 359,251
/cloud/vmm/host/backend/perfmetric/node0/zone1/active_file : long long = 67
/cloud/vmm/host/backend/perfmetric/node0/zone1/anon_pages : long long = 441,180
/cloud/vmm/host/backend/perfmetric/node0/zone1/bounce : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/dirtied : long long = 6,457,125
/cloud/vmm/host/backend/perfmetric/node0/zone1/file_dirty : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/file_mapped : long long = 134
/cloud/vmm/host/backend/perfmetric/node0/zone1/file_pages : long long = 38,090
/cloud/vmm/host/backend/perfmetric/node0/zone1/free_pages : long long = 1,630
/cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_anon : long long = 119,779
/cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_file : long long = 81
/cloud/vmm/host/backend/perfmetric/node0/zone1/kernel_stack : long long = 173
/cloud/vmm/host/backend/perfmetric/node0/zone1/mlock : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/pagetable : long long = 15,222
/cloud/vmm/host/backend/perfmetric/node0/zone1/shmem : long long = 1
/cloud/vmm/host/backend/perfmetric/node0/zone1/slab_reclaimable : long long = 1,677
/cloud/vmm/host/backend/perfmetric/node0/zone1/slab_unreclaimable : long long = 7,152
/cloud/vmm/host/backend/perfmetric/node0/zone1/unevictable : long long = 0
/cloud/vmm/host/backend/perfmetric/node0/zone1/writeback : long long = 8
/cloud/vmm/host/backend/perfmetric/node0/zone1/written : long long = 16,639,708
These values were static while the machine was hung up in kswapd. I
unfortunately don't have the low/min/max or lowmem watermarks handy.
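To put the zone1 counters above in perspective, here is a small sketch that adds up the anonymous and file-backed LRU counts and computes the file-backed share. It assumes the perfmetric values are page counts that mirror the kernel's per-zone counters; the struct and function names are made up for this illustration and are not kernel APIs.

```c
/* Counters copied from the zone1 dump above (assumed unit: pages). */
struct lru_counts {
    long long active_anon;
    long long inactive_anon;
    long long active_file;
    long long inactive_file;
};

/* Pages on the anonymous LRU lists. */
static long long anon_lru_pages(const struct lru_counts *c)
{
    return c->active_anon + c->inactive_anon;
}

/* Pages on the file-backed LRU lists. */
static long long file_lru_pages(const struct lru_counts *c)
{
    return c->active_file + c->inactive_file;
}

/* File-backed share of all LRU pages, in parts per million (integer math). */
static long long file_share_ppm(const struct lru_counts *c)
{
    long long total = anon_lru_pages(c) + file_lru_pages(c);

    return total ? file_lru_pages(c) * 1000000LL / total : 0;
}

/* zone1 values as reported while the machine was hung. */
static const struct lru_counts zone1 = { 359251, 119779, 67, 81 };
```

With these inputs the anonymous lists hold 479,030 pages and the file lists 148. If no swap space is usable, as the patch description assumes, almost nothing on the LRU is reclaimable.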
From stepping through with gdb, I was able to determine that
ZONE_DMA32 would fail zone_watermark_ok_safe(), causing a scan up to
end_zone == 1. If memory serves, it would not get the
->all_unreclaimable flag. I didn't get the chance to root cause this
internal inconsistency though.
FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
and swap-enabled.
If I get the chance, I can reproduce and look at this closer to try
and root cause why zone_reclaimable() would return true, but I won't
be able to do that until after the holidays -- sometime in January.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
2011-12-14 4:36 ` Mike Waychison
@ 2011-12-14 4:45 ` Mike Waychison
-1 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14 4:45 UTC (permalink / raw)
To: Shaohua Li
Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickens, Greg Thelen
On Tue, Dec 13, 2011 at 8:36 PM, Mike Waychison <mikew@google.com> wrote:
> On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
>> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
>>> On a single core system with kernel preemption disabled, it is possible
>>> for the memory system to be so taxed that kswapd cannot make any forward
>>> progress. This can happen when most of system memory is tied up as
>>> anonymous memory without swap enabled, causing kswapd to consistently
>>> fail to achieve its watermark goals. In turn, sleeping_prematurely()
>>> will consistently return true and kswapd_try_to_sleep() will never invoke
>>> schedule(). This causes the kswapd thread to stay on the CPU in
>>> perpetuity and keeps other threads from processing oom-kills to reclaim
>>> memory.
>>>
>>> The cond_resched() instance in balance_pgdat() is never called as the
>>> loop that iterates from DEF_PRIORITY down to 0 will always set
>>> all_zones_ok to true, and not set it to false once we've passed
>>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
>>> considered in the "all_zones_ok" evaluation.
>>>
>>> This change modifies kswapd_try_to_sleep to ensure that we enter
>>> scheduler at least once per invocation if needed. This allows kswapd to
>>> get off the CPU and allows other threads to die off from the OOM killer
>>> (freeing memory that is otherwise unavailable in the process).
>> your description suggests zones with all_unreclaimable set. but in this
>> case sleeping_prematurely() will return false instead of true, kswapd
>> will do sleep then. is there anything I missed?
Actually, I don't see where sleeping_prematurely() would return false
if any zone has ->all_unreclaimable set. In this case, the order was
0, so we return !all_zones_ok. all_zones_ok is false because ZONE_DMA32
fails zone_watermark_ok_safe(), so sleeping_prematurely() returns true.
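For readers following the order-0 argument, here is a userspace sketch of the decision as it is described in this thread. It is a simplified model, not the kernel source: struct zone_model and sleeping_prematurely_model() are hypothetical names, and the watermark_ok flag stands in for the result of zone_watermark_ok_safe().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-in for struct zone. */
struct zone_model {
    bool populated;          /* zone contains any pages at all */
    bool all_unreclaimable;  /* reclaim has given up on this zone */
    bool watermark_ok;       /* stand-in for zone_watermark_ok_safe() */
};

/*
 * Order-0 path only. Returns true when kswapd should NOT go to sleep,
 * i.e. when some zone that is still considered reclaimable is below
 * its watermark.
 */
static bool sleeping_prematurely_model(const struct zone_model *zones,
                                       size_t nr_zones)
{
    bool all_zones_ok = true;

    for (size_t i = 0; i < nr_zones; i++) {
        if (!zones[i].populated)
            continue;
        /* Zones flagged all_unreclaimable are treated as balanced. */
        if (zones[i].all_unreclaimable)
            continue;
        if (!zones[i].watermark_ok)
            all_zones_ok = false;
    }

    return !all_zones_ok;
}
```

In this model a zone that fails its watermark without being flagged all_unreclaimable, which is what was observed for ZONE_DMA32, makes the function return true. Flagging that same zone makes it return false, which is the case Shaohua describes.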
>
> Debugging this, I didn't get a dump from oom-kill as it never ran
> (until I binary patched in a cond_resched() into live hung machines --
> this reproduced in a VM).
>
> I was however able to capture the following data while it was hung:
>
>
> /cloud/vmm/host/backend/perfmetric/node0/zone0/active_anon : long long = 773
> /cloud/vmm/host/backend/perfmetric/node0/zone0/active_file : long long = 6
> /cloud/vmm/host/backend/perfmetric/node0/zone0/anon_pages : long long = 1,329
> /cloud/vmm/host/backend/perfmetric/node0/zone0/bounce : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/dirtied : long long = 4,425
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_dirty : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_mapped : long long = 5
> /cloud/vmm/host/backend/perfmetric/node0/zone0/file_pages : long long = 330
> /cloud/vmm/host/backend/perfmetric/node0/zone0/free_pages : long long = 2,018
> /cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_anon : long long = 865
> /cloud/vmm/host/backend/perfmetric/node0/zone0/inactive_file : long long = 13
> /cloud/vmm/host/backend/perfmetric/node0/zone0/kernel_stack : long long = 10
> /cloud/vmm/host/backend/perfmetric/node0/zone0/mlock : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/pagetable : long long = 74
> /cloud/vmm/host/backend/perfmetric/node0/zone0/shmem : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/slab_reclaimable : long long = 54
> /cloud/vmm/host/backend/perfmetric/node0/zone0/slab_unreclaimable : long long = 130
> /cloud/vmm/host/backend/perfmetric/node0/zone0/unevictable : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/writeback : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone0/written : long long = 47,184
>
> /cloud/vmm/host/backend/perfmetric/node0/zone1/active_anon : long long = 359,251
> /cloud/vmm/host/backend/perfmetric/node0/zone1/active_file : long long = 67
> /cloud/vmm/host/backend/perfmetric/node0/zone1/anon_pages : long long = 441,180
> /cloud/vmm/host/backend/perfmetric/node0/zone1/bounce : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/dirtied : long long = 6,457,125
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_dirty : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_mapped : long long = 134
> /cloud/vmm/host/backend/perfmetric/node0/zone1/file_pages : long long = 38,090
> /cloud/vmm/host/backend/perfmetric/node0/zone1/free_pages : long long = 1,630
> /cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_anon : long long = 119,779
> /cloud/vmm/host/backend/perfmetric/node0/zone1/inactive_file : long long = 81
> /cloud/vmm/host/backend/perfmetric/node0/zone1/kernel_stack : long long = 173
> /cloud/vmm/host/backend/perfmetric/node0/zone1/mlock : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/pagetable : long long = 15,222
> /cloud/vmm/host/backend/perfmetric/node0/zone1/shmem : long long = 1
> /cloud/vmm/host/backend/perfmetric/node0/zone1/slab_reclaimable : long long = 1,677
> /cloud/vmm/host/backend/perfmetric/node0/zone1/slab_unreclaimable : long long = 7,152
> /cloud/vmm/host/backend/perfmetric/node0/zone1/unevictable : long long = 0
> /cloud/vmm/host/backend/perfmetric/node0/zone1/writeback : long long = 8
> /cloud/vmm/host/backend/perfmetric/node0/zone1/written : long long = 16,639,708
>
> These values were static while the machine was hung up in kswapd. I
> unfortunately don't have the low/min/max or lowmem watermarks handy.
>
> From stepping through with gdb, I was able to determine that
> ZONE_DMA32 would fail zone_watermark_ok_safe(), causing a scan up to
> end_zone == 1. If memory serves, it would not get the
> ->all_unreclaimable flag. I didn't get the chance to root cause this
> internal inconsistency though.
>
> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
> and swap-enabled.
>
> If I get the chance, I can reproduce and look at this closer to try
> and root cause why zone_reclaimable() would return true, but I won't
> be able to do that until after the holidays -- sometime in January.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
2011-12-14 4:45 ` Mike Waychison
@ 2011-12-15 1:06 ` Shaohua Li
-1 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2011-12-15 1:06 UTC (permalink / raw)
To: Mike Waychison
Cc: Andrew Morton, Mel Gorman, Minchan Kim, KAMEZAWA Hiroyuki,
Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickens, Greg Thelen
On Wed, 2011-12-14 at 12:45 +0800, Mike Waychison wrote:
> On Tue, Dec 13, 2011 at 8:36 PM, Mike Waychison <mikew@google.com> wrote:
> > On Tue, Dec 13, 2011 at 6:24 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> On Wed, 2011-12-14 at 01:44 +0800, Mike Waychison wrote:
> >>> On a single core system with kernel preemption disabled, it is possible
> >>> for the memory system to be so taxed that kswapd cannot make any forward
> >>> progress. This can happen when most of system memory is tied up as
> >>> anonymous memory without swap enabled, causing kswapd to consistently
> >>> fail to achieve its watermark goals. In turn, sleeping_prematurely()
> >>> will consistently return true and kswapd_try_to_sleep() will never invoke
> >>> schedule(). This causes the kswapd thread to stay on the CPU in
> >>> perpetuity and keeps other threads from processing oom-kills to reclaim
> >>> memory.
> >>>
> >>> The cond_resched() instance in balance_pgdat() is never called as the
> >>> loop that iterates from DEF_PRIORITY down to 0 will always set
> >>> all_zones_ok to true, and not set it to false once we've passed
> >>> DEF_PRIORITY as zones that are marked ->all_unreclaimable are not
> >>> considered in the "all_zones_ok" evaluation.
> >>>
> >>> This change modifies kswapd_try_to_sleep to ensure that we enter
> >>> scheduler at least once per invocation if needed. This allows kswapd to
> >>> get off the CPU and allows other threads to die off from the OOM killer
> >>> (freeing memory that is otherwise unavailable in the process).
> >> your description suggests zones with all_unreclaimable set. but in this
> >> case sleeping_prematurely() will return false instead of true, kswapd
> >> will do sleep then. is there anything I missed?
>
> Actually, I don't see where sleeping_prematurely() would return false
> if any zone has ->all_unreclaimable set. In this case, the order was
> 0, so we return !all_zones_ok, which is false because
> !zone_watermark_ok_safe(ZONE_DMA32).
So ZONE_DMA32 doesn't have all_unreclaimable set, right? If all zones had
all_unreclaimable set, all_zones_ok would clearly be true. This means kswapd
can reclaim some pages in the zone, which looks sane.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
2011-12-14 4:36 ` Mike Waychison
@ 2011-12-14 12:20 ` Mel Gorman
-1 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2011-12-14 12:20 UTC (permalink / raw)
To: Mike Waychison
Cc: Shaohua Li, Andrew Morton, Minchan Kim, KAMEZAWA Hiroyuki,
Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickens, Greg Thelen
On Tue, Dec 13, 2011 at 08:36:43PM -0800, Mike Waychison wrote:
> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
> and swap-enabled.
>
If this is 2.6.39, can you try applying the commit
[f06590bd: mm: vmscan: correctly check if reclaimer should schedule during shrink_slab]
There have been a few fixes around kswapd hogging the CPU since 2.6.39.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: Fix kswapd livelock on single core, no preempt kernel
2011-12-14 12:20 ` Mel Gorman
@ 2011-12-14 15:37 ` Mike Waychison
-1 siblings, 0 replies; 16+ messages in thread
From: Mike Waychison @ 2011-12-14 15:37 UTC (permalink / raw)
To: Mel Gorman
Cc: Shaohua Li, Andrew Morton, Minchan Kim, KAMEZAWA Hiroyuki,
Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickens, Greg Thelen
On Wed, Dec 14, 2011 at 4:20 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Dec 13, 2011 at 08:36:43PM -0800, Mike Waychison wrote:
>> FYI, this was seen with a 2.6.39-based kernel with no-numa, no-memcg
>> and swap-enabled.
>>
>
> If this is 2.6.39, can you try applying the commit
> [f06590bd: mm: vmscan: correctly check if reclaimer should schedule during shrink_slab]
>
> There have been a few fixes around kswapd hogging the CPU since 2.6.39.
In this particular case, I didn't see any problem acquiring
shrinker_rwsem (the shrinkers showed up in the cpu profile I
gathered). I think this patch would fix my issue though as it happens
to drop in a cond_resched() into the path. It isn't obvious that this
cond_resched() really belongs in shrink_slab() though. Thanks :)
>
> --
> Mel Gorman
> SUSE Labs
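The effect of a voluntary reschedule point on a single-core, non-preemptible system can be illustrated with a toy userspace model. Everything below is hypothetical and heavily simplified: run_other_tasks() plays the role that cond_resched() plays in the kernel, and the OOM victim is reduced to a flag plus a page count. It is a sketch of the idea only, not of how vmscan is implemented.

```c
#include <stdbool.h>

/* Toy state for a single-CPU system without kernel preemption. */
struct toy_system {
    long free_pages;        /* pages currently free */
    long high_watermark;    /* level the reclaim loop is trying to reach */
    bool oom_kill_pending;  /* a victim is waiting for CPU time to exit */
    long victim_pages;      /* pages released once the victim exits */
};

/* Runs only when the busy task gives up the CPU: the victim exits. */
static void run_other_tasks(struct toy_system *sys)
{
    if (sys->oom_kill_pending) {
        sys->free_pages += sys->victim_pages;
        sys->oom_kill_pending = false;
    }
}

/*
 * Models a reclaim loop that cannot free anything by itself.
 * Returns the pass on which the watermark was met, or -1 if
 * max_passes ran out first (the stand-in for a livelock).
 */
static long kswapd_model(struct toy_system *sys, bool yield_each_pass,
                         long max_passes)
{
    for (long pass = 1; pass <= max_passes; pass++) {
        if (yield_each_pass)
            run_other_tasks(sys);  /* the voluntary reschedule point */
        if (sys->free_pages >= sys->high_watermark)
            return pass;
    }
    return -1;
}
```

Without the yield the loop never sees more free memory, however many passes it is allowed. With it, the victim exits on the first pass and the watermark is met. That is why any cond_resched() on the path, wherever it is placed, is enough to break the hang.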
^ permalink raw reply [flat|nested] 16+ messages in thread