* [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-22 11:04 ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
@ 2011-03-22 11:05 ` KOSAKI Motohiro
2011-03-22 14:49 ` Minchan Kim
2011-03-23 7:41 ` KAMEZAWA Hiroyuki
2011-03-22 11:06 ` [PATCH 2/5] Revert "oom: give the dying task a higher priority" KOSAKI Motohiro
` (3 subsequent siblings)
4 siblings, 2 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:05 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim, Johannes Weiner
The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.
2006 Sep 25; commit 408d8544; oom: use unreclaimable info
It has since been through a strange history. First, the following
commit broke the logic unintentionally.
2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
costly-order allocations
Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.
2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
return value when priority==0
However, the logic didn't work when a 32-bit highmem system went into
hibernation, so Minchan slightly changed the algorithm and fixed it.
2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
in direct reclaim path
But recently, Andrey Vagin found a new corner case. Look:
struct zone {
..
int all_unreclaimable;
..
unsigned long pages_scanned;
..
}
zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can end up in a
state where zone->pages_scanned=0 and zone->all_unreclaimable=1. In
this case, the current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.
Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone and it easily becomes zone->all_unreclaimable=1. And
once it becomes all_unreclaimable, it never returns to
all_unreclaimable=0, because it typically has no reclaimable pages.
Eventually, the oom-killer never works on such systems. Let's remove
this problematic logic completely.
Reported-by: Andrey Vagin <avagin@openvz.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 36 +-----------------------------------
1 files changed, 1 insertions(+), 35 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..254aada 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
}
/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
-static bool all_unreclaimable(struct zonelist *zonelist,
- struct scan_control *sc)
-{
- struct zoneref *z;
- struct zone *zone;
- bool all_unreclaimable = true;
-
- for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
- if (!populated_zone(zone))
- continue;
- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
- continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
- }
-
- return all_unreclaimable;
-}
-
-/*
* This is the main entry point to direct page reclaim.
*
* If a full scan of the inactive list fails to free enough memory then we
@@ -2105,14 +2078,7 @@ out:
delayacct_freepages_end();
put_mems_allowed();
- if (sc->nr_reclaimed)
- return sc->nr_reclaimed;
-
- /* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
- return 1;
-
- return 0;
+ return sc->nr_reclaimed;
}
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
--
1.6.5.2
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-22 11:05 ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
@ 2011-03-22 14:49 ` Minchan Kim
2011-03-23 5:21 ` KOSAKI Motohiro
2011-03-23 7:41 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-22 14:49 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
Hi Kosaki,
On Tue, Mar 22, 2011 at 08:05:55PM +0900, KOSAKI Motohiro wrote:
> all_unreclaimable check in direct reclaim has been introduced at 2.6.19
> by following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through strange history. firstly, following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I've found obvious meaningless code fragment and
> restored original intention by following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But, the logic didn't works when 32bit highmem system goes hibernation
> and Minchan slightly changed the algorithm and fixed it .
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But, recently, Andrey Vagin found the new corner case. Look,
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> variables nor protected by lock. Therefore a zone can become a state
> of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
Possible although it's very rare.
> current all_unreclaimable() return false even though
> zone->all_unreclaimabe=1.
The case is very rare since we reset zone->all_unreclaimable to zero
right before resetting zone->pages_scanned to zero.
But I admit it's possible.
CPU 0                           CPU 1
free_pcppages_bulk              balance_pgdat
zone->all_unreclaimable = 0
                                zone->all_unreclaimable = 1
zone->pages_scanned = 0
>
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it becase all_unreclaimable, it never return all_unreclaimable=0
^^^^^ it's very important verb. ^^^^^ return? reset?
I can't understand your point due to the typo. Please correct the typo.
> beucase it typicall don't have reclaimable pages.
If the DMA zone has very few or zero reclaimable pages,
zone_reclaimable() can easily return false, so all_unreclaimable() could
return true. Eventually the oom-killer might work.
In my test I saw the livelock too, so apparently we have a problem.
I couldn't dig into it recently because of other urgent work.
I think you know the root cause, but the description in this patch isn't
enough to persuade me.
Could you explain the root cause in detail?
>
> Eventually, oom-killer never works on such systems. Let's remove
> this problematic logic completely.
>
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> mm/vmscan.c | 36 +-----------------------------------
> 1 files changed, 1 insertions(+), 35 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..254aada 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
> }
>
> /*
> - * As hibernation is going on, kswapd is freezed so that it can't mark
> - * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> - * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> - */
> -static bool all_unreclaimable(struct zonelist *zonelist,
> - struct scan_control *sc)
> -{
> - struct zoneref *z;
> - struct zone *zone;
> - bool all_unreclaimable = true;
> -
> - for_each_zone_zonelist_nodemask(zone, z, zonelist,
> - gfp_zone(sc->gfp_mask), sc->nodemask) {
> - if (!populated_zone(zone))
> - continue;
> - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> - continue;
> - if (zone_reclaimable(zone)) {
> - all_unreclaimable = false;
> - break;
> - }
> - }
> -
> - return all_unreclaimable;
> -}
> -
> -/*
> * This is the main entry point to direct page reclaim.
> *
> * If a full scan of the inactive list fails to free enough memory then we
> @@ -2105,14 +2078,7 @@ out:
> delayacct_freepages_end();
> put_mems_allowed();
>
> - if (sc->nr_reclaimed)
> - return sc->nr_reclaimed;
> -
> - /* top priority shrink_zones still had more to do? don't OOM, then */
> - if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> - return 1;
> -
> - return 0;
> + return sc->nr_reclaimed;
> }
>
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> --
> 1.6.5.2
>
>
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-22 14:49 ` Minchan Kim
@ 2011-03-23 5:21 ` KOSAKI Motohiro
2011-03-23 6:59 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23 5:21 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
Hi Minchan,
> > zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> > variables nor protected by lock. Therefore a zone can become a state
> > of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
>
> Possible although it's very rare.
Can you test Andrey's case yourself on an x86 box? It seems
reproducible.
> > current all_unreclaimable() return false even though
> > zone->all_unreclaimabe=1.
>
> The case is very rare since we reset zone->all_unreclaimabe to zero
> right before resetting zone->page_scanned to zero.
> But I admit it's possible.
Please apply this patch and run the oom-killer. You may see the
following pages_scanned:0 and all_unreclaimable:yes combination, like
below (but you may need >30 min).
Node 0 DMA free:4024kB min:40kB low:48kB high:60kB active_anon:11804kB
inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:68kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>
> CPU 0 CPU 1
> free_pcppages_bulk balance_pgdat
> zone->all_unreclaimabe = 0
> zone->all_unreclaimabe = 1
> zone->pages_scanned = 0
> >
> > Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> > small dma zone and it become zone->all_unreclamble=1 easily. and
> > if it becase all_unreclaimable, it never return all_unreclaimable=0
> ^^^^^ it's very important verb. ^^^^^ return? reset?
>
> I can't understand your point due to the typo. Please correct the typo.
>
> > beucase it typicall don't have reclaimable pages.
>
> If DMA zone have very small reclaimable pages or zero reclaimable pages,
> zone_reclaimable() can return false easily so all_unreclaimable() could return
> true. Eventually oom-killer might works.
The point is, vmscan has the following all_unreclaimable check in several places.
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
But if the zone has only a few LRU pages, get_scan_count(DEF_PRIORITY)
returns a {0, 0, 0, 0} array. That means the zone will never scan LRU
pages again, so the falsely small pages_scanned can't be corrected.
Then the false-negative all_unreclaimable() can't be corrected either.
Btw, why does get_scan_count() return 0 instead of 1? Why don't we
round up? The git log says it is intentional.
commit e0f79b8f1f3394bb344b7b83d6f121ac2af327de
Author: Johannes Weiner <hannes@saeurebad.de>
Date: Sat Oct 18 20:26:55 2008 -0700
vmscan: don't accumulate scan pressure on unrelated lists
>
> In my test, I saw the livelock, too so apparently we have a problem.
> I couldn't dig in it recently by another urgent my work.
> I think you know root cause but the description in this patch isn't enough
> for me to be persuaded.
>
> Could you explain the root cause in detail?
If you have another idea for a fix, please let me know. :)
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 5:21 ` KOSAKI Motohiro
@ 2011-03-23 6:59 ` Minchan Kim
2011-03-23 7:13 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-23 6:59 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Wed, Mar 23, 2011 at 2:21 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Minchan,
>
>> > zone->all_unreclaimable and zone->pages_scanned are neigher atomic
>> > variables nor protected by lock. Therefore a zone can become a state
>> > of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
>>
>> Possible although it's very rare.
>
> Can you test by yourself andrey's case on x86 box? It seems
> reprodusable.
>
>> > current all_unreclaimable() return false even though
>> > zone->all_unreclaimabe=1.
>>
>> The case is very rare since we reset zone->all_unreclaimabe to zero
>> right before resetting zone->page_scanned to zero.
>> But I admit it's possible.
>
> Please apply this patch and run oom-killer. You may see following
> pages_scanned:0 and all_unreclaimable:yes combination. likes below.
> (but you may need >30min)
>
> Node 0 DMA free:4024kB min:40kB low:48kB high:60kB active_anon:11804kB
> inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:68kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>
>
>>
>> CPU 0 CPU 1
>> free_pcppages_bulk balance_pgdat
>> zone->all_unreclaimabe = 0
>> zone->all_unreclaimabe = 1
>> zone->pages_scanned = 0
>> >
>> > Is this ignorable minor issue? No. Unfortunatelly, x86 has very
>> > small dma zone and it become zone->all_unreclamble=1 easily. and
>> > if it becase all_unreclaimable, it never return all_unreclaimable=0
>> ^^^^^ it's very important verb. ^^^^^ return? reset?
>>
>> I can't understand your point due to the typo. Please correct the typo.
>>
>> > beucase it typicall don't have reclaimable pages.
>>
>> If DMA zone have very small reclaimable pages or zero reclaimable pages,
>> zone_reclaimable() can return false easily so all_unreclaimable() could return
>> true. Eventually oom-killer might works.
>
> The point is, vmscan has following all_unreclaimable check in several place.
>
> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> continue;
>
> But, if the zone has only a few lru pages, get_scan_count(DEF_PRIORITY) return
> {0, 0, 0, 0} array. It mean zone will never scan lru pages anymore. therefore
> false negative smaller pages_scanned can't be corrected.
>
> Then, false negative all_unreclaimable() also can't be corrected.
>
>
> btw, Why get_scan_count() return 0 instead 1? Why don't we round up?
> Git log says it is intentionally.
>
> commit e0f79b8f1f3394bb344b7b83d6f121ac2af327de
> Author: Johannes Weiner <hannes@saeurebad.de>
> Date: Sat Oct 18 20:26:55 2008 -0700
>
> vmscan: don't accumulate scan pressure on unrelated lists
>
>>
>> In my test, I saw the livelock, too so apparently we have a problem.
>> I couldn't dig in it recently by another urgent my work.
>> I think you know root cause but the description in this patch isn't enough
>> for me to be persuaded.
>>
>> Could you explain the root cause in detail?
>
> If you have an another fixing idea, please let me know. :)
>
>
>
>
Okay. I got it.
The problem is as follows.
Due to the race between free_pcppages_bulk and balance_pgdat, it is
possible to get zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
The DMA zone has few LRU pages, and under no-swap, high-memory-pressure
conditions there could be just a single page on the inactive file list,
as in your example. (Anon LRU pages aren't important on a non-swap
system.)
In such a case, shrink_zones doesn't scan the page at all until priority
becomes 0, since get_scan_count does scan >>= priority (which is mostly
zero).
And even once priority becomes 0, nr_scan_try_batch returns zero until
the saved pages reach 32. So to scan the page we need at least 32
iterations of priority 12..0. If the system has a fork bomb, it is
almost a livelock.
If this is right, how about this?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..34983e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1973,6 +1973,9 @@ static void shrink_zones(int priority, struct
zonelist *zonelist,
static bool zone_reclaimable(struct zone *zone)
{
+ if (zone->all_unreclaimable)
+ return false;
+
return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 6:59 ` Minchan Kim
@ 2011-03-23 7:13 ` KOSAKI Motohiro
2011-03-23 8:24 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23 7:13 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> Okay. I got it.
>
> The problem is following as.
> By the race the free_pcppages_bulk and balance_pgdat, it is possible
> zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
> DMA zone have few LRU pages and in case of no-swap and big memory
> pressure, there could be a just a page in inactive file list like your
> example. (anon lru pages isn't important in case of non-swap system)
> In such case, shrink_zones doesn't scan the page at all until priority
> become 0 as get_scan_count does scan >>= priority(it's mostly zero).
Nope.
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
These two lines mean that all_unreclaimable prevents priority 0 reclaim.
> And although priority become 0, nr_scan_try_batch returns zero until
> saved pages become 32. So for scanning the page, at least, we need 32
> times iteration of priority 12..0. If system has fork-bomb, it is
> almost livelock.
Therefore, 1000 calls to get_scan_count(DEF_PRIORITY) are 1000 no-ops.
>
> If is is right, how about this?
Boo.
You seem to have forgotten why you introduced the current
all_unreclaimable() function. During hibernation, we can't trust
all_unreclaimable.
That's why I proposed the following patch when you introduced
all_unreclaimable().
---
mm/vmscan.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..1919d8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/oom.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1931,7 +1932,7 @@ out:
return sc->nr_reclaimed;
/* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable)
+ if (scanning_global_lru(sc) && !all_unreclaimable && !oom_killer_disabled)
return 1;
return 0;
--
1.6.5.2
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 7:13 ` KOSAKI Motohiro
@ 2011-03-23 8:24 ` Minchan Kim
2011-03-23 8:44 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-23 8:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Wed, Mar 23, 2011 at 04:13:21PM +0900, KOSAKI Motohiro wrote:
> > Okay. I got it.
> >
> > The problem is following as.
> > By the race the free_pcppages_bulk and balance_pgdat, it is possible
> > zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
> > DMA zone have few LRU pages and in case of no-swap and big memory
> > pressure, there could be a just a page in inactive file list like your
> > example. (anon lru pages isn't important in case of non-swap system)
> > In such case, shrink_zones doesn't scan the page at all until priority
> > become 0 as get_scan_count does scan >>= priority(it's mostly zero).
>
> Nope.
>
> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> continue;
>
> This tow lines mean, all_unreclaimable prevent priority 0 reclaim.
>
Yes. I missed it. Thanks.
>
> > And although priority become 0, nr_scan_try_batch returns zero until
> > saved pages become 32. So for scanning the page, at least, we need 32
> > times iteration of priority 12..0. If system has fork-bomb, it is
> > almost livelock.
>
> Therefore, 1000 times get_scan_count(DEF_PRIORITY) takes 1000 times no-op.
>
> >
> > If is is right, how about this?
>
> Boo.
> You seems forgot why you introduced current all_unreclaimable() function.
> While hibernation, we can't trust all_unreclaimable.
Hmm. AFAIR, the reason we added all_unreclaimable is that while
hibernation is going on, kswapd is frozen, so it can't mark the zone
all_unreclaimable.
So I think hibernation can't be a problem.
Am I missing something?
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 8:24 ` Minchan Kim
@ 2011-03-23 8:44 ` KOSAKI Motohiro
2011-03-23 9:02 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23 8:44 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> > Boo.
> > You seems forgot why you introduced current all_unreclaimable() function.
> > While hibernation, we can't trust all_unreclaimable.
>
> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
> kswapd is freezed so it can't mark the zone->all_unreclaimable.
> So I think hibernation can't be a problem.
> Am I miss something?
Ahh, I missed that. Thanks for correcting me. Okay, I now see how both
mine and yours work.
Can you please explain why you prefer yours to mine?
Btw, yours is very similar to Andrey's initial patch. If yours is
better, I'd like to ack it crediting Andrey instead.
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 8:44 ` KOSAKI Motohiro
@ 2011-03-23 9:02 ` Minchan Kim
2011-03-24 2:11 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-23 9:02 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Boo.
>> > You seems forgot why you introduced current all_unreclaimable() function.
>> > While hibernation, we can't trust all_unreclaimable.
>>
>> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
>> kswapd is freezed so it can't mark the zone->all_unreclaimable.
>> So I think hibernation can't be a problem.
>> Am I miss something?
>
> Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
> Can you please explain why do you like your one than mine?
Just _simple_ :)
I don't want to change many lines when we can do it simply and very clearly.
>
> btw, Your one is very similar andrey's initial patch. If your one is
> better, I'd like to ack with andrey instead.
When Andrey sent his patch, I thought zone_reclaimable() was the right
place for the check, rather than outside zone_reclaimable. The reason I
didn't ack it is that Andrey couldn't explain the root cause; you did,
so you persuaded me.
I don't mind if Andrey moves the check into zone_reclaimable and
resends, or if I resend with a concrete description.
Anyway, the most important thing is a good description that shows the
root cause. That applies to your patch, too.
You should have written down the root cause in the description.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 9:02 ` Minchan Kim
@ 2011-03-24 2:11 ` KOSAKI Motohiro
2011-03-24 2:21 ` Andrew Morton
2011-03-24 4:19 ` Minchan Kim
0 siblings, 2 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 2:11 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> > Boo.
> >> > You seems forgot why you introduced current all_unreclaimable() function.
> >> > While hibernation, we can't trust all_unreclaimable.
> >>
> >> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
> >> kswapd is freezed so it can't mark the zone->all_unreclaimable.
> >> So I think hibernation can't be a problem.
> >> Am I miss something?
> >
> > Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
> > Can you please explain why do you like your one than mine?
>
> Just _simple_ :)
> I don't want to change many lines although we can do it simple and very clear.
>
> >
> > btw, Your one is very similar andrey's initial patch. If your one is
> > better, I'd like to ack with andrey instead.
>
> When Andrey sent a patch, I though this as zone_reclaimable() is right
> place to check it than out of zone_reclaimable. Why I didn't ack is
> that Andrey can't explain root cause but you did so you persuade me.
>
> I don't mind if Andrey move the check in zone_reclaimable and resend
> or I resend with concrete description.
>
> Anyway, most important thing is good description to show the root cause.
> It is applied to your patch, too.
> You should have written down root cause in description.
Honestly, I really dislike mixing zone->pages_scanned and
zone->all_unreclaimable, because I think it's not simple. I don't
think it's in good taste, nor easy to review. If even you, a VM
expert, didn't understand this issue at once, that's a smell of
overly messy code.
Therefore, I prefer to either 1) just remove the function, or
2) check only zone->all_unreclaimable and oom_killer_disabled
instead of zone->pages_scanned.
And I agree that I need to rewrite the description.
How's this?
==================================================
From 216bcf3fb0476b453080debf8999c74c635ed72f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sun, 8 May 2011 17:39:44 +0900
Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.
2006 Sep 25; commit 408d8544; oom: use unreclaimable info
It has since been through a strange history. First, the following
commit broke the logic unintentionally.
2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
costly-order allocations
Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.
2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
return value when priority==0
However, the logic didn't work when a 32-bit highmem system went into
hibernation, so Minchan slightly changed the algorithm and fixed it.
2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
in direct reclaim path
But recently, Andrey Vagin found a new corner case. Look:
struct zone {
..
int all_unreclaimable;
..
unsigned long pages_scanned;
..
}
zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock. Therefore zones can become a state
of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
current all_unreclaimable() return false even though
zone->all_unreclaimabe=1.
Is this ignorable minor issue? No. Unfortunatelly, x86 has very
small dma zone and it become zone->all_unreclamble=1 easily. and
if it become all_unreclaimable=1, it never restore all_unreclaimable=0.
Why? if all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0. that mean no page scan
at all!
Eventually, the oom-killer never works on such systems. Let's remove
this problematic logic completely.
Reported-by: Andrey Vagin <avagin@openvz.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 36 +-----------------------------------
1 files changed, 1 insertions(+), 35 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..254aada 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
}
/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
-static bool all_unreclaimable(struct zonelist *zonelist,
- struct scan_control *sc)
-{
- struct zoneref *z;
- struct zone *zone;
- bool all_unreclaimable = true;
-
- for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
- if (!populated_zone(zone))
- continue;
- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
- continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
- }
-
- return all_unreclaimable;
-}
-
-/*
* This is the main entry point to direct page reclaim.
*
* If a full scan of the inactive list fails to free enough memory then we
@@ -2105,14 +2078,7 @@ out:
delayacct_freepages_end();
put_mems_allowed();
- if (sc->nr_reclaimed)
- return sc->nr_reclaimed;
-
- /* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
- return 1;
-
- return 0;
+ return sc->nr_reclaimed;
}
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
--
1.6.5.2
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 2:11 ` KOSAKI Motohiro
@ 2011-03-24 2:21 ` Andrew Morton
2011-03-24 2:48 ` KOSAKI Motohiro
2011-03-24 4:19 ` Minchan Kim
1 sibling, 1 reply; 59+ messages in thread
From: Andrew Morton @ 2011-03-24 2:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Minchan Kim, linux-kernel, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
zone.all_unreclaimable is there to prevent reclaim from wasting CPU
cycles scanning a zone which has no reclaimable pages. When originally
implemented it did this very well.
That you guys keep breaking it, or don't feel like improving it is not a
reason to remove it!
If the code is unneeded and the kernel now reliably solves this problem
by other means then this should have been fully explained in the
changelog, but it was not even mentioned.
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 2:21 ` Andrew Morton
@ 2011-03-24 2:48 ` KOSAKI Motohiro
2011-03-24 3:04 ` Andrew Morton
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 2:48 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Minchan Kim, linux-kernel, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
>
> zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> cycles scanning a zone which has no reclaimable pages. When originally
> implemented it did this very well.
>
> That you guys keep breaking it, or don't feel like improving it is not a
> reason to remove it!
>
> If the code is unneeded and the kernel now reliably solves this problem
> by other means then this should have been fully explained in the
> changelog, but it was not even mentioned.
The changelog says the logic was removed in 2008, three years ago,
even though unintentionally. Minchan and I then tried to resurrect
the broken logic and resurrected a bug in it too. After that, we
discussed whether it should die or live.
Which part is hard for you to understand?
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 2:48 ` KOSAKI Motohiro
@ 2011-03-24 3:04 ` Andrew Morton
2011-03-24 5:35 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Andrew Morton @ 2011-03-24 3:04 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Minchan Kim, linux-kernel, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, 24 Mar 2011 11:48:19 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> > > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> >
> > zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> > cycles scanning a zone which has no reclaimable pages. When originally
> > implemented it did this very well.
> >
> > That you guys keep breaking it, or don't feel like improving it is not a
> > reason to remove it!
> >
> > If the code is unneeded and the kernel now reliably solves this problem
> > by other means then this should have been fully explained in the
> > changelog, but it was not even mentioned.
>
> The changelog says, the logic was removed at 2008. three years ago.
> even though it's unintentionally. and I and minchan tried to resurrect
> the broken logic and resurrected a bug in the logic too. then, we
> are discussed it should die or alive.
>
> Which part is hard to understand for you?
>
The part which isn't there: how does the kernel now address the problem
which that code fixed?
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 3:04 ` Andrew Morton
@ 2011-03-24 5:35 ` KOSAKI Motohiro
0 siblings, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 5:35 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Minchan Kim, linux-kernel, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> On Thu, 24 Mar 2011 11:48:19 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > > On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > >
> > > > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> > >
> > > zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> > > cycles scanning a zone which has no reclaimable pages. When originally
> > > implemented it did this very well.
> > >
> > > That you guys keep breaking it, or don't feel like improving it is not a
> > > reason to remove it!
> > >
> > > If the code is unneeded and the kernel now reliably solves this problem
> > > by other means then this should have been fully explained in the
> > > changelog, but it was not even mentioned.
> >
> > The changelog says, the logic was removed at 2008. three years ago.
> > even though it's unintentionally. and I and minchan tried to resurrect
> > the broken logic and resurrected a bug in the logic too. then, we
> > are discussed it should die or alive.
> >
> > Which part is hard to understand for you?
>
> The part which isn't there: how does the kernel now address the problem
> which that code fixed?
Ah, got it.
The history says the problem hasn't occurred for three years. Thus I
meant
past: code exists, but broken and hasn't worked for three years.
new: code removed.
What's the difference? But Minchan's last mail pointed out that the
recent drain_all_pages() logic depends on the return value of
try_to_free_pages(). Thus I've made a new patch and sent it. Please
take a look.
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 2:11 ` KOSAKI Motohiro
2011-03-24 2:21 ` Andrew Morton
@ 2011-03-24 4:19 ` Minchan Kim
2011-03-24 5:35 ` KOSAKI Motohiro
1 sibling, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 4:19 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, Mar 24, 2011 at 11:11 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> > Boo.
>> >> > You seems forgot why you introduced current all_unreclaimable() function.
>> >> > While hibernation, we can't trust all_unreclaimable.
>> >>
>> >> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
>> >> kswapd is freezed so it can't mark the zone->all_unreclaimable.
>> >> So I think hibernation can't be a problem.
>> >> Am I miss something?
>> >
>> > Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
>> > Can you please explain why do you like your one than mine?
>>
>> Just _simple_ :)
>> I don't want to change many lines although we can do it simple and very clear.
>>
>> >
>> > btw, Your one is very similar andrey's initial patch. If your one is
>> > better, I'd like to ack with andrey instead.
>>
>> When Andrey sent a patch, I though this as zone_reclaimable() is right
>> place to check it than out of zone_reclaimable. Why I didn't ack is
>> that Andrey can't explain root cause but you did so you persuade me.
>>
>> I don't mind if Andrey move the check in zone_reclaimable and resend
>> or I resend with concrete description.
>>
>> Anyway, most important thing is good description to show the root cause.
>> It is applied to your patch, too.
>> You should have written down root cause in description.
>
> honestly, I really dislike to use mixing zone->pages_scanned and
> zone->all_unreclaimable. because I think it's no simple. I don't
> think it's good taste nor easy to review. Even though you who VM
> expert didn't understand this issue at once, it's smell of too
> mess code.
>
> therefore, I prefore to take either 1) just remove the function or
> 2) just only check zone->all_unreclaimable and oom_killer_disabled
> instead zone->pages_scanned.
Nick's original goal is to prevent OOM killing until all zones we're
interested in are unreclaimable, and whether a zone is reclaimable or
not depends on kswapd. Nick's original solution just peeked at
zone->all_unreclaimable, but I made it dirty when we were considering
the kswapd freeze during hibernation. So I think we still need it to
handle the kswapd freeze problem, and we should add back the original
behavior we missed at that time, like below.
static bool zone_reclaimable(struct zone *zone)
{
if (zone->all_unreclaimable)
return false;
return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
If you remove the logic, the problem Nick addressed would show up
again. How about addressing the problem in your patch? If you remove
the logic, __alloc_pages_direct_reclaim() also loses its chance to
call drain_all_pages(). Of course, that was a side effect, but we
should handle it.
And my last concern: are we going the right way?
I think the fundamental cause of this problem is the race between
pages_scanned and all_unreclaimable, so isn't fixing the race the
right approach?
If that is hard or very costly, your approach and mine can be the
fallback.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 4:19 ` Minchan Kim
@ 2011-03-24 5:35 ` KOSAKI Motohiro
2011-03-24 5:53 ` Minchan Kim
2011-03-24 7:43 ` Minchan Kim
0 siblings, 2 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 5:35 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
Hi Minchan,
> Nick's original goal is to prevent OOM killing until all zone we're
> interested in are unreclaimable and whether zone is reclaimable or not
> depends on kswapd. And Nick's original solution is just peeking
> zone->all_unreclaimable but I made it dirty when we are considering
> kswapd freeze in hibernation. So I think we still need it to handle
> kswapd freeze problem and we should add original behavior we missed at
> that time like below.
>
> static bool zone_reclaimable(struct zone *zone)
> {
> if (zone->all_unreclaimable)
> return false;
>
> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
> If you remove the logic, the problem Nick addressed would be showed
> up, again. How about addressing the problem in your patch? If you
> remove the logic, __alloc_pages_direct_reclaim lose the chance calling
> dran_all_pages. Of course, it was a side effect but we should handle
> it.
Ok, you have successfully persuaded me. Losing the drain_all_pages()
chance carries a risk.
> And my last concern is we are going on right way?
> I think fundamental cause of this problem is page_scanned and
> all_unreclaimable is race so isn't the approach fixing the race right
> way?
Hmm..
If we can avoid a lock, we should, I think, for performance reasons.
Therefore I'd like to contain the issue in do_try_to_free_pages();
it's a slow path.
Is the following patch acceptable to you? It
o rewrites the description
o avoids mixing zone->all_unreclaimable and zone->pages_scanned
o avoids reintroducing the hibernation issue
o doesn't touch the fast path
> If it is hard or very costly, your and my approach will be fallback.
-----------------------------------------------------------------
From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sat, 14 May 2011 05:07:48 +0900
Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.
2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through a strange history. First, the following commit broke
the logic unintentionally.
2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
costly-order allocations
Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.
2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
return value when priority==0
But the logic didn't work when a 32bit highmem system went into
hibernation, and Minchan slightly changed the algorithm and fixed it.
2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
in direct reclaim path
But recently, Andrey Vagin found a new corner case. Look,
struct zone {
..
int all_unreclaimable;
..
unsigned long pages_scanned;
..
}
zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can reach a state
of zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
the current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.
Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone and it reaches zone->all_unreclaimable=1 easily. And
once it becomes all_unreclaimable=1, it never goes back to
all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
DEF_PRIORITY reclaim, and a-few-lru-pages>>DEF_PRIORITY is always 0.
That means no page scan at all!
Eventually, the oom-killer never works on such systems. That said, we
can't use zone->pages_scanned for this purpose. This patch restores
all_unreclaimable() to using zone->all_unreclaimable, as before, and
in addition adds an oom_killer_disabled check to avoid reintroducing
the issue of commit d1908362.
Reported-by: Andrey Vagin <avagin@openvz.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 24 +++++++++++++-----------
1 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..54ac548 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -41,6 +41,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/oom.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1988,17 +1989,12 @@ static bool zone_reclaimable(struct zone *zone)
return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
-/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
+/* All zones in zonelist are unreclaimable? */
static bool all_unreclaimable(struct zonelist *zonelist,
struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
- bool all_unreclaimable = true;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
continue;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
+ if (!zone->all_unreclaimable)
+ return false;
}
- return all_unreclaimable;
+ return true;
}
/*
@@ -2108,6 +2102,14 @@ out:
if (sc->nr_reclaimed)
return sc->nr_reclaimed;
+ /*
+ * As hibernation is going on, kswapd is freezed so that it can't mark
+ * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
+ * check.
+ */
+ if (oom_killer_disabled)
+ return 0;
+
/* top priority shrink_zones still had more to do? don't OOM, then */
if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
return 1;
--
1.6.5.2
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 5:35 ` KOSAKI Motohiro
@ 2011-03-24 5:53 ` Minchan Kim
2011-03-24 6:16 ` KOSAKI Motohiro
2011-03-24 7:43 ` Minchan Kim
1 sibling, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 5:53 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
Hi Kosaki,
On Thu, Mar 24, 2011 at 2:35 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Minchan,
>
>> Nick's original goal is to prevent OOM killing until all zone we're
>> interested in are unreclaimable and whether zone is reclaimable or not
>> depends on kswapd. And Nick's original solution is just peeking
>> zone->all_unreclaimable but I made it dirty when we are considering
>> kswapd freeze in hibernation. So I think we still need it to handle
>> kswapd freeze problem and we should add original behavior we missed at
>> that time like below.
>>
>> static bool zone_reclaimable(struct zone *zone)
>> {
>> if (zone->all_unreclaimable)
>> return false;
>>
>> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> }
>>
>> If you remove the logic, the problem Nick addressed would be showed
>> up, again. How about addressing the problem in your patch? If you
>> remove the logic, __alloc_pages_direct_reclaim lose the chance calling
>> dran_all_pages. Of course, it was a side effect but we should handle
>> it.
>
> Ok, you are successfull to persuade me. lost drain_all_pages() chance has
> a risk.
>
>> And my last concern is we are going on right way?
>
>
>> I think fundamental cause of this problem is page_scanned and
>> all_unreclaimable is race so isn't the approach fixing the race right
>> way?
>
> Hmm..
> If we can avoid lock, we should. I think. that's performance reason.
> therefore I'd like to cap the issue in do_try_to_free_pages(). it's
> slow path.
>
> Is the following patch acceptable to you? it is
> o rewrote the description
> o avoid mix to use zone->all_unreclaimable and zone->pages_scanned
> o avoid to reintroduce hibernation issue
> o don't touch fast path
>
>
>> If it is hard or very costly, your and my approach will be fallback.
>
> -----------------------------------------------------------------
> From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Sat, 14 May 2011 05:07:48 +0900
> Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
>
> all_unreclaimable check in direct reclaim has been introduced at 2.6.19
> by following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through strange history. firstly, following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I've found obvious meaningless code fragment and
> restored original intention by following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But, the logic didn't works when 32bit highmem system goes hibernation
> and Minchan slightly changed the algorithm and fixed it .
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But, recently, Andrey Vagin found the new corner case. Look,
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> variables nor protected by lock. Therefore zones can become a state
> of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
> current all_unreclaimable() return false even though
> zone->all_unreclaimabe=1.
>
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it become all_unreclaimable=1, it never restore all_unreclaimable=0.
> Why? if all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
> a-few-lru-pages>>DEF_PRIORITY always makes 0. that mean no page scan
> at all!
>
> Eventually, oom-killer never works on such systems. That said, we
> can't use zone->pages_scanned for this purpose. This patch restore
> all_unreclaimable() use zone->all_unreclaimable as old. and in addition,
> to add oom_killer_disabled check to avoid reintroduce the issue of
> commit d1908362.
>
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> mm/vmscan.c | 24 +++++++++++++-----------
> 1 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..54ac548 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -41,6 +41,7 @@
> #include <linux/memcontrol.h>
> #include <linux/delayacct.h>
> #include <linux/sysctl.h>
> +#include <linux/oom.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -1988,17 +1989,12 @@ static bool zone_reclaimable(struct zone *zone)
> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
> -/*
> - * As hibernation is going on, kswapd is freezed so that it can't mark
> - * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> - * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> - */
> +/* All zones in zonelist are unreclaimable? */
> static bool all_unreclaimable(struct zonelist *zonelist,
> struct scan_control *sc)
> {
> struct zoneref *z;
> struct zone *zone;
> - bool all_unreclaimable = true;
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> @@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> continue;
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> - if (zone_reclaimable(zone)) {
> - all_unreclaimable = false;
> - break;
> - }
> + if (!zone->all_unreclaimable)
> + return false;
> }
>
> - return all_unreclaimable;
> + return true;
> }
>
> /*
> @@ -2108,6 +2102,14 @@ out:
> if (sc->nr_reclaimed)
> return sc->nr_reclaimed;
>
> + /*
> + * As hibernation is going on, kswapd is freezed so that it can't mark
> + * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
> + * check.
> + */
> + if (oom_killer_disabled)
> + return 0;
> +
> /* top priority shrink_zones still had more to do? don't OOM, then */
> if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> return 1;
> --
> 1.6.5.2
>
Thanks for your effort, Kosaki.
But I still doubt this patch is good.
This patch causes early OOM killing during hibernation since it skips
the all_unreclaimable check.
Normally, hibernation needs a lot of memory, so page reclaim pressure
would be high on a small-memory system. So I don't like giving up
early.
Do you think my patch has a problem? Personally, I think it's very
simple and clear. :)
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 5:53 ` Minchan Kim
@ 2011-03-24 6:16 ` KOSAKI Motohiro
2011-03-24 6:32 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 6:16 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
Hi
> Thanks for your effort, Kosaki.
> But I still doubt this patch is good.
>
> This patch makes early oom killing in hibernation as it skip
> all_unreclaimable check.
> Normally, hibernation needs many memory so page_reclaim pressure
> would be big in small memory system. So I don't like early give up.
Wait. When does the big pressure occur? Hibernation reclaim pressure
(sc->nr_to_reclaim) depends on physical memory size, therefore the
pressure doesn't seem to depend on the system being small.
> Do you think my patch has a problem? Personally, I think it's very
> simple and clear. :)
To be honest, I dislike the following part. It's madness on top of madness.
static bool zone_reclaimable(struct zone *zone)
{
if (zone->all_unreclaimable)
return false;
return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
The function requires that a reviewer knows
o pages_scanned and all_unreclaimable are racy
o during hibernation, zone->all_unreclaimable can be a false negative,
but can't be a false positive.
And, a function comment of all_unreclaimable() says
/*
* As hibernation is going on, kswapd is freezed so that it can't mark
* the zone into all_unreclaimable. It can't handle OOM during hibernation.
* So let's check zone's unreclaimable in direct reclaim as well as kswapd.
*/
But now it is no longer a copy of the kswapd algorithm.
If you still strongly prefer this idea even after the above
explanation, please consider adding many more comments. I can't say
your current patch is readable/reviewable enough.
Thanks.
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 6:16 ` KOSAKI Motohiro
@ 2011-03-24 6:32 ` Minchan Kim
2011-03-24 7:03 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 6:32 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
>> Thanks for your effort, Kosaki.
>> But I still doubt this patch is good.
>>
>> This patch makes early oom killing in hibernation as it skip
>> all_unreclaimable check.
>> Normally, hibernation needs many memory so page_reclaim pressure
>> would be big in small memory system. So I don't like early give up.
>
> Wait. When occur big pressure? hibernation reclaim pressure
> (sc->nr_to_recliam) depend on physical memory size. therefore
> a pressure seems to don't depend on the size.
It depends on physical memory size and /sys/power/image_size.
If you tune the image size bigger, the reclaim pressure gets bigger.
>
>
>> Do you think my patch has a problem? Personally, I think it's very
>> simple and clear. :)
>
> To be honest, I dislike the following parts. It's madness on madness.
>
> static bool zone_reclaimable(struct zone *zone)
> {
> if (zone->all_unreclaimable)
> return false;
>
> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
>
> The function requires a reviewer to know
>
> o pages_scanned and all_unreclaimable are racy
Yes. That part should be written down in a comment.
> o at hibernation, zone->all_unreclaimable can be false negative,
> but can't be false positive.
The comment of all_unreclaimable already does explain it well, I think.
>
> And, a function comment of all_unreclaimable() says
>
> /*
> * As hibernation is going on, kswapd is freezed so that it can't mark
> * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> */
>
> But now it is no longer a copy of the kswapd algorithm.
The comment doesn't say it should be a copy of kswapd.
>
> If you still strongly prefer this idea even after the above explanation,
> please consider adding many more comments. I can't say your current
> patch is readable and reviewable enough.
My patch isn't a formal patch for merging but just a concept to show.
If you agree with the idea, of course, I will add more concrete comments
when I send the formal patch.
Before that, I would like to get your agreement. :)
If you solve my concern (early give-up in hibernation) in your patch, I
won't insist on my patch, either.
Thanks for the comment, Kosaki.
>
> Thanks.
>
>
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 6:32 ` Minchan Kim
@ 2011-03-24 7:03 ` KOSAKI Motohiro
2011-03-24 7:25 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 7:03 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Hi
> >
> >> Thanks for your effort, Kosaki.
> >> But I still doubt this patch is good.
> >>
> >> This patch makes oom killing in hibernation happen early, as it skips
> >> the all_unreclaimable check.
> >> Normally, hibernation needs a lot of memory, so page_reclaim pressure
> >> would be big on a small-memory system. So I don't like the early give-up.
> >
> > Wait. When does big pressure occur? Hibernation reclaim pressure
> > (sc->nr_to_reclaim) depends on physical memory size; therefore
> > the pressure doesn't seem to depend on the system size.
>
> It depends on physical memory size and /sys/power/image_size.
> If you tune the image size bigger, the reclaim pressure gets bigger.
Ok, _If_ I want.
However, I haven't seen desktop people customize it.
> >> Do you think my patch has a problem? Personally, I think it's very
> >> simple and clear. :)
> >
> > To be honest, I dislike the following parts. It's madness on madness.
> >
> > 	static bool zone_reclaimable(struct zone *zone)
> > 	{
> > 		if (zone->all_unreclaimable)
> > 			return false;
> >
> > 		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> > 	}
> >
> >
> > The function requires a reviewer to know
> >
> >  o pages_scanned and all_unreclaimable are racy
>
> Yes. That part should be written down in a comment.
>
> >  o at hibernation, zone->all_unreclaimable can be false negative,
> >    but can't be false positive.
>
> The comment of all_unreclaimable already does explain it well, I think.
Where is it?
> > And, a function comment of all_unreclaimable() says
> >
> > 	/*
> > 	 * As hibernation is going on, kswapd is freezed so that it can't mark
> > 	 * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> > 	 * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> > 	 */
> >
> > But now it is no longer a copy of the kswapd algorithm.
>
> The comment doesn't say it should be a copy of kswapd.
I meant the comment says

	 * So let's check zone's unreclaimable in direct reclaim as well as kswapd.

but now it isn't "as well as kswapd".
I think this is critically important. If people can't understand why the
algorithm was chosen, someone will break the code again sooner or later.
> > If you still strongly prefer this idea even after the above explanation,
> > please consider adding many more comments. I can't say your current
> > patch is readable and reviewable enough.
>
> My patch isn't a formal patch for merging but just a concept to show.
> If you agree with the idea, of course, I will add more concrete comments
> when I send the formal patch.
>
> Before that, I would like to get your agreement. :)
> If you solve my concern (early give-up in hibernation) in your patch, I
> won't insist on my patch, either.
Ok. Let's try.
Please consider why priority=0 is not enough. zone_reclaimable_pages(zone) * 6
is a conservative value that allows for multi-thread races. While one task
is reclaiming, others can allocate/free memory concurrently; therefore,
even after priority=0, we still have a chance of finding reclaimable pages
on the LRU.
But in the hibernation case, almost all tasks were frozen before hibernation
calls shrink_all_memory(); therefore, there is no race, and a priority=0
reclaim can cover all LRU pages.
Is this enough explanation for you?
>
> Thanks for the comment, Kosaki.
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 7:03 ` KOSAKI Motohiro
@ 2011-03-24 7:25 ` Minchan Kim
2011-03-24 7:28 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 7:25 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, Mar 24, 2011 at 4:03 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Hi
>> >
>> >> Thanks for your effort, Kosaki.
>> >> But I still doubt this patch is good.
>> >>
>> >> This patch makes oom killing in hibernation happen early, as it skips
>> >> the all_unreclaimable check.
>> >> Normally, hibernation needs a lot of memory, so page_reclaim pressure
>> >> would be big on a small-memory system. So I don't like the early give-up.
>> >
>> > Wait. When does big pressure occur? Hibernation reclaim pressure
>> > (sc->nr_to_reclaim) depends on physical memory size; therefore
>> > the pressure doesn't seem to depend on the system size.
>>
>> It depends on physical memory size and /sys/power/image_size.
>> If you tune the image size bigger, the reclaim pressure gets bigger.
>
> Ok, _If_ I want.
> However, I haven't seen desktop people customize it.
>
>
>> >> Do you think my patch has a problem? Personally, I think it's very
>> >> simple and clear. :)
>> >
>> > To be honest, I dislike the following parts. It's madness on madness.
>> >
>> > static bool zone_reclaimable(struct zone *zone)
>> > {
>> > if (zone->all_unreclaimable)
>> > return false;
>> >
>> > return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> > }
>> >
>> >
>> > The function requires a reviewer to know
>> >
>> > o pages_scanned and all_unreclaimable are racy
>>
>> Yes. That part should be written down in a comment.
>>
>> > o at hibernation, zone->all_unreclaimable can be false negative,
>> > but can't be false positive.
>>
>> The comment of all_unreclaimable already does explain it well, I think.
>
> Where is?
>
>
>> > And, a function comment of all_unreclaimable() says
>> >
>> > /*
>> > * As hibernation is going on, kswapd is freezed so that it can't mark
>> > * the zone into all_unreclaimable. It can't handle OOM during hibernation.
>> > * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>> > */
>> >
>> > But now it is no longer a copy of the kswapd algorithm.
>>
>> The comment doesn't say it should be a copy of kswapd.
>
> I meant the comment says
>
> 	 * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>
> but now it isn't "as well as kswapd".
>
> I think this is critically important. If people can't understand why the
> algorithm was chosen, someone will break the code again sooner or later.
>
>
>> > If you still strongly prefer this idea even after the above explanation,
>> > please consider adding many more comments. I can't say your current
>> > patch is readable and reviewable enough.
>>
>> My patch isn't a formal patch for merging but just a concept to show.
>> If you agree with the idea, of course, I will add more concrete comments
>> when I send the formal patch.
>>
>> Before that, I would like to get your agreement. :)
>> If you solve my concern (early give-up in hibernation) in your patch, I
>> won't insist on my patch, either.
>
> Ok. Let's try.
>
> Please consider why priority=0 is not enough. zone_reclaimable_pages(zone) * 6
> is a conservative value that allows for multi-thread races. While one task
> is reclaiming, others can allocate/free memory concurrently; therefore,
> even after priority=0, we still have a chance of finding reclaimable pages
> on the LRU.
> But in the hibernation case, almost all tasks were frozen before hibernation
> calls shrink_all_memory(); therefore, there is no race, and a priority=0
> reclaim can cover all LRU pages.
>
> Is this enough explanation for you?
For example, in a 4G desktop system:
32M fully scanned and failing to reclaim a page.
That's under 1% coverage.
Is it enough to give up?
I am not sure.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 7:25 ` Minchan Kim
@ 2011-03-24 7:28 ` KOSAKI Motohiro
2011-03-24 7:34 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 7:28 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> For example, in a 4G desktop system:
> 32M fully scanned and failing to reclaim a page.
> That's under 1% coverage.
?? I'm sorry, I haven't caught it.
Where does 32M come from?
> Is it enough to give up?
> I am not sure.
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 7:28 ` KOSAKI Motohiro
@ 2011-03-24 7:34 ` Minchan Kim
2011-03-24 7:41 ` Minchan Kim
2011-03-24 7:43 ` KOSAKI Motohiro
0 siblings, 2 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 7:34 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> For example, in a 4G desktop system:
>> 32M fully scanned and failing to reclaim a page.
>> That's under 1% coverage.
>
> ?? I'm sorry, I haven't caught it.
> Where does 32M come from?
(1<<12) * 4K + (1<<11) * 4K + ... + (1<<0) * 4K in shrink_zones.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 7:34 ` Minchan Kim
@ 2011-03-24 7:41 ` Minchan Kim
2011-03-24 7:43 ` KOSAKI Motohiro
1 sibling, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 7:41 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, Mar 24, 2011 at 4:34 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
>>> For example, in a 4G desktop system:
>>> 32M fully scanned and failing to reclaim a page.
>>> That's under 1% coverage.
>>
>> ?? I'm sorry, I haven't caught it.
>> Where does 32M come from?
>
> (1<<12) * 4K + (1<<11) * 4K + ... + (1<<0) * 4K in shrink_zones.
Stupid me.
Sorry, my calculation is totally wrong.
Your explanation was perfect.
Okay. I don't have any objection to your solution.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 7:34 ` Minchan Kim
2011-03-24 7:41 ` Minchan Kim
@ 2011-03-24 7:43 ` KOSAKI Motohiro
1 sibling, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 7:43 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin,
Johannes Weiner
> On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> For example, in a 4G desktop system:
> >> 32M fully scanned and failing to reclaim a page.
> >> That's under 1% coverage.
> >
> > ?? I'm sorry, I haven't caught it.
> > Where does 32M come from?
>
> (1<<12) * 4K + (1<<11) * 4K + ... + (1<<0) * 4K in shrink_zones.
(lru-pages>>12) + (lru-pages>>11) + (lru-pages>>10) ... =~ 2 * lru-pages
?
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-24 5:35 ` KOSAKI Motohiro
2011-03-24 5:53 ` Minchan Kim
@ 2011-03-24 7:43 ` Minchan Kim
1 sibling, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 7:43 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Nick Piggin, Johannes Weiner
On Thu, Mar 24, 2011 at 2:35 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Minchan,
>
>> Nick's original goal is to prevent OOM killing until all zones we're
>> interested in are unreclaimable, and whether a zone is reclaimable or not
>> depends on kswapd. Nick's original solution just peeked at
>> zone->all_unreclaimable, but I made it dirty when we were considering
>> the kswapd freeze in hibernation. So I think we still need it to handle
>> the kswapd-freeze problem, and we should add the original behavior we
>> missed at that time, like below.
>>
>> static bool zone_reclaimable(struct zone *zone)
>> {
>> if (zone->all_unreclaimable)
>> return false;
>>
>> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> }
>>
>> If you remove the logic, the problem Nick addressed would show up
>> again. How about addressing the problem in your patch? If you
>> remove the logic, __alloc_pages_direct_reclaim loses the chance to call
>> drain_all_pages. Of course, it was a side effect, but we should handle
>> it.
>
> Ok, you have successfully persuaded me. Losing the drain_all_pages() chance
> has a risk.
>
>> And my last concern is: are we going the right way?
>
>
>> I think the fundamental cause of this problem is that pages_scanned and
>> all_unreclaimable race, so isn't the approach of fixing the race the
>> right way?
>
> Hmm...
> If we can avoid a lock, we should, I think; that's for performance reasons.
> Therefore I'd like to cap the issue in do_try_to_free_pages(); it's a
> slow path.
>
> Is the following patch acceptable to you? It:
> o rewrites the description
> o avoids mixing use of zone->all_unreclaimable and zone->pages_scanned
> o avoids reintroducing the hibernation issue
> o doesn't touch the fast path
>
>
>> If it is hard or very costly, your approach and mine will be the fallback.
>
> -----------------------------------------------------------------
> From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Sat, 14 May 2011 05:07:48 +0900
> Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
>
> all_unreclaimable check in direct reclaim was introduced in 2.6.19
> by the following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through a strange history. First, the following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I found an obviously meaningless code fragment and
> restored the original intention with the following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But the logic didn't work when a 32-bit highmem system went into
> hibernation, so Minchan slightly changed the algorithm and fixed it.
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But recently, Andrey Vagin found a new corner case. Look:
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock. Therefore a zone can reach a state
> where zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
> the current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
>
> Is this an ignorable minor issue? No. Unfortunately, x86 has a very
> small DMA zone, and it becomes zone->all_unreclaimable=1 easily. And
> once it becomes all_unreclaimable=1, it never restores all_unreclaimable=0.
> Why? If all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim, and
> a-few-lru-pages>>DEF_PRIORITY always yields 0. That means no page scan
> at all!
>
> Eventually, the oom-killer never works on such systems. That said, we
> can't use zone->pages_scanned for this purpose. This patch restores
> all_unreclaimable()'s use of zone->all_unreclaimable, as before, and in
> addition adds an oom_killer_disabled check to avoid reintroducing the
> issue of commit d1908362.
>
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Thanks for the good discussion, Kosaki.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-22 11:05 ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
2011-03-22 14:49 ` Minchan Kim
@ 2011-03-23 7:41 ` KAMEZAWA Hiroyuki
2011-03-23 7:55 ` KOSAKI Motohiro
1 sibling, 1 reply; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23 7:41 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
Nick Piggin, Minchan Kim, Johannes Weiner
On Tue, 22 Mar 2011 20:05:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> all_unreclaimable check in direct reclaim was introduced in 2.6.19
> by the following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through a strange history. First, the following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I found an obviously meaningless code fragment and
> restored the original intention with the following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But the logic didn't work when a 32-bit highmem system went into
> hibernation, so Minchan slightly changed the algorithm and fixed it.
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But recently, Andrey Vagin found a new corner case. Look:
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock. Therefore a zone can reach a state
> where zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
> the current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
>
> Is this an ignorable minor issue? No. Unfortunately, x86 has a very
> small DMA zone, and it becomes zone->all_unreclaimable=1 easily. And
> once it becomes all_unreclaimable, it never returns to all_unreclaimable=0,
> because it typically doesn't have reclaimable pages.
>
> Eventually, the oom-killer never works on such systems. Let's remove
> this problematic logic completely.
>
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
IIUC, I saw the phenomenon you pointed out, as
- all zone->all_unreclaimable = yes
- zone_reclaimable() returns true
- no pgscan proceeds
on a swapless system. So I'd like to vote for this patch.
But hmm... what happens if all pages are isolated or locked and now under
freeing? I think we should have alternative safeguard logic to avoid
calling the oom-killer. Hmm.
Thanks,
-Kame
* Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely
2011-03-23 7:41 ` KAMEZAWA Hiroyuki
@ 2011-03-23 7:55 ` KOSAKI Motohiro
0 siblings, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23 7:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, Nick Piggin, Minchan Kim,
Johannes Weiner
> > Reported-by: Andrey Vagin <avagin@openvz.org>
> > Cc: Nick Piggin <npiggin@kernel.dk>
> > Cc: Minchan Kim <minchan.kim@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> IIUC, I saw the pehnomenon which you pointed out, as
> - all zone->all_unreclaimable = yes
> - zone_reclaimable() returns true
> - no pgscan proceeds.
>
> on a swapless system. So, I'd like to vote for this patch.
>
> But hmm...what happens all of pages are isolated or locked and now under freeing ?
> I think we should have alternative safe-guard logic for avoiding to call
> oom-killer. Hmm.
Yes, this patch has a small risk. But 1) this logic didn't work for about
two years (see the changelog), and 2) memcg hasn't used this logic and I
haven't gotten any bug reports from the memcg developers. Therefore I
decided to take the simplest way.
Of course, I'll add another protection if I get any regression report.
* [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-22 11:04 ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
2011-03-22 11:05 ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
@ 2011-03-22 11:06 ` KOSAKI Motohiro
2011-03-23 7:42 ` KAMEZAWA Hiroyuki
2011-03-24 15:27 ` Minchan Kim
2011-03-22 11:08 ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
` (2 subsequent siblings)
4 siblings, 2 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:06 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Luis Claudio R. Goncalves
This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
The commit dramatically improved the oom killer logic when a fork bomb
occurs. But I've found it has a nasty corner case: the cpu cgroup
has a strange default RT runtime. It's 0! That said, if a process
under a cpu cgroup is promoted to the RT scheduling class, the process
never runs at all.
Eventually, the kernel may hang when an oom kill occurs.
The author needs to resubmit it with a knob added, disabled by
default, if he really needs this feature.
Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 27 ---------------------------
1 files changed, 0 insertions(+), 27 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3100bc5..739dee4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -84,24 +84,6 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
#endif /* CONFIG_NUMA */
/*
- * If this is a system OOM (not a memcg OOM) and the task selected to be
- * killed is not already running at high (RT) priorities, speed up the
- * recovery by boosting the dying task to the lowest FIFO priority.
- * That helps with the recovery and avoids interfering with RT tasks.
- */
-static void boost_dying_task_prio(struct task_struct *p,
- struct mem_cgroup *mem)
-{
- struct sched_param param = { .sched_priority = 1 };
-
- if (mem)
- return;
-
- if (!rt_task(p))
- sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
-}
-
-/*
* The process p may have detached its own ->mm while exiting or through
* use_mm(), but one or more of its subthreads may still have a valid
* pointer. Return p, or any of its subthreads with a valid ->mm, with
@@ -452,13 +434,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
- /*
- * We give our sacrificial lamb high priority and access to
- * all the memory it needs. That way it should be able to
- * exit() and clear out its resources quickly...
- */
- boost_dying_task_prio(p, mem);
-
return 0;
}
#undef K
@@ -482,7 +457,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
if (p->flags & PF_EXITING) {
set_tsk_thread_flag(p, TIF_MEMDIE);
- boost_dying_task_prio(p, mem);
return 0;
}
@@ -701,7 +675,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
if (fatal_signal_pending(current)) {
set_thread_flag(TIF_MEMDIE);
- boost_dying_task_prio(current, NULL);
return;
}
--
1.6.5.2
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-22 11:06 ` [PATCH 2/5] Revert "oom: give the dying task a higher priority" KOSAKI Motohiro
@ 2011-03-23 7:42 ` KAMEZAWA Hiroyuki
2011-03-23 13:40 ` Luis Claudio R. Goncalves
2011-03-24 15:27 ` Minchan Kim
1 sibling, 1 reply; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23 7:42 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
Luis Claudio R. Goncalves
On Tue, 22 Mar 2011 20:06:48 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
>
> The commit dramatically improved the oom killer logic when a fork bomb
> occurs. But I've found it has a nasty corner case: the cpu cgroup
> has a strange default RT runtime. It's 0! That said, if a process
> under a cpu cgroup is promoted to the RT scheduling class, the process
> never runs at all.
>
> Eventually, the kernel may hang when an oom kill occurs.
>
> The author needs to resubmit it with a knob added, disabled by
> default, if he really needs this feature.
>
> Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-23 7:42 ` KAMEZAWA Hiroyuki
@ 2011-03-23 13:40 ` Luis Claudio R. Goncalves
2011-03-24 0:06 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Luis Claudio R. Goncalves @ 2011-03-23 13:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins
On Wed, Mar 23, 2011 at 04:42:29PM +0900, KAMEZAWA Hiroyuki wrote:
| On Tue, 22 Mar 2011 20:06:48 +0900 (JST)
| KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
|
| > This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
| >
| > The commit dramatically improve oom killer logic when fork-bomb
| > occur. But, I've found it has nasty corner case. Now cpu cgroup
| > has strange default RT runtime. It's 0! That said, if a process
| > under cpu cgroup promote RT scheduling class, the process never
| > run at all.
| >
| > Eventually, kernel may hang up when oom kill occur.
| >
| > The author need to resubmit it as adding knob and disabled
| > by default if he really need this feature.
| >
| > Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
| > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
|
| Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The original patch was written to fix an issue observed in 2.6.24.7-rt.
As the logic sounded useful, I ported it upstream. Anyway, I am trying
a few ideas to rework that patch. In the meantime, I'm fine with
reverting the commit.
Acked-by: Luis Claudio R. Goncalves <lgoncalv@uudg.org>
--
[ Luis Claudio R. Goncalves Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-23 13:40 ` Luis Claudio R. Goncalves
@ 2011-03-24 0:06 ` KOSAKI Motohiro
0 siblings, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-24 0:06 UTC (permalink / raw)
To: Luis Claudio R. Goncalves
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
linux-mm, Andrey Vagin, Hugh Dickins
> On Wed, Mar 23, 2011 at 04:42:29PM +0900, KAMEZAWA Hiroyuki wrote:
> | On Tue, 22 Mar 2011 20:06:48 +0900 (JST)
> | KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> |
> | > This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
> | >
> | > The commit dramatically improve oom killer logic when fork-bomb
> | > occur. But, I've found it has nasty corner case. Now cpu cgroup
> | > has strange default RT runtime. It's 0! That said, if a process
> | > under cpu cgroup promote RT scheduling class, the process never
> | > run at all.
> | >
> | > Eventually, kernel may hang up when oom kill occur.
> | >
> | > The author need to resubmit it as adding knob and disabled
> | > by default if he really need this feature.
> | >
> | > Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
> | > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> |
> | Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> The original patch was written to fix an issue observed in 2.6.24.7-rt.
> As the logic sounded useful, I ported it to upstream. Anyway,I am trying
> a few ideas to rework that patch. In the meantime, I'm pretty fine with
> reverting the commit.
>
> Acked-by: Luis Claudio R. Gonçalves <lgoncalv@uudg.org>
OK, then I'll drop [patch 3/5] too. I'd like to focus the discussion on your ideas.
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-22 11:06 ` [PATCH 2/5] Revert "oom: give the dying task a higher priority" KOSAKI Motohiro
2011-03-23 7:42 ` KAMEZAWA Hiroyuki
@ 2011-03-24 15:27 ` Minchan Kim
2011-03-28 9:48 ` KOSAKI Motohiro
2011-03-28 9:51 ` Peter Zijlstra
1 sibling, 2 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 15:27 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Luis Claudio R. Goncalves, Peter Zijlstra
On Tue, Mar 22, 2011 at 08:06:48PM +0900, KOSAKI Motohiro wrote:
> This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
>
> The commit dramatically improve oom killer logic when fork-bomb
> occur. But, I've found it has nasty corner case. Now cpu cgroup
> has strange default RT runtime. It's 0! That said, if a process
> under cpu cgroup promote RT scheduling class, the process never
> run at all.
>
> Eventually, kernel may hang up when oom kill occur.
>
> The author need to resubmit it as adding knob and disabled
> by default if he really need this feature.
>
> Cc: Luis Claudio R. Goncalves <lclaudio@uudg.org>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Just a comment below.
> ---
> mm/oom_kill.c | 27 ---------------------------
> 1 files changed, 0 insertions(+), 27 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3100bc5..739dee4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -84,24 +84,6 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> #endif /* CONFIG_NUMA */
>
> /*
> - * If this is a system OOM (not a memcg OOM) and the task selected to be
> - * killed is not already running at high (RT) priorities, speed up the
> - * recovery by boosting the dying task to the lowest FIFO priority.
> - * That helps with the recovery and avoids interfering with RT tasks.
> - */
> -static void boost_dying_task_prio(struct task_struct *p,
> - struct mem_cgroup *mem)
> -{
> - struct sched_param param = { .sched_priority = 1 };
> -
> - if (mem)
> - return;
> -
> - if (!rt_task(p))
> - sched_setscheduler_nocheck(p, SCHED_FIFO, ¶m);
> -}
> -
> -/*
> * The process p may have detached its own ->mm while exiting or through
> * use_mm(), but one or more of its subthreads may still have a valid
> * pointer. Return p, or any of its subthreads with a valid ->mm, with
> @@ -452,13 +434,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> set_tsk_thread_flag(p, TIF_MEMDIE);
> force_sig(SIGKILL, p);
>
> - /*
> - * We give our sacrificial lamb high priority and access to
> - * all the memory it needs. That way it should be able to
> - * exit() and clear out its resources quickly...
> - */
> - boost_dying_task_prio(p, mem);
> -
Before 93b43fa5508 was merged, we had the following routine.
+static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
p = find_lock_task_mm(p);
if (!p) {
@@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);
- p->rt.time_slice = HZ; <<---- THIS
+
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
+
+ /*
+ * We give our sacrificial lamb high priority and access to
+ * all the memory it needs. That way it should be able to
+ * exit() and clear out its resources quickly...
+ */
+ boost_dying_task_prio(p, mem);
+
return 0;
}
At that time, I thought that routine was meaningless under a non-RT scheduler,
so I Cc'd Peter but didn't get an answer. I just want to confirm it.
Do you still think it's meaningless?
Is that why you removed it when you reverted 93b43fa5508?
Then this isn't just a revert patch, but a revert plus removal of meaningless code.
-
Kind regards,
Minchan Kim
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-24 15:27 ` Minchan Kim
@ 2011-03-28 9:48 ` KOSAKI Motohiro
2011-03-28 12:28 ` Minchan Kim
2011-03-28 9:51 ` Peter Zijlstra
1 sibling, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-28 9:48 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, Peter Zijlstra
Hi
> @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> K(get_mm_counter(p->mm, MM_FILEPAGES)));
> task_unlock(p);
>
> - p->rt.time_slice = HZ; <<---- THIS
> +
> set_tsk_thread_flag(p, TIF_MEMDIE);
> force_sig(SIGKILL, p);
> +
> + /*
> + * We give our sacrificial lamb high priority and access to
> + * all the memory it needs. That way it should be able to
> + * exit() and clear out its resources quickly...
> + */
> + boost_dying_task_prio(p, mem);
> +
> return 0;
> }
>
> At that time, I thought that routine is meaningless in non-RT scheduler.
> So I Cced Peter but don't get the answer.
> I just want to confirm it.
>
> Do you still think it's meaningless?
In short, yes.
> so you remove it when you revert 93b43fa5508?
> Then, this isn't just revert patch but revert + killing meaningless code patch.
If you want, I can rename the patch title. That said, we can't revert
93b43fa5508 cleanly; several patches depend on it, so I reverted it
manually. And at that point I didn't want to resurrect meaningless logic.
Anyway, it doesn't matter much: Luis is preparing new patches, so we will
get the same end result. :)
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 9:48 ` KOSAKI Motohiro
@ 2011-03-28 12:28 ` Minchan Kim
0 siblings, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-28 12:28 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Luis Claudio R. Goncalves, Peter Zijlstra
On Mon, Mar 28, 2011 at 06:48:13PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> > K(get_mm_counter(p->mm, MM_FILEPAGES)));
> > task_unlock(p);
> >
> > - p->rt.time_slice = HZ; <<---- THIS
> > +
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > force_sig(SIGKILL, p);
> > +
> > + /*
> > + * We give our sacrificial lamb high priority and access to
> > + * all the memory it needs. That way it should be able to
> > + * exit() and clear out its resources quickly...
> > + */
> > + boost_dying_task_prio(p, mem);
> > +
> > return 0;
> > }
> >
> > At that time, I thought that routine is meaningless in non-RT scheduler.
> > So I Cced Peter but don't get the answer.
> > I just want to confirm it.
> >
> > Do you still think it's meaningless?
>
> In short, yes.
>
>
> > so you remove it when you revert 93b43fa5508?
> > Then, this isn't just revert patch but revert + killing meaningless code patch.
>
> If you want, I'd like to rename a patch title. That said, we can't revert
> 93b43fa5508 simple cleanly, several patches depend on it. therefore I
> reverted it manualy. and at that time, I don't want to resurrect
> meaningless logic. anyway it's no matter. Luis is preparing new patches.
> therefore we will get the same end result. :)
I don't mind either. :)
I just wanted to confirm that the logic is meaningless.
Thanks.
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-24 15:27 ` Minchan Kim
2011-03-28 9:48 ` KOSAKI Motohiro
@ 2011-03-28 9:51 ` Peter Zijlstra
2011-03-28 12:21 ` Minchan Kim
1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-03-28 9:51 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves
On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
>
> At that time, I thought that routine is meaningless in non-RT scheduler.
> So I Cced Peter but don't get the answer.
> I just want to confirm it.
Probably lost somewhere in the mess that is my inbox :/, what is the
full question?
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 9:51 ` Peter Zijlstra
@ 2011-03-28 12:21 ` Minchan Kim
2011-03-28 12:28 ` Peter Zijlstra
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-28 12:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves
Hi Peter,
On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> >
> > At that time, I thought that routine is meaningless in non-RT scheduler.
> > So I Cced Peter but don't get the answer.
> > I just want to confirm it.
>
> Probably lost somewhere in the mess that is my inbox :/, what is the
> full question?
The question is: we had a routine which set rt.time_slice to HZ to
accelerate task exit. When we applied 93b43fa5508, we decided it was no
longer effective for normal tasks, so we removed it. Is that right?
KOSAKI is about to revert 93b43fa5508 to track down the problem in this
thread, and Luis said he has another solution to replace 93b43fa5508.
If the rt.time_slice handling is effective, we should restore it until
Luis's patch is merged.
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 12:21 ` Minchan Kim
@ 2011-03-28 12:28 ` Peter Zijlstra
2011-03-28 12:40 ` Minchan Kim
0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-03-28 12:28 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves
On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> Hi Peter,
>
> On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > >
> > > At that time, I thought that routine is meaningless in non-RT scheduler.
> > > So I Cced Peter but don't get the answer.
> > > I just want to confirm it.
> >
> > Probably lost somewhere in the mess that is my inbox :/, what is the
> > full question?
>
> The question is we had a routine which change rt.time_slice with HZ to
> accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
> any more about normal task. So we removed it. Is it right?
rt.time_slice is only relevant to SCHED_RR, since you seem to use
SCHED_FIFO (which runs for as long as the task is runnable), its
completely irrelevant.
> And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
> and Luis said he has a another solution to replace 93b43fa5508.
> If rt.time_slice handleing is effective, we should restore it until Luis's patch
> will be merged.
Right, so only SCHED_RR is affected by time_slice, it will be
decremented on tick (so anything that avoids ticks will also avoid the
decrement) and once it reaches 0 the task will be queued at the tail of
its static priority and reset the slice. If there is no other task on
that same priority we'll again schedule that task.
In short, don't use SCHED_RR and don't worry about time_slice.
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 12:28 ` Peter Zijlstra
@ 2011-03-28 12:40 ` Minchan Kim
2011-03-28 13:10 ` Luis Claudio R. Goncalves
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-28 12:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves
On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> > Hi Peter,
> >
> > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > > >
> > > > At that time, I thought that routine is meaningless in non-RT scheduler.
> > > > So I Cced Peter but don't get the answer.
> > > > I just want to confirm it.
> > >
> > > Probably lost somewhere in the mess that is my inbox :/, what is the
> > > full question?
> >
> > The question is we had a routine which change rt.time_slice with HZ to
> > accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
> > any more about normal task. So we removed it. Is it right?
>
> rt.time_slice is only relevant to SCHED_RR, since you seem to use
> SCHED_FIFO (which runs for as long as the task is runnable), its
> completely irrelevant.
>
> > And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
> > and Luis said he has a another solution to replace 93b43fa5508.
> > If rt.time_slice handleing is effective, we should restore it until Luis's patch
> > will be merged.
>
> Right, so only SCHED_RR is affected by time_slice, it will be
> decremented on tick (so anything that avoids ticks will also avoid the
> decrement) and once it reaches 0 the task will be queued at the tail of
> its static priority and reset the slice. If there is no other task on
> that same priority we'll again schedule that task.
>
> In short, don't use SCHED_RR and don't worry about time_slice.
There was meaningless code in there. I guess it was in there from CFS.
Thanks for the explanation, Peter.
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 12:40 ` Minchan Kim
@ 2011-03-28 13:10 ` Luis Claudio R. Goncalves
2011-03-28 13:18 ` Peter Zijlstra
2011-03-28 13:48 ` Minchan Kim
0 siblings, 2 replies; 59+ messages in thread
From: Luis Claudio R. Goncalves @ 2011-03-28 13:10 UTC (permalink / raw)
To: Minchan Kim
Cc: Peter Zijlstra, KOSAKI Motohiro, linux-kernel, Andrew Morton,
David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki
On Mon, Mar 28, 2011 at 09:40:25PM +0900, Minchan Kim wrote:
| On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
| > On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
| > > Hi Peter,
| > >
| > > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
| > > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
| > > > >
| > > > > At that time, I thought that routine is meaningless in non-RT scheduler.
| > > > > So I Cced Peter but don't get the answer.
| > > > > I just want to confirm it.
| > > >
| > > > Probably lost somewhere in the mess that is my inbox :/, what is the
| > > > full question?
| > >
| > > The question is we had a routine which change rt.time_slice with HZ to
| > > accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
| > > any more about normal task. So we removed it. Is it right?
| >
| > rt.time_slice is only relevant to SCHED_RR, since you seem to use
| > SCHED_FIFO (which runs for as long as the task is runnable), its
| > completely irrelevant.
| >
| > > And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
| > > and Luis said he has a another solution to replace 93b43fa5508.
| > > If rt.time_slice handleing is effective, we should restore it until Luis's patch
| > > will be merged.
| >
| > Right, so only SCHED_RR is affected by time_slice, it will be
| > decremented on tick (so anything that avoids ticks will also avoid the
| > decrement) and once it reaches 0 the task will be queued at the tail of
| > its static priority and reset the slice. If there is no other task on
| > that same priority we'll again schedule that task.
| >
| > In short, don't use SCHED_RR and don't worry about time_slice.
|
| There was meaningless code in there. I guess it was in there from CFS.
| Thanks for the explanation, Peter.
Yes, it was CFS related:
p = find_lock_task_mm(p);
...
p->rt.time_slice = HZ; <<---- THIS
Peter, would that be effective in boosting the priority of the dying task?
I mean, in the context of SCHED_OTHER tasks, would it really help the dying
task get scheduled sooner so it can release its resources? If so, as we
removed that code in commit 93b43fa5508, we should re-add the old code.
Luis
--
[ Luis Claudio R. Goncalves Red Hat - Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 13:10 ` Luis Claudio R. Goncalves
@ 2011-03-28 13:18 ` Peter Zijlstra
2011-03-28 13:56 ` Luis Claudio R. Goncalves
2011-03-29 2:46 ` KOSAKI Motohiro
2011-03-28 13:48 ` Minchan Kim
1 sibling, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-03-28 13:18 UTC (permalink / raw)
To: Luis Claudio R. Goncalves
Cc: Minchan Kim, KOSAKI Motohiro, linux-kernel, Andrew Morton,
David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki
On Mon, 2011-03-28 at 10:10 -0300, Luis Claudio R. Goncalves wrote:
> | There was meaningless code in there. I guess it was in there from CFS.
> | Thanks for the explanation, Peter.
>
> Yes, it was CFS related:
>
> p = find_lock_task_mm(p);
> ...
> p->rt.time_slice = HZ; <<---- THIS
CFS has never used rt.time_slice, that's always been a pure SCHED_RR
thing.
> Peter, would that be effective to boost the priority of the dying task?
The thing you're currently doing, making it SCHED_FIFO ?
> I mean, in the context of SCHED_OTHER tasks would it really help the dying
> task to be scheduled sooner to release its resources?
That very much depends on how all this stuff works, I guess if everybody
serializes on OOM and only the first will actually kill a task and all
the waiting tasks will try to allocate a page again before also doing
the OOM thing, and the pending tasks are woken after the OOM target task
has completed dying.. then I don't see much point in boosting things,
since everybody interested in memory will block and eventually only the
dying task will be left running.
Its been a very long while since I stared at the OOM code..
> If so, as we remove
> the code in commit 93b43fa5508 we should re-add that old code.
It doesn't make any sense to fiddle with rt.time_slice afaict.
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 13:18 ` Peter Zijlstra
@ 2011-03-28 13:56 ` Luis Claudio R. Goncalves
2011-03-29 2:46 ` KOSAKI Motohiro
1 sibling, 0 replies; 59+ messages in thread
From: Luis Claudio R. Goncalves @ 2011-03-28 13:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Minchan Kim, KOSAKI Motohiro, linux-kernel, Andrew Morton,
David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki
On Mon, Mar 28, 2011 at 03:18:13PM +0200, Peter Zijlstra wrote:
| On Mon, 2011-03-28 at 10:10 -0300, Luis Claudio R. Goncalves wrote:
| > | There was meaningless code in there. I guess it was in there from CFS.
| > | Thanks for the explanation, Peter.
| >
| > Yes, it was CFS related:
| >
| > p = find_lock_task_mm(p);
| > ...
| > p->rt.time_slice = HZ; <<---- THIS
|
| CFS has never used rt.time_slice, that's always been a pure SCHED_RR
| thing.
|
| > Peter, would that be effective to boost the priority of the dying task?
|
| The thing you're currently doing, making it SCHED_FIFO ?
I meant the p->rt.time_slice line, but you already answered my question.
Thanks :)
| > I mean, in the context of SCHED_OTHER tasks would it really help the dying
| > task to be scheduled sooner to release its resources?
|
| That very much depends on how all this stuff works, I guess if everybody
| serializes on OOM and only the first will actually kill a task and all
| the waiting tasks will try to allocate a page again before also doing
| the OOM thing, and the pending tasks are woken after the OOM target task
| has completed dying.. then I don't see much point in boosting things,
| since everybody interested in memory will block and eventually only the
| dying task will be left running.
|
| Its been a very long while since I stared at the OOM code..
|
| > If so, as we remove
| > the code in commit 93b43fa5508 we should re-add that old code.
|
| It doesn't make any sense to fiddle with rt.time_slice afaict.
---end quoted text---
--
[ Luis Claudio R. Goncalves Red Hat - Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 13:18 ` Peter Zijlstra
2011-03-28 13:56 ` Luis Claudio R. Goncalves
@ 2011-03-29 2:46 ` KOSAKI Motohiro
1 sibling, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-29 2:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kosaki.motohiro, Luis Claudio R. Goncalves, Minchan Kim,
linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
Hi
> > I mean, in the context of SCHED_OTHER tasks would it really help the dying
> > task to be scheduled sooner to release its resources?
>
> That very much depends on how all this stuff works, I guess if everybody
> serializes on OOM and only the first will actually kill a task and all
> the waiting tasks will try to allocate a page again before also doing
> the OOM thing, and the pending tasks are woken after the OOM target task
> has completed dying.. then I don't see much point in boosting things,
> since everybody interested in memory will block and eventually only the
> dying task will be left running.
Probably I can answer this question. When OOM occurs, the kernel has very
few pages (typically 10-100), but not zero. Therefore a bloody page-in vs
page-out battle (aka allocation vs free battle) keeps running.
In other words, as long as we have multiple CPUs or per-CPU page queues,
we never see the page cache become completely empty.
Therefore the tasks that were not killed don't sleep completely; their
allocations have a very small chance of success (but they almost always
fail, because the pages are stolen by another task).
Before Luis's patch, a kernel livelock on OOM might take 30 minutes to
resolve; after his patch, it resolves within 1 second. That's a big
difference for human response time. That's the test result.
Thanks.
* Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"
2011-03-28 13:10 ` Luis Claudio R. Goncalves
2011-03-28 13:18 ` Peter Zijlstra
@ 2011-03-28 13:48 ` Minchan Kim
1 sibling, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-28 13:48 UTC (permalink / raw)
To: Luis Claudio R. Goncalves
Cc: Peter Zijlstra, KOSAKI Motohiro, linux-kernel, Andrew Morton,
David Rientjes, Linus Torvalds, Rik van Riel, Oleg Nesterov,
linux-mm, Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki
On Mon, Mar 28, 2011 at 10:10:29AM -0300, Luis Claudio R. Goncalves wrote:
> On Mon, Mar 28, 2011 at 09:40:25PM +0900, Minchan Kim wrote:
> | On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
> | > On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> | > > Hi Peter,
> | > >
> | > > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> | > > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> | > > > >
> | > > > > At that time, I thought that routine is meaningless in non-RT scheduler.
> | > > > > So I Cced Peter but don't get the answer.
> | > > > > I just want to confirm it.
> | > > >
> | > > > Probably lost somewhere in the mess that is my inbox :/, what is the
> | > > > full question?
> | > >
> | > > The question is we had a routine which change rt.time_slice with HZ to
> | > > accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
> | > > any more about normal task. So we removed it. Is it right?
> | >
> | > rt.time_slice is only relevant to SCHED_RR, since you seem to use
> | > SCHED_FIFO (which runs for as long as the task is runnable), its
> | > completely irrelevant.
> | >
> | > > And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
> | > > and Luis said he has a another solution to replace 93b43fa5508.
> | > > If rt.time_slice handleing is effective, we should restore it until Luis's patch
> | > > will be merged.
> | >
> | > Right, so only SCHED_RR is affected by time_slice, it will be
> | > decremented on tick (so anything that avoids ticks will also avoid the
> | > decrement) and once it reaches 0 the task will be queued at the tail of
> | > its static priority and reset the slice. If there is no other task on
> | > that same priority we'll again schedule that task.
> | >
> | > In short, don't use SCHED_RR and don't worry about time_slice.
> |
> | There was meaningless code in there. I guess it was in there from CFS.
> | Thanks for the explanation, Peter.
>
> Yes, it was CFS related:
I think it wasn't related to CFS but to the O(1) scheduler.
I guess that when we replaced O(1) with CFS, the leftover code remained.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 3/5] oom: create oom autogroup
2011-03-22 11:04 ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
2011-03-22 11:05 ` [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely KOSAKI Motohiro
2011-03-22 11:06 ` [PATCH 2/5] Revert "oom: give the dying task a higher priority" KOSAKI Motohiro
@ 2011-03-22 11:08 ` KOSAKI Motohiro
2011-03-22 23:21 ` Minchan Kim
2011-03-22 11:08 ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
2011-03-22 11:09 ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
4 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:08 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Mike Galbraith
When plenty of processes (e.g. a fork bomb) are running, the TIF_MEMDIE task
never exits; at least, to a human it feels like it never does. Therefore the
kernel appears to hang.
"perf sched" tell us a hint.
------------------------------------------------------------------------------
Task | Runtime ms | Average delay ms | Maximum delay ms |
------------------------------------------------------------------------------
python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
...
...
A process flood causes crazy scheduler delays, and then the victim process
can't run enough. Grr. What should we do?
Fortunately, we already have an anti-process-flood framework: autogroup!
This patch reuses that framework to avoid kernel livelock.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/oom.h | 1 +
include/linux/sched.h | 4 ++++
init/main.c | 2 ++
kernel/sched_autogroup.c | 4 ++--
mm/oom_kill.c | 23 +++++++++++++++++++++++
5 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..86bcea3 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,6 +67,7 @@ extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long uptime);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern void oom_init(void);
/* sysctls */
extern int sysctl_oom_dump_tasks;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fc7ed..bdaad3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1947,6 +1947,8 @@ int sched_rt_handler(struct ctl_table *table, int write,
#ifdef CONFIG_SCHED_AUTOGROUP
extern unsigned int sysctl_sched_autogroup_enabled;
+extern struct autogroup *autogroup_create(void);
+extern void autogroup_move_group(struct task_struct *p, struct autogroup *ag);
extern void sched_autogroup_create_attach(struct task_struct *p);
extern void sched_autogroup_detach(struct task_struct *p);
extern void sched_autogroup_fork(struct signal_struct *sig);
@@ -1956,6 +1958,8 @@ extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_fil
extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
#endif
#else
+static inline struct autogroup *autogroup_create(void) { return NULL; }
+static inline void autogroup_move_group(struct task_struct *p, struct autogroup *ag) {}
static inline void sched_autogroup_create_attach(struct task_struct *p) { }
static inline void sched_autogroup_detach(struct task_struct *p) { }
static inline void sched_autogroup_fork(struct signal_struct *sig) { }
diff --git a/init/main.c b/init/main.c
index 4a9479e..2c6e8da 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
#include <linux/shmem_fs.h>
#include <linux/slab.h>
#include <linux/perf_event.h>
+#include <linux/oom.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -549,6 +550,7 @@ asmlinkage void __init start_kernel(void)
gfp_allowed_mask = __GFP_BITS_MASK;
kmem_cache_init_late();
+ oom_init();
/*
* HACK ALERT! This is early. We're enabling the console before
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 5946ac5..6a1a2c4 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -63,7 +63,7 @@ static inline struct autogroup *autogroup_task_get(struct task_struct *p)
static void free_rt_sched_group(struct task_group *tg);
#endif
-static inline struct autogroup *autogroup_create(void)
+struct autogroup *autogroup_create(void)
{
struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
struct task_group *tg;
@@ -143,7 +143,7 @@ autogroup_task_group(struct task_struct *p, struct task_group *tg)
return tg;
}
-static void
+void
autogroup_move_group(struct task_struct *p, struct autogroup *ag)
{
struct autogroup *prev;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 739dee4..2519e6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,6 +38,28 @@ int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
static DEFINE_SPINLOCK(zone_scan_lock);
+#ifdef CONFIG_SCHED_AUTOGROUP
+struct autogroup *oom_ag;
+
+void __init oom_init(void)
+{
+ oom_ag = autogroup_create();
+}
+
+static void oom_move_oom_ag(struct task_struct *p)
+{
+ autogroup_move_group(p, oom_ag);
+}
+#else
+void __init oom_init(void)
+{
+}
+
+static void oom_move_oom_ag(struct task_struct *p)
+{
+}
+#endif
+
#ifdef CONFIG_NUMA
/**
* has_intersects_mems_allowed() - check task eligiblity for kill
@@ -432,6 +454,7 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
}
set_tsk_thread_flag(p, TIF_MEMDIE);
+ oom_move_oom_ag(p);
force_sig(SIGKILL, p);
return 0;
--
1.6.5.2
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 3/5] oom: create oom autogroup
2011-03-22 11:08 ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
@ 2011-03-22 23:21 ` Minchan Kim
2011-03-23 1:27 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Minchan Kim @ 2011-03-22 23:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki, Mike Galbraith
On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> When plenty processes (eg fork bomb) are running, the TIF_MEMDIE task
> never exit, at least, human feel it's never. therefore kernel become
> hang-up.
>
> "perf sched" tell us a hint.
>
> ------------------------------------------------------------------------------
> Task | Runtime ms | Average delay ms | Maximum delay ms |
> ------------------------------------------------------------------------------
> python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> ...
> ...
>
> Processes flood makes crazy scheduler delay. and then the victim process
> can't run enough. Grr. Should we do?
>
> Fortunately, we already have anti process flood framework, autogroup!
> This patch reuse this framework and avoid kernel live lock.
That's cool idea but I have a concern.
You remove boosting priority in [2/5] and move victim tasks into autogroup.
If I understand autogroup right, victim process and threads in the
process take less schedule chance than now.
Could it make unnecessary killing of other tasks?
I am not sure. Just out of curiosity.
Thanks for nice work, Kosaki.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 3/5] oom: create oom autogroup
2011-03-22 23:21 ` Minchan Kim
@ 2011-03-23 1:27 ` KOSAKI Motohiro
2011-03-23 2:41 ` Mike Galbraith
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23 1:27 UTC (permalink / raw)
To: Minchan Kim
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki, Mike Galbraith,
Luis Claudio R. Goncalves
> On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > When plenty processes (eg fork bomb) are running, the TIF_MEMDIE task
> > never exit, at least, human feel it's never. therefore kernel become
> > hang-up.
> >
> > "perf sched" tell us a hint.
> >
> > > ------------------------------------------------------------------------------
> > > Task | Runtime ms | Average delay ms | Maximum delay ms |
> > > ------------------------------------------------------------------------------
> > > python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> > > python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> > > python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> > > python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> > > ...
> > > ...
> >
> > Processes flood makes crazy scheduler delay. and then the victim process
> > can't run enough. Grr. Should we do?
> >
> > Fortunately, we already have anti process flood framework, autogroup!
> > This patch reuse this framework and avoid kernel live lock.
>
> That's cool idea but I have a concern.
>
> You remove boosting priority in [2/5] and move victim tasks into autogroup.
> If I understand autogroup right, victim process and threads in the
> process take less schedule chance than now.
Right. The icky cpu-cgroup rt-runtime default forced me to seek another solution.
Today I got a private mail from Luis, and he seems to have another improvement
idea, so I might drop this patch if his works fine.
> Could it make unnecessary killing of other tasks?
> I am not sure. Just out of curiosity.
If you are talking about OOM serialization, it isn't affected. I don't change
the OOM serialization logic, at least for now.
If you are talking about scheduler fairness, both the current code and this
patch introduce scheduler unfairness. But that's OK: 1) in an OOM situation,
scheduling fairness has already been broken by the heavy memory reclaim effort,
and 2) autogroup is meant to change scheduling grouping *automatically*, so the
patch merely changes it again to get out of memory starvation.
>
> Thanks for nice work, Kosaki.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 3/5] oom: create oom autogroup
2011-03-23 1:27 ` KOSAKI Motohiro
@ 2011-03-23 2:41 ` Mike Galbraith
0 siblings, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2011-03-23 2:41 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Minchan Kim, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves
On Wed, 2011-03-23 at 10:27 +0900, KOSAKI Motohiro wrote:
> > On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > When plenty processes (eg fork bomb) are running, the TIF_MEMDIE task
> > > never exit, at least, human feel it's never. therefore kernel become
> > > hang-up.
> > >
> > > "perf sched" tell us a hint.
> > >
> > > ------------------------------------------------------------------------------
> > > Task | Runtime ms | Average delay ms | Maximum delay ms |
> > > ------------------------------------------------------------------------------
> > > python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> > > python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> > > python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> > > python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> > > ...
> > > ...
> > >
> > > Processes flood makes crazy scheduler delay. and then the victim process
> > > can't run enough. Grr. Should we do?
> > >
> > > Fortunately, we already have anti process flood framework, autogroup!
> > > This patch reuse this framework and avoid kernel live lock.
> >
> > That's cool idea but I have a concern.
> >
> > You remove boosting priority in [2/5] and move victim tasks into autogroup.
> > If I understand autogroup right, victim process and threads in the
> > process take less schedule chance than now.
>
> Right. Icky cpu-cgroup rt-runtime default enforce me to seek another solution.
I was going to mention rt, and there's s/fork/clone as well.
> Today, I got private mail from Luis and he seems to have another improvement
> idea. so, I might drop this patch if his one works fine.
Perhaps if TIF_MEMDIE threads need special treatment, preemption tests
could take that into account? (though I don't like touching the fast path
for oddball cases)
-Mike
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 4/5] mm: introduce wait_on_page_locked_killable
2011-03-22 11:04 ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
` (2 preceding siblings ...)
2011-03-22 11:08 ` [PATCH 3/5] oom: create oom autogroup KOSAKI Motohiro
@ 2011-03-22 11:08 ` KOSAKI Motohiro
2011-03-23 7:44 ` KAMEZAWA Hiroyuki
2011-03-24 15:04 ` Minchan Kim
2011-03-22 11:09 ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
4 siblings, 2 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:08 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
commit 2687a356 (Add lock_page_killable) introduced a killable
lock_page(). Similarly, this patch introduces a killable
wait_on_page_locked().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/pagemap.h | 9 +++++++++
mm/filemap.c | 11 +++++++++++
2 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e407601..49f9315 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -369,6 +369,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
+extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+
+static inline int wait_on_page_locked_killable(struct page *page)
+{
+ if (PageLocked(page))
+ return wait_on_page_bit_killable(page, PG_locked);
+ return 0;
+}
+
/*
* Wait for a page to be unlocked.
*
diff --git a/mm/filemap.c b/mm/filemap.c
index a6cfecf..f5f9ac2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -608,6 +608,17 @@ void wait_on_page_bit(struct page *page, int bit_nr)
}
EXPORT_SYMBOL(wait_on_page_bit);
+int wait_on_page_bit_killable(struct page *page, int bit_nr)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+ if (!test_bit(bit_nr, &page->flags))
+ return 0;
+
+ return __wait_on_bit(page_waitqueue(page), &wait, sync_page_killable,
+ TASK_KILLABLE);
+}
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
--
1.6.5.2
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 4/5] mm: introduce wait_on_page_locked_killable
2011-03-22 11:08 ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
@ 2011-03-23 7:44 ` KAMEZAWA Hiroyuki
2011-03-24 15:04 ` Minchan Kim
1 sibling, 0 replies; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23 7:44 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins
On Tue, 22 Mar 2011 20:08:41 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> commit 2687a356 (Add lock_page_killable) introduced killable
> lock_page(). Similarly this patch introdues killable
> wait_on_page_locked().
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 4/5] mm: introduce wait_on_page_locked_killable
2011-03-22 11:08 ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
2011-03-23 7:44 ` KAMEZAWA Hiroyuki
@ 2011-03-24 15:04 ` Minchan Kim
1 sibling, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 15:04 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
On Tue, Mar 22, 2011 at 08:08:41PM +0900, KOSAKI Motohiro wrote:
> commit 2687a356 (Add lock_page_killable) introduced killable
> lock_page(). Similarly this patch introdues killable
> wait_on_page_locked().
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 5/5] x86,mm: make pagefault killable
2011-03-22 11:04 ` [patch 0/5] oom: a few anti fork bomb patches KOSAKI Motohiro
` (3 preceding siblings ...)
2011-03-22 11:08 ` [PATCH 4/5] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
@ 2011-03-22 11:09 ` KOSAKI Motohiro
2011-03-23 7:49 ` KAMEZAWA Hiroyuki
` (2 more replies)
4 siblings, 3 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-22 11:09 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
When the OOM killer triggers, almost all processes get stuck at one of the
following two points:
1) __alloc_pages_nodemask
2) __lock_page_or_retry
1) is not much of a problem because TIF_MEMDIE makes the allocation fail and
the task gets out of the page allocator. 2) is more problematic. In an OOM
situation, zones typically have no page cache at all, and memory starvation
can reduce IO performance significantly. When a fork bomb occurs, the
TIF_MEMDIE task doesn't die quickly, which means the fork bomb may create
new processes faster than the oom-killer can kill them. Then the system may
livelock.
This patch makes pagefault interruptible by SIGKILL.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
arch/x86/mm/fault.c | 9 +++++++++
include/linux/mm.h | 1 +
mm/filemap.c | 22 +++++++++++++++++-----
3 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..797c7d0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_KILLABLE;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1138,6 +1139,14 @@ good_area:
}
/*
+ * Pagefault was interrupted by SIGKILL. We have no reason to
+ * continue pagefault.
+ */
+ if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
+ fatal_signal_pending(current))
+ return;
+
+ /*
* Major/minor page fault accounting is only done on the
* initial attempt. If we go through a retry, it is extremely
* likely that the page will be found in page cache at that point.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0716517..9e7c567 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -152,6 +152,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
+#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/mm/filemap.c b/mm/filemap.c
index f5f9ac2..affba94 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -719,15 +719,27 @@ void __lock_page_nosync(struct page *page)
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
- if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
- __lock_page(page);
- return 1;
- } else {
+ int ret;
+
+ if (flags & FAULT_FLAG_ALLOW_RETRY) {
if (!(flags & FAULT_FLAG_RETRY_NOWAIT)) {
up_read(&mm->mmap_sem);
- wait_on_page_locked(page);
+ if (flags & FAULT_FLAG_KILLABLE)
+ wait_on_page_locked_killable(page);
+ else
+ wait_on_page_locked(page);
}
return 0;
+ } else {
+ if (flags & FAULT_FLAG_KILLABLE) {
+ ret = __lock_page_killable(page);
+ if (ret) {
+ up_read(&mm->mmap_sem);
+ return 0;
+ }
+ } else
+ __lock_page(page);
+ return 1;
}
}
--
1.6.5.2
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-22 11:09 ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
@ 2011-03-23 7:49 ` KAMEZAWA Hiroyuki
2011-03-23 8:09 ` KOSAKI Motohiro
2011-03-24 15:10 ` Minchan Kim
2011-03-24 17:13 ` Oleg Nesterov
2 siblings, 1 reply; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-23 7:49 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins
On Tue, 22 Mar 2011 20:09:29 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> When oom killer occured, almost processes are getting stuck following
> two points.
>
> 1) __alloc_pages_nodemask
> 2) __lock_page_or_retry
>
> 1) is not much problematic because TIF_MEMDIE lead to make allocation
> failure and get out from page allocator. 2) is more problematic. When
> OOM situation, Zones typically don't have page cache at all and Memory
> starvation might lead to reduce IO performance largely. When fork bomb
> occur, TIF_MEMDIE task don't die quickly mean fork bomb may create
> new process quickly rather than oom-killer kill it. Then, the system
> may become livelock.
>
> This patch makes pagefault interruptible by SIGKILL.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> arch/x86/mm/fault.c | 9 +++++++++
> include/linux/mm.h | 1 +
> mm/filemap.c | 22 +++++++++++++++++-----
> 3 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 20e3f87..797c7d0 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> if (user_mode_vm(regs)) {
> local_irq_enable();
> error_code |= PF_USER;
> + flags |= FAULT_FLAG_KILLABLE;
> } else {
> if (regs->flags & X86_EFLAGS_IF)
> local_irq_enable();
> @@ -1138,6 +1139,14 @@ good_area:
> }
>
> /*
> + * Pagefault was interrupted by SIGKILL. We have no reason to
> + * continue pagefault.
> + */
> + if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
> + fatal_signal_pending(current))
> + return;
> +
Hmm? up_read(&mm->mmap_sem) ?
Thanks,
-Kame
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-23 7:49 ` KAMEZAWA Hiroyuki
@ 2011-03-23 8:09 ` KOSAKI Motohiro
2011-03-23 14:34 ` Linus Torvalds
0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-23 8:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, linux-kernel, Andrew Morton, David Rientjes,
Linus Torvalds, Rik van Riel, Oleg Nesterov, linux-mm,
Andrey Vagin, Hugh Dickins
> On Tue, 22 Mar 2011 20:09:29 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > When oom killer occured, almost processes are getting stuck following
> > two points.
> >
> > 1) __alloc_pages_nodemask
> > 2) __lock_page_or_retry
> >
> > 1) is not much problematic because TIF_MEMDIE lead to make allocation
> > failure and get out from page allocator. 2) is more problematic. When
> > OOM situation, Zones typically don't have page cache at all and Memory
> > starvation might lead to reduce IO performance largely. When fork bomb
> > occur, TIF_MEMDIE task don't die quickly mean fork bomb may create
> > new process quickly rather than oom-killer kill it. Then, the system
> > may become livelock.
> >
> > This patch makes pagefault interruptible by SIGKILL.
> >
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> > arch/x86/mm/fault.c | 9 +++++++++
> > include/linux/mm.h | 1 +
> > mm/filemap.c | 22 +++++++++++++++++-----
> > 3 files changed, 27 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 20e3f87..797c7d0 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > if (user_mode_vm(regs)) {
> > local_irq_enable();
> > error_code |= PF_USER;
> > + flags |= FAULT_FLAG_KILLABLE;
> > } else {
> > if (regs->flags & X86_EFLAGS_IF)
> > local_irq_enable();
> > @@ -1138,6 +1139,14 @@ good_area:
> > }
> >
> > /*
> > + * Pagefault was interrupted by SIGKILL. We have no reason to
> > + * continue pagefault.
> > + */
> > + if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
> > + fatal_signal_pending(current))
> > + return;
> > +
>
> Hmm? up_read(&mm->mmap_sem) ?
When __lock_page_or_retry() returns 0, it calls up_read(mmap_sem) within that
function itself.
I agree this is strange (or ugly), but I don't want to change this convention
at this time.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-23 8:09 ` KOSAKI Motohiro
@ 2011-03-23 14:34 ` Linus Torvalds
0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2011-03-23 14:34 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, David Rientjes,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins
On Wed, Mar 23, 2011 at 1:09 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> When __lock_page_or_retry() return 0, It call up_read(mmap_sem) in this
> function.
Indeed.
> I agree this is strange (or ugly). but I don't want change this spec in
> this time.
I agree that it is strange, and I don't like functions that touch
locks that they didn't take themselves, but since the original point
of the whole thing was to wait for the page without holding the
mmap_sem lock, that function has to do the up_read() early.
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-22 11:09 ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
2011-03-23 7:49 ` KAMEZAWA Hiroyuki
@ 2011-03-24 15:10 ` Minchan Kim
2011-03-24 17:13 ` Oleg Nesterov
2 siblings, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2011-03-24 15:10 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, Oleg Nesterov, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
On Tue, Mar 22, 2011 at 08:09:29PM +0900, KOSAKI Motohiro wrote:
> When the OOM killer triggers, most processes get stuck at one of the
> following two points.
>
> 1) __alloc_pages_nodemask
> 2) __lock_page_or_retry
>
> 1) is not much of a problem because TIF_MEMDIE makes the allocation
> fail and the task gets out of the page allocator. 2) is more
> problematic. In an OOM situation, zones typically have no page cache
> at all, and memory starvation can reduce IO performance significantly.
> If a fork bomb occurs and the TIF_MEMDIE task doesn't die quickly,
> the fork bomb may create new processes faster than the oom-killer can
> kill them. Then the system may become livelocked.
>
> This patch makes the page fault interruptible by SIGKILL.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Looks like a cool idea.
--
Kind regards,
Minchan Kim
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-22 11:09 ` [PATCH 5/5] x86,mm: make pagefault killable KOSAKI Motohiro
2011-03-23 7:49 ` KAMEZAWA Hiroyuki
2011-03-24 15:10 ` Minchan Kim
@ 2011-03-24 17:13 ` Oleg Nesterov
2011-03-24 17:34 ` Linus Torvalds
2 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2011-03-24 17:13 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, Andrew Morton, David Rientjes, Linus Torvalds,
Rik van Riel, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
On 03/22, KOSAKI Motohiro wrote:
>
> This patch makes the page fault interruptible by SIGKILL.
Not a comment, but the question...
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> if (user_mode_vm(regs)) {
> local_irq_enable();
> error_code |= PF_USER;
> + flags |= FAULT_FLAG_KILLABLE;
OK, this is clear.
I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
but check PF_USER when we get VM_FAULT_RETRY? I mean,
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
if (!(error_code & PF_USER))
no_context(...);
return;
}
Probably not... but I can't find any example of an in-kernel fault which
can be broken by -EFAULT if current was killed.
mm_release()->put_user(clear_child_tid) should be fine...
Just curious, I feel I missed something obvious.
Oleg.
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-24 17:13 ` Oleg Nesterov
@ 2011-03-24 17:34 ` Linus Torvalds
2011-03-28 7:00 ` KOSAKI Motohiro
0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2011-03-24 17:34 UTC (permalink / raw)
To: Oleg Nesterov
Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, David Rientjes,
Rik van Riel, linux-mm, Andrey Vagin, Hugh Dickins,
KAMEZAWA Hiroyuki
On Thu, Mar 24, 2011 at 10:13 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
> but check PF_USER when we get VM_FAULT_RETRY? I mean,
>
> if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> if (!(error_code & PF_USER))
> no_context(...);
> return;
> }
I agree, we should do this.
> Probably not... but I can't find any example of in-kernel fault which
> can be broken by -EFAULT if current was killed.
There's no way that can validly break anything, since any such
codepath has to be able to handle -EFAULT for other reasons anyway.
The only issue is whether we're ok with a regular write() system call
(for example) not being atomic in the presence of a fatal signal. So
it does change semantics, but I think it changes it in a good way
(technically POSIX requires atomicity, but on the other hand,
technically POSIX also doesn't talk about the process being killed,
and writes would still be atomic for the case where they actually
return. Not to mention NFS etc where writes have never been atomic
anyway, so a program that relies on strict "all or nothing" write
behavior is fundamentally broken to begin with).
Linus
* Re: [PATCH 5/5] x86,mm: make pagefault killable
2011-03-24 17:34 ` Linus Torvalds
@ 2011-03-28 7:00 ` KOSAKI Motohiro
0 siblings, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2011-03-28 7:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: kosaki.motohiro, Oleg Nesterov, linux-kernel, Andrew Morton,
David Rientjes, Rik van Riel, linux-mm, Andrey Vagin,
Hugh Dickins, KAMEZAWA Hiroyuki
> On Thu, Mar 24, 2011 at 10:13 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
> > but check PF_USER when we get VM_FAULT_RETRY? I mean,
> >
> > if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> > if (!(error_code & PF_USER))
> > no_context(...);
> > return;
> > }
>
> I agree, we should do this.
>
> > Probably not... but I can't find any example of in-kernel fault which
> > can be broken by -EFAULT if current was killed.
>
> There's no way that can validly break anything, since any such
> codepath has to be able to handle -EFAULT for other reasons anyway.
>
> The only issue is whether we're ok with a regular write() system call
> (for example) not being atomic in the presence of a fatal signal. So
> it does change semantics, but I think it changes it in a good way
> (technically POSIX requires atomicity, but on the other hand,
> technically POSIX also doesn't talk about the process being killed,
> and writes would still be atomic for the case where they actually
> return. Not to mention NFS etc where writes have never been atomic
> anyway, so a program that relies on strict "all or nothing" write
> behavior is fundamentally broken to begin with).
Ok, I wasn't brave enough. Will do.