* [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion @ 2012-12-19 23:17 Zlatko Calusic 2012-12-19 23:25 ` Zlatko Calusic 2012-12-20 11:12 ` Mel Gorman 0 siblings, 2 replies; 16+ messages in thread From: Zlatko Calusic @ 2012-12-19 23:17 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins Cc: linux-mm, Linux Kernel Mailing List On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone, the Normal zone gets fragmented in time. This requires relatively more pressure in balance_pgdat to get the zone above the required watermark. Unfortunately, the congestion_wait() call in there slows it down for a completely wrong reason, expecting that there's a lot of writeback/swapout, even when there's none (much more common). After a few days, when fragmentation progresses, this flawed logic translates to a very high CPU iowait times, even though there's no I/O congestion at all. If THP is enabled, the problem occurs sooner, but I was able to see it even on !THP kernels, just by giving it a bit more time to occur. The proper way to deal with this is to not wait, unless there's congestion. Thanks to Mel Gorman, we already have the function that perfectly fits the job. The patch was tested on a machine which nicely revealed the problem after only 1 day of uptime, and it's been working great. --- mm/vmscan.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b7ed376..4588d1d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2546,7 +2546,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { - int all_zones_ok; + struct zone *unbalanced_zone; unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ @@ -2580,7 +2580,7 @@ loop_again: unsigned long lru_pages = 0; int has_under_min_watermark_zone = 0; - all_zones_ok = 1; + unbalanced_zone = NULL; balanced = 0; /* @@ -2719,7 +2719,7 @@ loop_again: } if (!zone_balanced(zone, testorder, 0, end_zone)) { - all_zones_ok = 0; + unbalanced_zone = zone; /* * We are still under min water mark. This * means that we have a GFP_ATOMIC allocation @@ -2752,7 +2752,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. Take a nap, then take @@ -2762,7 +2762,7 @@ loop_again: if (has_under_min_watermark_zone) count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); else - congestion_wait(BLK_RW_ASYNC, HZ/10); + wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); } /* @@ -2781,7 +2781,7 @@ out: * high-order: Balanced zones must make up at least 25% of the node * for the node to be balanced */ - if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { + if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { cond_resched(); try_to_freeze(); -- 1.7.10.4 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
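The gist of the fix can be modeled outside the kernel: congestion_wait() naps unconditionally, while wait_iff_congested() naps only when the backing device is actually congested. A minimal userspace sketch, counting naps instead of sleeping (names here are illustrative, not the kernel functions themselves):

```c
#include <stdbool.h>

/* Model of the old back-off: kswapd takes a HZ/10 nap on every loop
 * iteration, whether or not there is any I/O congestion. */
static int naps_congestion_wait(bool congested, int iterations)
{
    (void)congested;            /* congestion is never consulted */
    return iterations;          /* one unconditional nap per iteration */
}

/* Model of the new back-off: a nap is taken only if the zone's backing
 * device is really congested; otherwise kswapd keeps working. */
static int naps_wait_iff_congested(bool congested, int iterations)
{
    return congested ? iterations : 0;
}
```

With no congestion (the common case on the reporter's machine), the old scheme accumulates iowait for nothing; the new one does not.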
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-19 23:17 [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion Zlatko Calusic @ 2012-12-19 23:25 ` Zlatko Calusic 2012-12-21 11:51 ` Hillf Danton 2012-12-20 11:12 ` Mel Gorman 1 sibling, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-19 23:25 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins Cc: linux-mm, Linux Kernel Mailing List On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone, the Normal zone gets fragmented in time. This requires relatively more pressure in balance_pgdat to get the zone above the required watermark. Unfortunately, the congestion_wait() call in there slows it down for a completely wrong reason, expecting that there's a lot of writeback/swapout, even when there's none (much more common). After a few days, when fragmentation progresses, this flawed logic translates to a very high CPU iowait times, even though there's no I/O congestion at all. If THP is enabled, the problem occurs sooner, but I was able to see it even on !THP kernels, just by giving it a bit more time to occur. The proper way to deal with this is to not wait, unless there's congestion. Thanks to Mel Gorman, we already have the function that perfectly fits the job. The patch was tested on a machine which nicely revealed the problem after only 1 day of uptime, and it's been working great. 
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b7ed376..4588d1d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2546,7 +2546,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { - int all_zones_ok; + struct zone *unbalanced_zone; unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ @@ -2580,7 +2580,7 @@ loop_again: unsigned long lru_pages = 0; int has_under_min_watermark_zone = 0; - all_zones_ok = 1; + unbalanced_zone = NULL; balanced = 0; /* @@ -2719,7 +2719,7 @@ loop_again: } if (!zone_balanced(zone, testorder, 0, end_zone)) { - all_zones_ok = 0; + unbalanced_zone = zone; /* * We are still under min water mark. This * means that we have a GFP_ATOMIC allocation @@ -2752,7 +2752,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. 
Take a nap, then take @@ -2762,7 +2762,7 @@ loop_again: if (has_under_min_watermark_zone) count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); else - congestion_wait(BLK_RW_ASYNC, HZ/10); + wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); } /* @@ -2781,7 +2781,7 @@ out: * high-order: Balanced zones must make up at least 25% of the node * for the node to be balanced */ - if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { + if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { cond_resched(); try_to_freeze(); -- 1.7.10.4 -- Zlatko (this time with proper Signed-off-by line)
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-19 23:25 ` Zlatko Calusic @ 2012-12-21 11:51 ` Hillf Danton 2012-12-27 15:42 ` Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Hillf Danton @ 2012-12-21 11:51 UTC (permalink / raw) To: Zlatko Calusic Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: > static unsigned long balance_pgdat(pg_data_t *pgdat, int order, > int *classzone_idx) > { > - int all_zones_ok; > + struct zone *unbalanced_zone; nit: less hunks if not erase that mark Hillf > unsigned long balanced; > int i; > int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ > @@ -2580,7 +2580,7 @@ loop_again: > unsigned long lru_pages = 0; > int has_under_min_watermark_zone = 0; > > - all_zones_ok = 1; > + unbalanced_zone = NULL; > balanced = 0; > > /* > @@ -2719,7 +2719,7 @@ loop_again: > } > > if (!zone_balanced(zone, testorder, 0, end_zone)) { > - all_zones_ok = 0; > + unbalanced_zone = zone; > /* > * We are still under min water mark. This > * means that we have a GFP_ATOMIC allocation > @@ -2752,7 +2752,7 @@ loop_again: > pfmemalloc_watermark_ok(pgdat)) > wake_up(&pgdat->pfmemalloc_wait); > > - if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) > + if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) > break; /* kswapd: all done */ > /* > * OK, kswapd is getting into trouble. 
Take a nap, then take > @@ -2762,7 +2762,7 @@ loop_again: > if (has_under_min_watermark_zone) > count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); > else > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); > } > > /* > @@ -2781,7 +2781,7 @@ out: > * high-order: Balanced zones must make up at least 25% of the node > * for the node to be balanced > */ > - if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { > + if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { > cond_resched(); > > try_to_freeze(); > -- 1.7.10.4
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-21 11:51 ` Hillf Danton @ 2012-12-27 15:42 ` Zlatko Calusic 2012-12-29 7:25 ` Hillf Danton 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-27 15:42 UTC (permalink / raw) To: Hillf Danton Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 21.12.2012 12:51, Hillf Danton wrote: > On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: >> static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >> int *classzone_idx) >> { >> - int all_zones_ok; >> + struct zone *unbalanced_zone; > > nit: less hunks if not erase that mark > > Hillf This one left unanswered and forgotten because I didn't understand what you meant. Could you elaborate? -- Zlatko
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-27 15:42 ` Zlatko Calusic @ 2012-12-29 7:25 ` Hillf Danton 2012-12-29 12:11 ` Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Hillf Danton @ 2012-12-29 7:25 UTC (permalink / raw) To: Zlatko Calusic Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, Dec 27, 2012 at 11:42 PM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: > On 21.12.2012 12:51, Hillf Danton wrote: >> >> On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> >> wrote: >>> >>> static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >>> int >>> *classzone_idx) >>> { >>> - int all_zones_ok; >>> + struct zone *unbalanced_zone; >> >> >> nit: less hunks if not erase that mark >> >> Hillf > > > This one left unanswered and forgotten because I didn't understand what you > meant. Could you elaborate? > Sure, the patch looks simpler (and nicer) if we don't erase all_zones_ok. Hillf
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-29 7:25 ` Hillf Danton @ 2012-12-29 12:11 ` Zlatko Calusic 0 siblings, 0 replies; 16+ messages in thread From: Zlatko Calusic @ 2012-12-29 12:11 UTC (permalink / raw) To: Hillf Danton Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 29.12.2012 08:25, Hillf Danton wrote: > On Thu, Dec 27, 2012 at 11:42 PM, Zlatko Calusic > <zlatko.calusic@iskon.hr> wrote: >> On 21.12.2012 12:51, Hillf Danton wrote: >>> >>> On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> >>> wrote: >>>> >>>> static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >>>> int >>>> *classzone_idx) >>>> { >>>> - int all_zones_ok; >>>> + struct zone *unbalanced_zone; >>> >>> >>> nit: less hunks if not erase that mark >>> >>> Hillf >> >> >> This one left unanswered and forgotten because I didn't understand what you >> meant. Could you elaborate? >> > Sure, the patch looks simpler(and nicer) if we dont > erase all_zones_ok. > Ah, yes. I gave it a good thought. But, when I introduced unbalanced_zone it just didn't make much sense to me to have two variables with very similar meaning. If I decided to keep all_zones_ok, it would be either: all_zones_ok = true unbalanced_zone = NULL (meaning: if no zone in unbalanced, then all zones must be ok) or all_zones_ok = false unbalanced_zone = struct zone * (meaning: if there's an unbalanced zone, then certainly not all zones are ok) So I decided to use only unbalanced_zone (because I had to!), and remove all_zones_ok to avoid redundancy. I hope it makes sense. If you check my latest (and still queued) optimization (mm: avoid calling pgdat_balanced() needlessly), there again popped up a need for a boolean, but I called it pgdat_is_balanced this time, just to match the name of two other functions. It could've also been called all_zones_ok if you prefer the name? 
Of course, I have no strong feelings about the name, both are OK, so if you want me to redo the patch, just say. Generally speaking, while I always attempt to make a smaller patch (less hunks and less changes = easier to review), before that I'll always try to make the code that results from the commit cleaner, simpler, more readable. For example, I'll always check that I don't mess with whitespace needlessly, unless I think it's actually desirable, here's just one example: "mm: avoid calling pgdat_balanced() needlessly" changes --- } while (--sc.priority >= 0); out: if (!pgdat_balanced(pgdat, order, *classzone_idx)) { --- to --- } while (--sc.priority >= 0); out: if (!pgdat_is_balanced) { --- because I find the latter more correct place for the label "out". Thanks for the comment. -- Zlatko
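The redundancy described above can be stated as a single invariant: all_zones_ok is true exactly when no unbalanced zone was recorded, so the boolean is fully derivable from the pointer. A stub sketch (the struct zone here is a mock stand-in, not the kernel's type):

```c
#include <stdbool.h>
#include <stddef.h>

struct zone { int id; };        /* mock stand-in for the kernel's struct zone */

/* If no zone was recorded as unbalanced, all zones must be OK, and
 * vice versa - keeping both variables would therefore be redundant. */
static bool all_zones_ok(const struct zone *unbalanced_zone)
{
    return unbalanced_zone == NULL;
}
```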
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-19 23:17 [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion Zlatko Calusic 2012-12-19 23:25 ` Zlatko Calusic @ 2012-12-20 11:12 ` Mel Gorman 2012-12-20 20:58 ` Andrew Morton 1 sibling, 1 reply; 16+ messages in thread From: Mel Gorman @ 2012-12-20 11:12 UTC (permalink / raw) To: Zlatko Calusic Cc: Linus Torvalds, Andrew Morton, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, Dec 20, 2012 at 12:17:07AM +0100, Zlatko Calusic wrote: > On a 4GB RAM machine, where Normal zone is much smaller than > DMA32 zone, the Normal zone gets fragmented in time. This requires > relatively more pressure in balance_pgdat to get the zone above the > required watermark. Unfortunately, the congestion_wait() call in there > slows it down for a completely wrong reason, expecting that there's > a lot of writeback/swapout, even when there's none (much more common). > After a few days, when fragmentation progresses, this flawed logic > translates to a very high CPU iowait times, even though there's no > I/O congestion at all. If THP is enabled, the problem occurs sooner, > but I was able to see it even on !THP kernels, just by giving it a bit > more time to occur. > > The proper way to deal with this is to not wait, unless there's > congestion. Thanks to Mel Gorman, we already have the function that > perfectly fits the job. The patch was tested on a machine which > nicely revealed the problem after only 1 day of uptime, and it's been > working great. > --- > mm/vmscan.c | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-20 11:12 ` Mel Gorman @ 2012-12-20 20:58 ` Andrew Morton 2012-12-22 18:54 ` [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Andrew Morton @ 2012-12-20 20:58 UTC (permalink / raw) To: Mel Gorman Cc: Zlatko Calusic, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, 20 Dec 2012 11:12:08 +0000 Mel Gorman <mgorman@suse.de> wrote: > On Thu, Dec 20, 2012 at 12:17:07AM +0100, Zlatko Calusic wrote: > > On a 4GB RAM machine, where Normal zone is much smaller than > > DMA32 zone, the Normal zone gets fragmented in time. This requires > > relatively more pressure in balance_pgdat to get the zone above the > > required watermark. Unfortunately, the congestion_wait() call in there > > slows it down for a completely wrong reason, expecting that there's > > a lot of writeback/swapout, even when there's none (much more common). > > After a few days, when fragmentation progresses, this flawed logic > > translates to a very high CPU iowait times, even though there's no > > I/O congestion at all. If THP is enabled, the problem occurs sooner, > > but I was able to see it even on !THP kernels, just by giving it a bit > > more time to occur. > > > > The proper way to deal with this is to not wait, unless there's > > congestion. Thanks to Mel Gorman, we already have the function that > > perfectly fits the job. The patch was tested on a machine which > > nicely revealed the problem after only 1 day of uptime, and it's been > > working great. > > --- > > mm/vmscan.c | 12 ++++++------ > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > Acked-by: Mel Gorman <mgorman@suse.de There seems to be some complexity/duplication here between the new unbalanced_zone() and pgdat_balanced(). 
Can we modify pgdat_balanced() so that it also handles order=0, then do - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (!pgdat_balanced(...)) ?
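Whether such a unified check could preserve behaviour can be sanity-checked with a toy model (equal-sized zones, a bitmask of balanced zones, the 25% rule approximated via a popcount; all names hypothetical): the old break test is all_ok || (order && quarter), the unified one is order ? quarter : all_ok, and they agree because a fully balanced node always passes the 25% rule.

```c
#include <stdbool.h>

/* Old kswapd break condition in the model: every zone balanced, or
 * (for high-order) balanced zones cover at least a quarter of the node. */
static bool old_break(unsigned mask, int nzones, int order)
{
    bool all_ok = mask == (1u << nzones) - 1u;
    bool quarter = 4 * __builtin_popcount(mask) >= nzones;
    return all_ok || (order && quarter);
}

/* Unified pgdat_balanced()-style condition: order-0 requires all zones
 * balanced, high-order requires the 25% rule. */
static bool new_break(unsigned mask, int nzones, int order)
{
    bool all_ok = mask == (1u << nzones) - 1u;
    bool quarter = 4 * __builtin_popcount(mask) >= nzones;
    return order ? quarter : all_ok;
}
```

This is only a model of the predicate logic, not of the full scan; in the real function the unbalanced-zone scan and pgdat_balanced() look at live watermarks.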
* [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 2012-12-20 20:58 ` Andrew Morton @ 2012-12-22 18:54 ` Zlatko Calusic 2012-12-23 14:12 ` [PATCH v2] " Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-22 18:54 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 20.12.2012 21:58, Andrew Morton wrote: > There seems to be some complexity/duplication here between the new > unbalanced_zone() and pgdat_balanced(). > > Can we modify pgdat_balanced() so that it also handles order=0, then do > > - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) > + if (!pgdat_balanced(...)) > > ? > Makes sense, I like the idea! Took me some time to wrap my mind around all the logic in balance_pgdat(), while writing my previous patch. Also had to revert one if-case logic to avoid double negation, which would be even harder to grok. But unbalanced_zone (var, not a function!) has to stay because wait_iff_congested() needs a struct zone* param. Here's my take on the subject: ---8<--- mm: modify pgdat_balanced() so that it also handles order=0 Teach pgdat_balanced() about order-0 allocations so that we can simplify code in a few places in vmscan.c. Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 101 +++++++++++++++++++++++++++--------------------------- 1 file changed, 45 insertions(+), 56 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index adc7e90..0d15d99 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2452,12 +2452,16 @@ static bool zone_balanced(struct zone *zone, int order, } /* - * pgdat_balanced is used when checking if a node is balanced for high-order - * allocations. Only zones that meet watermarks and are in a zone allowed - * by the callers classzone_idx are added to balanced_pages.
The total of - * balanced pages must be at least 25% of the zones allowed by classzone_idx - * for the node to be considered balanced. Forcing all zones to be balanced - * for high orders can cause excessive reclaim when there are imbalanced zones. + * pgdat_balanced() is used when checking if a node is balanced. + * + * For order-0, all zones must be balanced! + * + * For high-order allocations only zones that meet watermarks and are in a + * zone allowed by the callers classzone_idx are added to balanced_pages. The + * total of balanced pages must be at least 25% of the zones allowed by + * classzone_idx for the node to be considered balanced. Forcing all zones to + * be balanced for high orders can cause excessive reclaim when there are + * imbalanced zones. * The choice of 25% is due to * o a 16M DMA zone that is balanced will not balance a zone on any * reasonable sized machine @@ -2467,17 +2471,43 @@ static bool zone_balanced(struct zone *zone, int order, * Similarly, on x86-64 the Normal zone would need to be at least 1G * to balance a node on its own. These seemed like reasonable ratios. */ -static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages, - int classzone_idx) +static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) { unsigned long present_pages = 0; + unsigned long balanced_pages = 0; int i; - for (i = 0; i <= classzone_idx; i++) - present_pages += pgdat->node_zones[i].present_pages; + /* Check the watermark levels */ + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; + + if (!populated_zone(zone)) + continue; + + present_pages += zone->present_pages; + + /* + * A special case here: + * + * balance_pgdat() skips over all_unreclaimable after + * DEF_PRIORITY. Effectively, it considers them balanced so + * they must be considered balanced here as well! 
+ */ + if (zone->all_unreclaimable) { + balanced_pages += zone->present_pages; + continue; + } + + if (zone_balanced(zone, order, 0, i)) + balanced_pages += zone->present_pages; + else if (!order) + return false; + } - /* A special case here: if zone has no page, we think it's balanced */ - return balanced_pages >= (present_pages >> 2); + if (order) + return balanced_pages >= (present_pages >> 2); + else + return true; } /* @@ -2511,39 +2541,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, return false; } - /* Check the watermark levels */ - for (i = 0; i <= classzone_idx; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - /* - * balance_pgdat() skips over all_unreclaimable after - * DEF_PRIORITY. Effectively, it considers them balanced so - * they must be considered balanced here as well if kswapd - * is to sleep - */ - if (zone->all_unreclaimable) { - balanced += zone->present_pages; - continue; - } - - if (!zone_balanced(zone, order, 0, i)) - all_zones_ok = false; - else - balanced += zone->present_pages; - } - - /* - * For high-order requests, the balanced zones must contain at least - * 25% of the nodes pages for kswapd to sleep. For order-0, all zones - * must be balanced - */ - if (order) - return pgdat_balanced(pgdat, balanced, classzone_idx); - else - return all_zones_ok; + return pgdat_balanced(pgdat, order, classzone_idx); } /* @@ -2571,7 +2569,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { struct zone *unbalanced_zone; - unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ unsigned long total_scanned; @@ -2605,7 +2602,6 @@ loop_again: int has_under_min_watermark_zone = 0; unbalanced_zone = NULL; - balanced = 0; /* * Scan in the highmem->dma direction for the highest @@ -2761,8 +2757,6 @@ loop_again: * speculatively avoid congestion waits */ zone_clear_flag(zone, ZONE_CONGESTED); - if (i <= *classzone_idx) - balanced += zone->present_pages; } } @@ -2776,7 +2770,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (pgdat_balanced(pgdat, order, *classzone_idx)) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. Take a nap, then take @@ -2800,12 +2794,7 @@ loop_again: } while (--sc.priority >= 0); out: - /* - * order-0: All zones must meet high watermark for a balanced node - * high-order: Balanced zones must make up at least 25% of the node - * for the node to be balanced - */ - if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { + if (!pgdat_balanced(pgdat, order, *classzone_idx)) { cond_resched(); try_to_freeze(); -- 1.8.1.rc0 -- Zlatko
* [PATCH v2] mm: modify pgdat_balanced() so that it also handles order=0 2012-12-22 18:54 ` [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 Zlatko Calusic @ 2012-12-23 14:12 ` Zlatko Calusic 2012-12-26 15:07 ` [PATCH] mm: avoid calling pgdat_balanced() needlessly Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-23 14:12 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 22.12.2012 19:54, Zlatko Calusic wrote: > On 20.12.2012 21:58, Andrew Morton wrote: >> There seems to be some complexity/duplication here between the new >> unbalanced_zone() and pgdat_balanced(). >> >> Can we modify pgdat_balanced() so that it also handles order=0, then do >> >> - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) >> + if (!pgdat_balanced(...)) >> >> ? >> > > Makes sense, I like the idea! Took me some time to wrap my mind around > all the logic in balance_pgdat(), while writing my previous patch. Also had > to revert one if-case logic to avoid double negation, which would be even > harder to grok. But unbalanced_zone (var, not a function!) has to stay because > wait_iff_congested() needs a struct zone* param. Here's my take on the subject: > And now also with 3 unused variables removed, no other changes. prepare_kswapd_sleep() now looks so beautiful. ;) I've been testing the patch on 3 different machines, and no problem at all. One of those I pushed hard, and it survived. ---8<--- mm: modify pgdat_balanced() so that it also handles order=0 Teach pgdat_balanced() about order-0 allocations so that we can simplify code in a few places in vmscan.c.
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 105 ++++++++++++++++++++++++++---------------------------------- 1 file changed, 45 insertions(+), 60 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index adc7e90..23291b9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2452,12 +2452,16 @@ static bool zone_balanced(struct zone *zone, int order, } /* - * pgdat_balanced is used when checking if a node is balanced for high-order - * allocations. Only zones that meet watermarks and are in a zone allowed - * by the callers classzone_idx are added to balanced_pages. The total of - * balanced pages must be at least 25% of the zones allowed by classzone_idx - * for the node to be considered balanced. Forcing all zones to be balanced - * for high orders can cause excessive reclaim when there are imbalanced zones. + * pgdat_balanced() is used when checking if a node is balanced. + * + * For order-0, all zones must be balanced! + * + * For high-order allocations only zones that meet watermarks and are in a + * zone allowed by the callers classzone_idx are added to balanced_pages. The + * total of balanced pages must be at least 25% of the zones allowed by + * classzone_idx for the node to be considered balanced. Forcing all zones to + * be balanced for high orders can cause excessive reclaim when there are + * imbalanced zones. * The choice of 25% is due to * o a 16M DMA zone that is balanced will not balance a zone on any * reasonable sized machine @@ -2467,17 +2471,43 @@ static bool zone_balanced(struct zone *zone, int order, * Similarly, on x86-64 the Normal zone would need to be at least 1G * to balance a node on its own. These seemed like reasonable ratios. 
*/ -static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages, - int classzone_idx) +static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) { unsigned long present_pages = 0; + unsigned long balanced_pages = 0; int i; - for (i = 0; i <= classzone_idx; i++) - present_pages += pgdat->node_zones[i].present_pages; + /* Check the watermark levels */ + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; - /* A special case here: if zone has no page, we think it's balanced */ - return balanced_pages >= (present_pages >> 2); + if (!populated_zone(zone)) + continue; + + present_pages += zone->present_pages; + + /* + * A special case here: + * + * balance_pgdat() skips over all_unreclaimable after + * DEF_PRIORITY. Effectively, it considers them balanced so + * they must be considered balanced here as well! + */ + if (zone->all_unreclaimable) { + balanced_pages += zone->present_pages; + continue; + } + + if (zone_balanced(zone, order, 0, i)) + balanced_pages += zone->present_pages; + else if (!order) + return false; + } + + if (order) + return balanced_pages >= (present_pages >> 2); + else + return true; } /* @@ -2489,10 +2519,6 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages, static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, int classzone_idx) { - int i; - unsigned long balanced = 0; - bool all_zones_ok = true; - /* If a direct reclaimer woke kswapd within HZ/10, it's premature */ if (remaining) return false; @@ -2511,39 +2537,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, return false; } - /* Check the watermark levels */ - for (i = 0; i <= classzone_idx; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - /* - * balance_pgdat() skips over all_unreclaimable after - * DEF_PRIORITY. 
Effectively, it considers them balanced so - * they must be considered balanced here as well if kswapd - * is to sleep - */ - if (zone->all_unreclaimable) { - balanced += zone->present_pages; - continue; - } - - if (!zone_balanced(zone, order, 0, i)) - all_zones_ok = false; - else - balanced += zone->present_pages; - } - - /* - * For high-order requests, the balanced zones must contain at least - * 25% of the nodes pages for kswapd to sleep. For order-0, all zones - * must be balanced - */ - if (order) - return pgdat_balanced(pgdat, balanced, classzone_idx); - else - return all_zones_ok; + return pgdat_balanced(pgdat, order, classzone_idx); } /* @@ -2571,7 +2565,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { struct zone *unbalanced_zone; - unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ unsigned long total_scanned; @@ -2605,7 +2598,6 @@ loop_again: int has_under_min_watermark_zone = 0; unbalanced_zone = NULL; - balanced = 0; /* * Scan in the highmem->dma direction for the highest @@ -2761,8 +2753,6 @@ loop_again: * speculatively avoid congestion waits */ zone_clear_flag(zone, ZONE_CONGESTED); - if (i <= *classzone_idx) - balanced += zone->present_pages; } } @@ -2776,7 +2766,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (pgdat_balanced(pgdat, order, *classzone_idx)) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. 
Take a nap, then take @@ -2800,12 +2790,7 @@ loop_again: } while (--sc.priority >= 0); out: - /* - * order-0: All zones must meet high watermark for a balanced node - * high-order: Balanced zones must make up at least 25% of the node - * for the node to be balanced - */ - if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { + if (!pgdat_balanced(pgdat, order, *classzone_idx)) { cond_resched(); try_to_freeze(); -- 1.8.1.rc0 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH] mm: avoid calling pgdat_balanced() needlessly 2012-12-23 14:12 ` [PATCH v2] " Zlatko Calusic @ 2012-12-26 15:07 ` Zlatko Calusic 2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-26 15:07 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List Now that balance_pgdat() is slightly tidied up, thanks to more capable pgdat_balanced(), it's become obvious that pgdat_balanced() is called to check the status, then break the loop if pgdat is balanced, just to be immediately called again. The second call is completely unnecessary, of course. The patch introduces pgdat_is_balanced boolean, which helps resolve the above suboptimal behavior, with the added benefit of slightly better documenting one other place in the function where we jump and skip lots of code. Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> --- mm/vmscan.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 23291b9..02bcfa3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { + bool pgdat_is_balanced = false; struct zone *unbalanced_zone; int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ @@ -2638,8 +2639,11 @@ loop_again: zone_clear_flag(zone, ZONE_CONGESTED); } } - if (i < 0) + + if (i < 0) { + pgdat_is_balanced = true; goto out; + } for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; @@ -2766,8 +2770,11 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (pgdat_balanced(pgdat, order, *classzone_idx)) + if (pgdat_balanced(pgdat, order, *classzone_idx)) { + pgdat_is_balanced = true; break; /* kswapd: all done */ + } + /* * OK, kswapd is getting into trouble. Take a nap, then take * another pass across the zones. @@ -2788,9 +2795,9 @@ loop_again: if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; } while (--sc.priority >= 0); -out: - if (!pgdat_balanced(pgdat, order, *classzone_idx)) { +out: + if (!pgdat_is_balanced) { cond_resched(); try_to_freeze(); -- 1.8.1.rc0 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-26 15:07 ` [PATCH] mm: avoid calling pgdat_balanced() needlessly Zlatko Calusic @ 2012-12-28 2:16 ` Zlatko Calusic 2012-12-28 2:49 ` Minchan Kim 2012-12-29 8:45 ` Sedat Dilek 0 siblings, 2 replies; 16+ messages in thread From: Zlatko Calusic @ 2012-12-28 2:16 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek From: Zlatko Calusic <zlatko.calusic@iskon.hr> The unintended consequence of commit 4ae0a48b is that wait_iff_congested() can now be called with NULL struct zone* producing kernel oops like this: BUG: unable to handle kernel NULL pointer dereference IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 This trivial patch fixes it. Reported-by: Zhouping Liu <zliu@redhat.com> Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 02bcfa3..e55ce55 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2782,7 +2782,7 @@ loop_again: if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) { if (has_under_min_watermark_zone) count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); - else + else if (unbalanced_zone) wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); } -- 1.8.1.rc3 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic @ 2012-12-28 2:49 ` Minchan Kim 2012-12-28 13:29 ` Zlatko Calusic 2012-12-29 8:45 ` Sedat Dilek 1 sibling, 1 reply; 16+ messages in thread From: Minchan Kim @ 2012-12-28 2:49 UTC (permalink / raw) To: Zlatko Calusic Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek Hello Zlatko, On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote: > From: Zlatko Calusic <zlatko.calusic@iskon.hr> > > The unintended consequence of commit 4ae0a48b is that > wait_iff_congested() can now be called with NULL struct zone* > producing kernel oops like this: For good description, it would be better to write simple pseudo code flow to show how NULL-zone pass into wait_iff_congested because kswapd code flow is too complex. As I see the code, we have following line above wait_iff_congested. if (!unbalanced_zone || blah blah) break; How can NULL unbalanced_zone reach wait_iff_congested? > > BUG: unable to handle kernel NULL pointer dereference > IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 > > This trivial patch fixes it. 
> > Reported-by: Zhouping Liu <zliu@redhat.com> > Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> > --- > mm/vmscan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 02bcfa3..e55ce55 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2782,7 +2782,7 @@ loop_again: > if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) { > if (has_under_min_watermark_zone) > count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); > - else > + else if (unbalanced_zone) > wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); > } > > -- > 1.8.1.rc3 > > -- > Zlatko > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 2:49 ` Minchan Kim @ 2012-12-28 13:29 ` Zlatko Calusic 2012-12-31 0:50 ` Minchan Kim 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-28 13:29 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek On 28.12.2012 03:49, Minchan Kim wrote: > Hello Zlatko, > > On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote: >> From: Zlatko Calusic <zlatko.calusic@iskon.hr> >> >> The unintended consequence of commit 4ae0a48b is that >> wait_iff_congested() can now be called with NULL struct zone* >> producing kernel oops like this: > > For good description, it would be better to write simple pseudo code > flow to show how NULL-zone pass into wait_iff_congested because > kswapd code flow is too complex. > > As I see the code, we have following line above wait_iff_congested. > > if (!unbalanced_zone || blah blah) > break; > > How can NULL unbalanced_zone reach wait_iff_congested? > Hello Minchan, and thanks for the comment. That line was there before commit 4ae0a48b got in, and you're right, it's what was protecting wait_iff_congested() from being called with NULL zone*. But then all that logic got collapsed to a simple pgdat_balanced() call and that's when I introduced the bug, I lost the protection. What I _think_ is happening (pseudo code following...) is that after scanning the zone in the dma->highmem direction, and concluding that all zones are balanced (unbalanced_zone remains NULL!), wake_up(&pgdat->pfmemalloc_wait) wakes up a lot of memory hungry processes (especially true in various aggressive test/benchmarks) that immediately drain and unbalance one or more zones. Then the pgdat_balanced() call which immediately follows will be false, but we still have unbalanced_zone = NULL, remember? Oops... But, all that is a speculation that I can't prove atm.
Of course, if anybody thinks that's a credible explanation, I could add it as a commit comment, or even as a code comment, but I didn't want to be overly imaginative. The fix itself is simple and real. Regards, -- Zlatko ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 13:29 ` Zlatko Calusic @ 2012-12-31 0:50 ` Minchan Kim 0 siblings, 0 replies; 16+ messages in thread From: Minchan Kim @ 2012-12-31 0:50 UTC (permalink / raw) To: Zlatko Calusic Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek Hi Zlatko, On Fri, Dec 28, 2012 at 02:29:11PM +0100, Zlatko Calusic wrote: > On 28.12.2012 03:49, Minchan Kim wrote: > >Hello Zlatko, > > > >On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote: > >>From: Zlatko Calusic <zlatko.calusic@iskon.hr> > >> > >>The unintended consequence of commit 4ae0a48b is that > >>wait_iff_congested() can now be called with NULL struct zone* > >>producing kernel oops like this: > > > >For good description, it would be better to write simple pseudo code > >flow to show how NULL-zone pass into wait_iff_congested because > >kswapd code flow is too complex. > > > >As I see the code, we have following line above wait_iff_congested. > > > >if (!unbalanced_zone || blah blah) > > break; > > > >How can NULL unbalanced_zone reach wait_iff_congested? > > > > Hello Minchan, and thanks for the comment. > > That line was there before commit 4ae0a48b got in, and you're right, Argh, I didn't see 4ae0a48b in 3.8-rc1. > it's what was protecting wait_iff_congested() from being called with > NULL zone*. But then all that logic got colapsed to a simple > pgdat_balanced() call and that's when I introduced the bug, I lost > the protection. > > What I _think_ is happening (pseudo code following...) is that after > scanning the zone in the dma->highmem direction, and concluding that > all zones are balanced (unbalanced_zone remains NULL!), > wake_up(&pgdat->pfmemalloc_wait) wakes up a lot of memory hungry > processes (especially true in various aggressive test/benchmarks) > that immediately drain and unbalance one or more zones. 
Then > pgdat_balanced() call which immediately follows will be false, but > we still have unbalanced_zone = NULL, remember? Oops... > > But, all that is a speculation that I can't prove atm. Of course, if > anybody thinks that's a credible explanation, I could add it as a > commit comment, or even as a code comment, but I didn't want to be > overly imaginative. The fix itself is simple and real. Never mind. My confusion was caused by missing commit 4ae0a48b in the latest tree. Thanks, Zlatko. > > Regards, > -- > Zlatko -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic 2012-12-28 2:49 ` Minchan Kim @ 2012-12-29 8:45 ` Sedat Dilek 1 sibling, 0 replies; 16+ messages in thread From: Sedat Dilek @ 2012-12-29 8:45 UTC (permalink / raw) To: Zlatko Calusic Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu Just FYI: This patch landed upstream [1]. Thanks for all involved people. - Sedat - [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ecccd1248d6e6986130ffcc3b0d003cb46a485c0 On Fri, Dec 28, 2012 at 3:16 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: > From: Zlatko Calusic <zlatko.calusic@iskon.hr> > > The unintended consequence of commit 4ae0a48b is that > wait_iff_congested() can now be called with NULL struct zone* > producing kernel oops like this: > > BUG: unable to handle kernel NULL pointer dereference > IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 > > This trivial patch fixes it. > > Reported-by: Zhouping Liu <zliu@redhat.com> > Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> > --- > mm/vmscan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 02bcfa3..e55ce55 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2782,7 +2782,7 @@ loop_again: > if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) { > if (has_under_min_watermark_zone) > count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); > - else > + else if (unbalanced_zone) > wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); > } > > -- > 1.8.1.rc3 > > -- > Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2012-12-31 0:50 UTC | newest]

Thread overview: 16+ messages:
2012-12-19 23:17 [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion Zlatko Calusic
2012-12-19 23:25 ` Zlatko Calusic
2012-12-21 11:51 ` Hillf Danton
2012-12-27 15:42 ` Zlatko Calusic
2012-12-29 7:25 ` Hillf Danton
2012-12-29 12:11 ` Zlatko Calusic
2012-12-20 11:12 ` Mel Gorman
2012-12-20 20:58 ` Andrew Morton
2012-12-22 18:54 ` [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 Zlatko Calusic
2012-12-23 14:12 ` [PATCH v2] " Zlatko Calusic
2012-12-26 15:07 ` [PATCH] mm: avoid calling pgdat_balanced() needlessly Zlatko Calusic
2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic
2012-12-28 2:49 ` Minchan Kim
2012-12-28 13:29 ` Zlatko Calusic
2012-12-31 0:50 ` Minchan Kim
2012-12-29 8:45 ` Sedat Dilek