* [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion @ 2012-12-19 23:17 Zlatko Calusic 2012-12-19 23:25 ` Zlatko Calusic 2012-12-20 11:12 ` Mel Gorman 0 siblings, 2 replies; 16+ messages in thread From: Zlatko Calusic @ 2012-12-19 23:17 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins Cc: linux-mm, Linux Kernel Mailing List On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone, the Normal zone gets fragmented in time. This requires relatively more pressure in balance_pgdat to get the zone above the required watermark. Unfortunately, the congestion_wait() call in there slows it down for a completely wrong reason, expecting that there's a lot of writeback/swapout, even when there's none (much more common). After a few days, when fragmentation progresses, this flawed logic translates to a very high CPU iowait times, even though there's no I/O congestion at all. If THP is enabled, the problem occurs sooner, but I was able to see it even on !THP kernels, just by giving it a bit more time to occur. The proper way to deal with this is to not wait, unless there's congestion. Thanks to Mel Gorman, we already have the function that perfectly fits the job. The patch was tested on a machine which nicely revealed the problem after only 1 day of uptime, and it's been working great. --- mm/vmscan.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b7ed376..4588d1d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2546,7 +2546,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { - int all_zones_ok; + struct zone *unbalanced_zone; unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ @@ -2580,7 +2580,7 @@ loop_again: unsigned long lru_pages = 0; int has_under_min_watermark_zone = 0; - all_zones_ok = 1; + unbalanced_zone = NULL; balanced = 0; /* @@ -2719,7 +2719,7 @@ loop_again: } if (!zone_balanced(zone, testorder, 0, end_zone)) { - all_zones_ok = 0; + unbalanced_zone = zone; /* * We are still under min water mark. This * means that we have a GFP_ATOMIC allocation @@ -2752,7 +2752,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. Take a nap, then take @@ -2762,7 +2762,7 @@ loop_again: if (has_under_min_watermark_zone) count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); else - congestion_wait(BLK_RW_ASYNC, HZ/10); + wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); } /* @@ -2781,7 +2781,7 @@ out: * high-order: Balanced zones must make up at least 25% of the node * for the node to be balanced */ - if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { + if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { cond_resched(); try_to_freeze(); -- 1.7.10.4 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
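The gist of the fix can be modeled outside the kernel: congestion_wait() naps unconditionally, while wait_iff_congested() naps only when the backing device is actually congested. A minimal userspace sketch, counting naps instead of sleeping (names here are illustrative, not the kernel functions themselves):

```c
#include <stdbool.h>

/* Model of the old back-off: kswapd takes a HZ/10 nap on every loop
 * iteration, whether or not there is any I/O congestion. */
static int naps_congestion_wait(bool congested, int iterations)
{
    (void)congested;            /* congestion is never consulted */
    return iterations;          /* one unconditional nap per iteration */
}

/* Model of the new back-off: a nap is taken only if the zone's backing
 * device is really congested; otherwise kswapd keeps working. */
static int naps_wait_iff_congested(bool congested, int iterations)
{
    return congested ? iterations : 0;
}
```

With no congestion (the common case on the reporter's machine), the old scheme accumulates iowait for nothing; the new one does not.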
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-19 23:17 [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion Zlatko Calusic @ 2012-12-19 23:25 ` Zlatko Calusic 2012-12-21 11:51 ` Hillf Danton 2012-12-20 11:12 ` Mel Gorman 1 sibling, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-19 23:25 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins Cc: linux-mm, Linux Kernel Mailing List On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone, the Normal zone gets fragmented in time. This requires relatively more pressure in balance_pgdat to get the zone above the required watermark. Unfortunately, the congestion_wait() call in there slows it down for a completely wrong reason, expecting that there's a lot of writeback/swapout, even when there's none (much more common). After a few days, when fragmentation progresses, this flawed logic translates to a very high CPU iowait times, even though there's no I/O congestion at all. If THP is enabled, the problem occurs sooner, but I was able to see it even on !THP kernels, just by giving it a bit more time to occur. The proper way to deal with this is to not wait, unless there's congestion. Thanks to Mel Gorman, we already have the function that perfectly fits the job. The patch was tested on a machine which nicely revealed the problem after only 1 day of uptime, and it's been working great. 
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b7ed376..4588d1d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2546,7 +2546,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { - int all_zones_ok; + struct zone *unbalanced_zone; unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ @@ -2580,7 +2580,7 @@ loop_again: unsigned long lru_pages = 0; int has_under_min_watermark_zone = 0; - all_zones_ok = 1; + unbalanced_zone = NULL; balanced = 0; /* @@ -2719,7 +2719,7 @@ loop_again: } if (!zone_balanced(zone, testorder, 0, end_zone)) { - all_zones_ok = 0; + unbalanced_zone = zone; /* * We are still under min water mark. This * means that we have a GFP_ATOMIC allocation @@ -2752,7 +2752,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. 
Take a nap, then take @@ -2762,7 +2762,7 @@ loop_again: if (has_under_min_watermark_zone) count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); else - congestion_wait(BLK_RW_ASYNC, HZ/10); + wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); } /* @@ -2781,7 +2781,7 @@ out: * high-order: Balanced zones must make up at least 25% of the node * for the node to be balanced */ - if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { + if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { cond_resched(); try_to_freeze(); -- 1.7.10.4 -- Zlatko (this time with proper Signed-off-by line)
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-19 23:25 ` Zlatko Calusic @ 2012-12-21 11:51 ` Hillf Danton 2012-12-27 15:42 ` Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Hillf Danton @ 2012-12-21 11:51 UTC (permalink / raw) To: Zlatko Calusic Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: > static unsigned long balance_pgdat(pg_data_t *pgdat, int order, > int *classzone_idx) > { > - int all_zones_ok; > + struct zone *unbalanced_zone; nit: less hunks if not erase that mark Hillf > unsigned long balanced; > int i; > int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ > @@ -2580,7 +2580,7 @@ loop_again: > unsigned long lru_pages = 0; > int has_under_min_watermark_zone = 0; > > - all_zones_ok = 1; > + unbalanced_zone = NULL; > balanced = 0; > > /* > @@ -2719,7 +2719,7 @@ loop_again: > } > > if (!zone_balanced(zone, testorder, 0, end_zone)) { > - all_zones_ok = 0; > + unbalanced_zone = zone; > /* > * We are still under min water mark. This > * means that we have a GFP_ATOMIC allocation > @@ -2752,7 +2752,7 @@ loop_again: > pfmemalloc_watermark_ok(pgdat)) > wake_up(&pgdat->pfmemalloc_wait); > > - if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) > + if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) > break; /* kswapd: all done */ > /* > * OK, kswapd is getting into trouble. 
Take a nap, then take > @@ -2762,7 +2762,7 @@ loop_again: > if (has_under_min_watermark_zone) > count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); > else > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); > } > > /* > @@ -2781,7 +2781,7 @@ out: > * high-order: Balanced zones must make up at least 25% of the node > * for the node to be balanced > */ > - if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { > + if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { > cond_resched(); > > try_to_freeze(); > -- 1.7.10.4
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-21 11:51 ` Hillf Danton @ 2012-12-27 15:42 ` Zlatko Calusic 2012-12-29 7:25 ` Hillf Danton 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-27 15:42 UTC (permalink / raw) To: Hillf Danton Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 21.12.2012 12:51, Hillf Danton wrote: > On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: >> static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >> int *classzone_idx) >> { >> - int all_zones_ok; >> + struct zone *unbalanced_zone; > > nit: less hunks if not erase that mark > > Hillf This one left unanswered and forgotten because I didn't understand what you meant. Could you elaborate? -- Zlatko
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-27 15:42 ` Zlatko Calusic @ 2012-12-29 7:25 ` Hillf Danton 2012-12-29 12:11 ` Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Hillf Danton @ 2012-12-29 7:25 UTC (permalink / raw) To: Zlatko Calusic Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, Dec 27, 2012 at 11:42 PM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: > On 21.12.2012 12:51, Hillf Danton wrote: >> >> On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> >> wrote: >>> >>> static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >>> int >>> *classzone_idx) >>> { >>> - int all_zones_ok; >>> + struct zone *unbalanced_zone; >> >> >> nit: less hunks if not erase that mark >> >> Hillf > > > This one left unanswered and forgotten because I didn't understand what you > meant. Could you elaborate? > Sure, the patch looks simpler (and nicer) if we don't erase all_zones_ok. Hillf
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-29 7:25 ` Hillf Danton @ 2012-12-29 12:11 ` Zlatko Calusic 0 siblings, 0 replies; 16+ messages in thread From: Zlatko Calusic @ 2012-12-29 12:11 UTC (permalink / raw) To: Hillf Danton Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 29.12.2012 08:25, Hillf Danton wrote: > On Thu, Dec 27, 2012 at 11:42 PM, Zlatko Calusic > <zlatko.calusic@iskon.hr> wrote: >> On 21.12.2012 12:51, Hillf Danton wrote: >>> >>> On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> >>> wrote: >>>> >>>> static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >>>> int >>>> *classzone_idx) >>>> { >>>> - int all_zones_ok; >>>> + struct zone *unbalanced_zone; >>> >>> >>> nit: less hunks if not erase that mark >>> >>> Hillf >> >> >> This one left unanswered and forgotten because I didn't understand what you >> meant. Could you elaborate? >> > Sure, the patch looks simpler(and nicer) if we dont > erase all_zones_ok. > Ah, yes. I gave it a good thought. But, when I introduced unbalanced_zone it just didn't make much sense to me to have two variables with very similar meaning. If I decided to keep all_zones_ok, it would be either: all_zones_ok = true unbalanced_zone = NULL (meaning: if no zone in unbalanced, then all zones must be ok) or all_zones_ok = false unbalanced_zone = struct zone * (meaning: if there's an unbalanced zone, then certainly not all zones are ok) So I decided to use only unbalanced_zone (because I had to!), and remove all_zones_ok to avoid redundancy. I hope it makes sense. If you check my latest (and still queued) optimization (mm: avoid calling pgdat_balanced() needlessly), there again popped up a need for a boolean, but I called it pgdat_is_balanced this time, just to match the name of two other functions. It could've also been called all_zones_ok if you prefer the name? 
Of course, I have no strong feelings about the name, both are OK, so if you want me to redo the patch, just say. Generally speaking, while I always attempt to make a smaller patch (less hunks and less changes = easier to review), before that I'll always try to make the code that results from the commit cleaner, simpler, more readable. For example, I'll always check that I don't mess with whitespace needlessly, unless I think it's actually desirable, here's just one example: "mm: avoid calling pgdat_balanced() needlessly" changes --- } while (--sc.priority >= 0); out: if (!pgdat_balanced(pgdat, order, *classzone_idx)) { --- to --- } while (--sc.priority >= 0); out: if (!pgdat_is_balanced) { --- because I find the latter more correct place for the label "out". Thanks for the comment. -- Zlatko
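The redundancy described above can be stated as a single invariant: all_zones_ok is true exactly when no unbalanced zone was recorded, so the boolean is fully derivable from the pointer. A stub sketch (the struct zone here is a mock stand-in, not the kernel's type):

```c
#include <stdbool.h>
#include <stddef.h>

struct zone { int id; };        /* mock stand-in for the kernel's struct zone */

/* If no zone was recorded as unbalanced, all zones must be OK, and
 * vice versa - keeping both variables would therefore be redundant. */
static bool all_zones_ok(const struct zone *unbalanced_zone)
{
    return unbalanced_zone == NULL;
}
```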
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-19 23:17 [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion Zlatko Calusic 2012-12-19 23:25 ` Zlatko Calusic @ 2012-12-20 11:12 ` Mel Gorman 2012-12-20 20:58 ` Andrew Morton 1 sibling, 1 reply; 16+ messages in thread From: Mel Gorman @ 2012-12-20 11:12 UTC (permalink / raw) To: Zlatko Calusic Cc: Linus Torvalds, Andrew Morton, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, Dec 20, 2012 at 12:17:07AM +0100, Zlatko Calusic wrote: > On a 4GB RAM machine, where Normal zone is much smaller than > DMA32 zone, the Normal zone gets fragmented in time. This requires > relatively more pressure in balance_pgdat to get the zone above the > required watermark. Unfortunately, the congestion_wait() call in there > slows it down for a completely wrong reason, expecting that there's > a lot of writeback/swapout, even when there's none (much more common). > After a few days, when fragmentation progresses, this flawed logic > translates to a very high CPU iowait times, even though there's no > I/O congestion at all. If THP is enabled, the problem occurs sooner, > but I was able to see it even on !THP kernels, just by giving it a bit > more time to occur. > > The proper way to deal with this is to not wait, unless there's > congestion. Thanks to Mel Gorman, we already have the function that > perfectly fits the job. The patch was tested on a machine which > nicely revealed the problem after only 1 day of uptime, and it's been > working great. > --- > mm/vmscan.c | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs
* Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion 2012-12-20 11:12 ` Mel Gorman @ 2012-12-20 20:58 ` Andrew Morton 2012-12-22 18:54 ` [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Andrew Morton @ 2012-12-20 20:58 UTC (permalink / raw) To: Mel Gorman Cc: Zlatko Calusic, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List On Thu, 20 Dec 2012 11:12:08 +0000 Mel Gorman <mgorman@suse.de> wrote: > On Thu, Dec 20, 2012 at 12:17:07AM +0100, Zlatko Calusic wrote: > > On a 4GB RAM machine, where Normal zone is much smaller than > > DMA32 zone, the Normal zone gets fragmented in time. This requires > > relatively more pressure in balance_pgdat to get the zone above the > > required watermark. Unfortunately, the congestion_wait() call in there > > slows it down for a completely wrong reason, expecting that there's > > a lot of writeback/swapout, even when there's none (much more common). > > After a few days, when fragmentation progresses, this flawed logic > > translates to a very high CPU iowait times, even though there's no > > I/O congestion at all. If THP is enabled, the problem occurs sooner, > > but I was able to see it even on !THP kernels, just by giving it a bit > > more time to occur. > > > > The proper way to deal with this is to not wait, unless there's > > congestion. Thanks to Mel Gorman, we already have the function that > > perfectly fits the job. The patch was tested on a machine which > > nicely revealed the problem after only 1 day of uptime, and it's been > > working great. > > --- > > mm/vmscan.c | 12 ++++++------ > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > Acked-by: Mel Gorman <mgorman@suse.de There seems to be some complexity/duplication here between the new unbalanced_zone() and pgdat_balanced(). 
Can we modify pgdat_balanced() so that it also handles order=0, then do - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (!pgdat_balanced(...)) ?
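Whether such a unified check could preserve behaviour can be sanity-checked with a toy model (equal-sized zones, a bitmask of balanced zones, the 25% rule approximated via a popcount; all names hypothetical): the old break test is all_ok || (order && quarter), the unified one is order ? quarter : all_ok, and they agree because a fully balanced node always passes the 25% rule.

```c
#include <stdbool.h>

/* Old kswapd break condition in the model: every zone balanced, or
 * (for high-order) balanced zones cover at least a quarter of the node. */
static bool old_break(unsigned mask, int nzones, int order)
{
    bool all_ok = mask == (1u << nzones) - 1u;
    bool quarter = 4 * __builtin_popcount(mask) >= nzones;
    return all_ok || (order && quarter);
}

/* Unified pgdat_balanced()-style condition: order-0 requires all zones
 * balanced, high-order requires the 25% rule. */
static bool new_break(unsigned mask, int nzones, int order)
{
    bool all_ok = mask == (1u << nzones) - 1u;
    bool quarter = 4 * __builtin_popcount(mask) >= nzones;
    return order ? quarter : all_ok;
}
```

This is only a model of the predicate logic, not of the full scan; in the real function the unbalanced-zone scan and pgdat_balanced() look at live watermarks.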
* [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 2012-12-20 20:58 ` Andrew Morton @ 2012-12-22 18:54 ` Zlatko Calusic 2012-12-23 14:12 ` [PATCH v2] " Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-22 18:54 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 20.12.2012 21:58, Andrew Morton wrote: > There seems to be some complexity/duplication here between the new > unbalanced_zone() and pgdat_balanced(). > > Can we modify pgdat_balanced() so that it also handles order=0, then do > > - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) > + if (!pgdat_balanced(...)) > > ? > Makes sense, I like the idea! Took me some time to wrap my mind around all the logic in balance_pgdat(), while writing my previous patch. Also had to revert one if-case logic to avoid double negation, which would be even harder to grok. But unbalanced_zone (var, not a function!) has to stay because wait_iff_congested() needs a struct zone* param. Here's my take on the subject: ---8<--- mm: modify pgdat_balanced() so that it also handles order=0 Teach pgdat_balanced() about order-0 allocations so that we can simplify code in a few places in vmscan.c. Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 101 +++++++++++++++++++++++++++--------------------------- 1 file changed, 45 insertions(+), 56 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index adc7e90..0d15d99 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2452,12 +2452,16 @@ static bool zone_balanced(struct zone *zone, int order, } /* - * pgdat_balanced is used when checking if a node is balanced for high-order - * allocations. Only zones that meet watermarks and are in a zone allowed - * by the callers classzone_idx are added to balanced_pages.
The total of - * balanced pages must be at least 25% of the zones allowed by classzone_idx - * for the node to be considered balanced. Forcing all zones to be balanced - * for high orders can cause excessive reclaim when there are imbalanced zones. + * pgdat_balanced() is used when checking if a node is balanced. + * + * For order-0, all zones must be balanced! + * + * For high-order allocations only zones that meet watermarks and are in a + * zone allowed by the callers classzone_idx are added to balanced_pages. The + * total of balanced pages must be at least 25% of the zones allowed by + * classzone_idx for the node to be considered balanced. Forcing all zones to + * be balanced for high orders can cause excessive reclaim when there are + * imbalanced zones. * The choice of 25% is due to * o a 16M DMA zone that is balanced will not balance a zone on any * reasonable sized machine @@ -2467,17 +2471,43 @@ static bool zone_balanced(struct zone *zone, int order, * Similarly, on x86-64 the Normal zone would need to be at least 1G * to balance a node on its own. These seemed like reasonable ratios. */ -static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages, - int classzone_idx) +static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) { unsigned long present_pages = 0; + unsigned long balanced_pages = 0; int i; - for (i = 0; i <= classzone_idx; i++) - present_pages += pgdat->node_zones[i].present_pages; + /* Check the watermark levels */ + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; + + if (!populated_zone(zone)) + continue; + + present_pages += zone->present_pages; + + /* + * A special case here: + * + * balance_pgdat() skips over all_unreclaimable after + * DEF_PRIORITY. Effectively, it considers them balanced so + * they must be considered balanced here as well! 
+ */ + if (zone->all_unreclaimable) { + balanced_pages += zone->present_pages; + continue; + } + + if (zone_balanced(zone, order, 0, i)) + balanced_pages += zone->present_pages; + else if (!order) + return false; + } - /* A special case here: if zone has no page, we think it's balanced */ - return balanced_pages >= (present_pages >> 2); + if (order) + return balanced_pages >= (present_pages >> 2); + else + return true; } /* @@ -2511,39 +2541,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, return false; } - /* Check the watermark levels */ - for (i = 0; i <= classzone_idx; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - /* - * balance_pgdat() skips over all_unreclaimable after - * DEF_PRIORITY. Effectively, it considers them balanced so - * they must be considered balanced here as well if kswapd - * is to sleep - */ - if (zone->all_unreclaimable) { - balanced += zone->present_pages; - continue; - } - - if (!zone_balanced(zone, order, 0, i)) - all_zones_ok = false; - else - balanced += zone->present_pages; - } - - /* - * For high-order requests, the balanced zones must contain at least - * 25% of the nodes pages for kswapd to sleep. For order-0, all zones - * must be balanced - */ - if (order) - return pgdat_balanced(pgdat, balanced, classzone_idx); - else - return all_zones_ok; + return pgdat_balanced(pgdat, order, classzone_idx); } /* @@ -2571,7 +2569,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { struct zone *unbalanced_zone; - unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ unsigned long total_scanned; @@ -2605,7 +2602,6 @@ loop_again: int has_under_min_watermark_zone = 0; unbalanced_zone = NULL; - balanced = 0; /* * Scan in the highmem->dma direction for the highest @@ -2761,8 +2757,6 @@ loop_again: * speculatively avoid congestion waits */ zone_clear_flag(zone, ZONE_CONGESTED); - if (i <= *classzone_idx) - balanced += zone->present_pages; } } @@ -2776,7 +2770,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (pgdat_balanced(pgdat, order, *classzone_idx)) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. Take a nap, then take @@ -2800,12 +2794,7 @@ loop_again: } while (--sc.priority >= 0); out: - /* - * order-0: All zones must meet high watermark for a balanced node - * high-order: Balanced zones must make up at least 25% of the node - * for the node to be balanced - */ - if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { + if (!pgdat_balanced(pgdat, order, *classzone_idx)) { cond_resched(); try_to_freeze(); -- 1.8.1.rc0 -- Zlatko
* [PATCH v2] mm: modify pgdat_balanced() so that it also handles order=0 2012-12-22 18:54 ` [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 Zlatko Calusic @ 2012-12-23 14:12 ` Zlatko Calusic 2012-12-26 15:07 ` [PATCH] mm: avoid calling pgdat_balanced() needlessly Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-23 14:12 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List On 22.12.2012 19:54, Zlatko Calusic wrote: > On 20.12.2012 21:58, Andrew Morton wrote: >> There seems to be some complexity/duplication here between the new >> unbalanced_zone() and pgdat_balanced(). >> >> Can we modify pgdat_balanced() so that it also handles order=0, then do >> >> - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) >> + if (!pgdat_balanced(...)) >> >> ? >> > > Makes sense, I like the idea! Took me some time to wrap my mind around > all the logic in balance_pgdat(), while writing my previous patch. Also had > to revert one if-case logic to avoid double negation, which would be even > harder to grok. But unbalanced_zone (var, not a function!) has to stay because > wait_iff_congested() needs a struct zone* param. Here's my take on the subject: > And now also with 3 unused variables removed, no other changes. prepare_kswapd_sleep() now looks so beautiful. ;) I've been testing the patch on 3 different machines, and no problem at all. One of those I pushed hard, and it survived. ---8<--- mm: modify pgdat_balanced() so that it also handles order=0 Teach pgdat_balanced() about order-0 allocations so that we can simplify code in a few places in vmscan.c.
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 105 ++++++++++++++++++++++++++---------------------------------- 1 file changed, 45 insertions(+), 60 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index adc7e90..23291b9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2452,12 +2452,16 @@ static bool zone_balanced(struct zone *zone, int order, } /* - * pgdat_balanced is used when checking if a node is balanced for high-order - * allocations. Only zones that meet watermarks and are in a zone allowed - * by the callers classzone_idx are added to balanced_pages. The total of - * balanced pages must be at least 25% of the zones allowed by classzone_idx - * for the node to be considered balanced. Forcing all zones to be balanced - * for high orders can cause excessive reclaim when there are imbalanced zones. + * pgdat_balanced() is used when checking if a node is balanced. + * + * For order-0, all zones must be balanced! + * + * For high-order allocations only zones that meet watermarks and are in a + * zone allowed by the callers classzone_idx are added to balanced_pages. The + * total of balanced pages must be at least 25% of the zones allowed by + * classzone_idx for the node to be considered balanced. Forcing all zones to + * be balanced for high orders can cause excessive reclaim when there are + * imbalanced zones. * The choice of 25% is due to * o a 16M DMA zone that is balanced will not balance a zone on any * reasonable sized machine @@ -2467,17 +2471,43 @@ static bool zone_balanced(struct zone *zone, int order, * Similarly, on x86-64 the Normal zone would need to be at least 1G * to balance a node on its own. These seemed like reasonable ratios. 
*/ -static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages, - int classzone_idx) +static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) { unsigned long present_pages = 0; + unsigned long balanced_pages = 0; int i; - for (i = 0; i <= classzone_idx; i++) - present_pages += pgdat->node_zones[i].present_pages; + /* Check the watermark levels */ + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; - /* A special case here: if zone has no page, we think it's balanced */ - return balanced_pages >= (present_pages >> 2); + if (!populated_zone(zone)) + continue; + + present_pages += zone->present_pages; + + /* + * A special case here: + * + * balance_pgdat() skips over all_unreclaimable after + * DEF_PRIORITY. Effectively, it considers them balanced so + * they must be considered balanced here as well! + */ + if (zone->all_unreclaimable) { + balanced_pages += zone->present_pages; + continue; + } + + if (zone_balanced(zone, order, 0, i)) + balanced_pages += zone->present_pages; + else if (!order) + return false; + } + + if (order) + return balanced_pages >= (present_pages >> 2); + else + return true; } /* @@ -2489,10 +2519,6 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages, static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, int classzone_idx) { - int i; - unsigned long balanced = 0; - bool all_zones_ok = true; - /* If a direct reclaimer woke kswapd within HZ/10, it's premature */ if (remaining) return false; @@ -2511,39 +2537,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, return false; } - /* Check the watermark levels */ - for (i = 0; i <= classzone_idx; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - /* - * balance_pgdat() skips over all_unreclaimable after - * DEF_PRIORITY. 
Effectively, it considers them balanced so - * they must be considered balanced here as well if kswapd - * is to sleep - */ - if (zone->all_unreclaimable) { - balanced += zone->present_pages; - continue; - } - - if (!zone_balanced(zone, order, 0, i)) - all_zones_ok = false; - else - balanced += zone->present_pages; - } - - /* - * For high-order requests, the balanced zones must contain at least - * 25% of the nodes pages for kswapd to sleep. For order-0, all zones - * must be balanced - */ - if (order) - return pgdat_balanced(pgdat, balanced, classzone_idx); - else - return all_zones_ok; + return pgdat_balanced(pgdat, order, classzone_idx); } /* @@ -2571,7 +2565,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { struct zone *unbalanced_zone; - unsigned long balanced; int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ unsigned long total_scanned; @@ -2605,7 +2598,6 @@ loop_again: int has_under_min_watermark_zone = 0; unbalanced_zone = NULL; - balanced = 0; /* * Scan in the highmem->dma direction for the highest @@ -2761,8 +2753,6 @@ loop_again: * speculatively avoid congestion waits */ zone_clear_flag(zone, ZONE_CONGESTED); - if (i <= *classzone_idx) - balanced += zone->present_pages; } } @@ -2776,7 +2766,7 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) + if (pgdat_balanced(pgdat, order, *classzone_idx)) break; /* kswapd: all done */ /* * OK, kswapd is getting into trouble. 
Take a nap, then take @@ -2800,12 +2790,7 @@ loop_again: } while (--sc.priority >= 0); out: - /* - * order-0: All zones must meet high watermark for a balanced node - * high-order: Balanced zones must make up at least 25% of the node - * for the node to be balanced - */ - if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, *classzone_idx))) { + if (!pgdat_balanced(pgdat, order, *classzone_idx)) { cond_resched(); try_to_freeze(); -- 1.8.1.rc0 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH] mm: avoid calling pgdat_balanced() needlessly 2012-12-23 14:12 ` [PATCH v2] " Zlatko Calusic @ 2012-12-26 15:07 ` Zlatko Calusic 2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-26 15:07 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List Now that balance_pgdat() is slightly tidied up, thanks to more capable pgdat_balanced(), it's become obvious that pgdat_balanced() is called to check the status, then break the loop if pgdat is balanced, just to be immediately called again. The second call is completely unnecessary, of course. The patch introduces pgdat_is_balanced boolean, which helps resolve the above suboptimal behavior, with the added benefit of slightly better documenting one other place in the function where we jump and skip lots of code. Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> --- mm/vmscan.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 23291b9..02bcfa3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { + bool pgdat_is_balanced = false; struct zone *unbalanced_zone; int i; int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ @@ -2638,8 +2639,11 @@ loop_again: zone_clear_flag(zone, ZONE_CONGESTED); } } - if (i < 0) + + if (i < 0) { + pgdat_is_balanced = true; goto out; + } for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; @@ -2766,8 +2770,11 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (pgdat_balanced(pgdat, order, *classzone_idx)) + if (pgdat_balanced(pgdat, order, *classzone_idx)) { + pgdat_is_balanced = true; break; /* kswapd: all done */ + } + /* * OK, kswapd is getting into trouble. Take a nap, then take * another pass across the zones. @@ -2788,9 +2795,9 @@ loop_again: if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; } while (--sc.priority >= 0); -out: - if (!pgdat_balanced(pgdat, order, *classzone_idx)) { +out: + if (!pgdat_is_balanced) { cond_resched(); try_to_freeze(); -- 1.8.1.rc0 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-26 15:07 ` [PATCH] mm: avoid calling pgdat_balanced() needlessly Zlatko Calusic @ 2012-12-28 2:16 ` Zlatko Calusic 2012-12-28 2:49 ` Minchan Kim 2012-12-29 8:45 ` Sedat Dilek 0 siblings, 2 replies; 16+ messages in thread From: Zlatko Calusic @ 2012-12-28 2:16 UTC (permalink / raw) To: Andrew Morton, Mel Gorman Cc: Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek From: Zlatko Calusic <zlatko.calusic@iskon.hr> The unintended consequence of commit 4ae0a48b is that wait_iff_congested() can now be called with NULL struct zone* producing kernel oops like this: BUG: unable to handle kernel NULL pointer dereference IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 This trivial patch fixes it. Reported-by: Zhouping Liu <zliu@redhat.com> Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 02bcfa3..e55ce55 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2782,7 +2782,7 @@ loop_again: if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) { if (has_under_min_watermark_zone) count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); - else + else if (unbalanced_zone) wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); } -- 1.8.1.rc3 -- Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic @ 2012-12-28 2:49 ` Minchan Kim 2012-12-28 13:29 ` Zlatko Calusic 2012-12-29 8:45 ` Sedat Dilek 1 sibling, 1 reply; 16+ messages in thread From: Minchan Kim @ 2012-12-28 2:49 UTC (permalink / raw) To: Zlatko Calusic Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek Hello Zlatko, On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote: > From: Zlatko Calusic <zlatko.calusic@iskon.hr> > > The unintended consequence of commit 4ae0a48b is that > wait_iff_congested() can now be called with NULL struct zone* > producing kernel oops like this: For good description, it would be better to write simple pseudo code flow to show how NULL-zone pass into wait_iff_congested because kswapd code flow is too complex. As I see the code, we have following line above wait_iff_congested. if (!unbalanced_zone || blah blah) break; How can NULL unbalanced_zone reach wait_iff_congested? > > BUG: unable to handle kernel NULL pointer dereference > IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 > > This trivial patch fixes it. 
> > Reported-by: Zhouping Liu <zliu@redhat.com> > Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> > --- > mm/vmscan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 02bcfa3..e55ce55 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2782,7 +2782,7 @@ loop_again: > if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) { > if (has_under_min_watermark_zone) > count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); > - else > + else if (unbalanced_zone) > wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); > } > > -- > 1.8.1.rc3 > > -- > Zlatko > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 2:49 ` Minchan Kim @ 2012-12-28 13:29 ` Zlatko Calusic 2012-12-31 0:50 ` Minchan Kim 0 siblings, 1 reply; 16+ messages in thread From: Zlatko Calusic @ 2012-12-28 13:29 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek On 28.12.2012 03:49, Minchan Kim wrote: > Hello Zlatko, > > On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote: >> From: Zlatko Calusic <zlatko.calusic@iskon.hr> >> >> The unintended consequence of commit 4ae0a48b is that >> wait_iff_congested() can now be called with NULL struct zone* >> producing kernel oops like this: > > For good description, it would be better to write simple pseudo code > flow to show how NULL-zone pass into wait_iff_congested because > kswapd code flow is too complex. > > As I see the code, we have following line above wait_iff_congested. > > if (!unbalanced_zone || blah blah) > break; > > How can NULL unbalanced_zone reach wait_iff_congested? > Hello Minchan, and thanks for the comment. That line was there before commit 4ae0a48b got in, and you're right, it's what was protecting wait_iff_congested() from being called with NULL zone*. But then all that logic got collapsed to a simple pgdat_balanced() call and that's when I introduced the bug, I lost the protection. What I _think_ is happening (pseudo code following...) is that after scanning the zone in the dma->highmem direction, and concluding that all zones are balanced (unbalanced_zone remains NULL!), wake_up(&pgdat->pfmemalloc_wait) wakes up a lot of memory hungry processes (especially true in various aggressive test/benchmarks) that immediately drain and unbalance one or more zones. Then the pgdat_balanced() call which immediately follows will be false, but we still have unbalanced_zone = NULL, remember? Oops... But, all that is a speculation that I can't prove atm.
Of course, if anybody thinks that's a credible explanation, I could add it as a commit comment, or even as a code comment, but I didn't want to be overly imaginative. The fix itself is simple and real. Regards, -- Zlatko ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 13:29 ` Zlatko Calusic @ 2012-12-31 0:50 ` Minchan Kim 0 siblings, 0 replies; 16+ messages in thread From: Minchan Kim @ 2012-12-31 0:50 UTC (permalink / raw) To: Zlatko Calusic Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu, Sedat Dilek Hi Zlatko, On Fri, Dec 28, 2012 at 02:29:11PM +0100, Zlatko Calusic wrote: > On 28.12.2012 03:49, Minchan Kim wrote: > >Hello Zlatko, > > > >On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote: > >>From: Zlatko Calusic <zlatko.calusic@iskon.hr> > >> > >>The unintended consequence of commit 4ae0a48b is that > >>wait_iff_congested() can now be called with NULL struct zone* > >>producing kernel oops like this: > > > >For good description, it would be better to write simple pseudo code > >flow to show how NULL-zone pass into wait_iff_congested because > >kswapd code flow is too complex. > > > >As I see the code, we have following line above wait_iff_congested. > > > >if (!unbalanced_zone || blah blah) > > break; > > > >How can NULL unbalanced_zone reach wait_iff_congested? > > > > Hello Minchan, and thanks for the comment. > > That line was there before commit 4ae0a48b got in, and you're right, Argh, I didn't see 4ae0a48b in 3.8-rc1. > it's what was protecting wait_iff_congested() from being called with > NULL zone*. But then all that logic got colapsed to a simple > pgdat_balanced() call and that's when I introduced the bug, I lost > the protection. > > What I _think_ is happening (pseudo code following...) is that after > scanning the zone in the dma->highmem direction, and concluding that > all zones are balanced (unbalanced_zone remains NULL!), > wake_up(&pgdat->pfmemalloc_wait) wakes up a lot of memory hungry > processes (especially true in various aggressive test/benchmarks) > that immediately drain and unbalance one or more zones. 
Then > pgdat_balanced() call which immediately follows will be false, but > we still have unbalanced_zone = NULL, remember? Oops... > > But, all that is a speculation that I can't prove atm. Of course, if > anybody thinks that's a credible explanation, I could add it as a > commit comment, or even as a code comment, but I didn't want to be > overly imaginative. The fix itself is simple and real. Never mind. My confusion was caused by missing commit 4ae0a48b in the latest tree. Thanks, Zlatko. > > Regards, > -- > Zlatko -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested() 2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic 2012-12-28 2:49 ` Minchan Kim @ 2012-12-29 8:45 ` Sedat Dilek 1 sibling, 0 replies; 16+ messages in thread From: Sedat Dilek @ 2012-12-29 8:45 UTC (permalink / raw) To: Zlatko Calusic Cc: Andrew Morton, Mel Gorman, Linus Torvalds, Hugh Dickins, linux-mm, Linux Kernel Mailing List, Zhouping Liu Just FYI: This patch landed upstream [1]. Thanks for all involved people. - Sedat - [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ecccd1248d6e6986130ffcc3b0d003cb46a485c0 On Fri, Dec 28, 2012 at 3:16 AM, Zlatko Calusic <zlatko.calusic@iskon.hr> wrote: > From: Zlatko Calusic <zlatko.calusic@iskon.hr> > > The unintended consequence of commit 4ae0a48b is that > wait_iff_congested() can now be called with NULL struct zone* > producing kernel oops like this: > > BUG: unable to handle kernel NULL pointer dereference > IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 > > This trivial patch fixes it. > > Reported-by: Zhouping Liu <zliu@redhat.com> > Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> > --- > mm/vmscan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 02bcfa3..e55ce55 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2782,7 +2782,7 @@ loop_again: > if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) { > if (has_under_min_watermark_zone) > count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT); > - else > + else if (unbalanced_zone) > wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10); > } > > -- > 1.8.1.rc3 > > -- > Zlatko -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2012-12-31 0:50 UTC | newest]

Thread overview: 16+ messages:
2012-12-19 23:17 [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion Zlatko Calusic
2012-12-19 23:25 ` Zlatko Calusic
2012-12-21 11:51 ` Hillf Danton
2012-12-27 15:42 ` Zlatko Calusic
2012-12-29 7:25 ` Hillf Danton
2012-12-29 12:11 ` Zlatko Calusic
2012-12-20 11:12 ` Mel Gorman
2012-12-20 20:58 ` Andrew Morton
2012-12-22 18:54 ` [PATCH] mm: modify pgdat_balanced() so that it also handles order=0 Zlatko Calusic
2012-12-23 14:12 ` [PATCH v2] " Zlatko Calusic
2012-12-26 15:07 ` [PATCH] mm: avoid calling pgdat_balanced() needlessly Zlatko Calusic
2012-12-28 2:16 ` [PATCH] mm: fix null pointer dereference in wait_iff_congested() Zlatko Calusic
2012-12-28 2:49 ` Minchan Kim
2012-12-28 13:29 ` Zlatko Calusic
2012-12-31 0:50 ` Minchan Kim
2012-12-29 8:45 ` Sedat Dilek