linux-mm.kvack.org archive mirror
* [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
@ 2011-09-27  7:23 Shaohua Li
  2011-09-27  9:28 ` Michal Hocko
  2011-09-28  6:57 ` Minchan Kim
  0 siblings, 2 replies; 20+ messages in thread
From: Shaohua Li @ 2011-09-27  7:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, mel, Rik van Riel, linux-mm

I saw that the DMA zone always has ->all_unreclaimable set. The reason is that the high
zones are big, so zone_watermark_ok/_safe() always returns false for the DMA zone with a
high classzone_idx, because the DMA zone's lowmem_reserve for a high classzone_idx is big.
When kswapd reaches the DMA zone, it doesn't scan or reclaim any pages (there are no pages
on the LRU), but it still marks the zone all_unreclaimable. This can happen in other low
zones too.
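
To make the mechanism concrete, here is a minimal user-space sketch of the order-0 part of the watermark test. The enum, struct and function names are hypothetical and the numbers are illustrative, not measured; this is a simplified model of the idea, not the kernel's __zone_watermark_ok().

```c
#include <stdbool.h>

/* Hypothetical zone indices for a 32-bit machine with highmem. */
enum { M_DMA, M_NORMAL, M_HIGHMEM, M_NR_ZONES };

/* Hypothetical, stripped-down zone: only what the check below reads. */
struct zone_model {
	long free_pages;                    /* pages on the free lists */
	long lowmem_reserve[M_NR_ZONES];    /* pages held back per classzone */
};

/*
 * Simplified order-0 watermark test: the zone passes only if its free
 * pages exceed the watermark plus the reserve it keeps back from
 * allocations whose classzone is classzone_idx. Allocation order,
 * ALLOC_* flags and the per-order free lists are ignored.
 */
static bool watermark_ok_model(const struct zone_model *z, long mark,
			       int classzone_idx)
{
	return z->free_pages > mark + z->lowmem_reserve[classzone_idx];
}
```

With an illustrative DMA zone of 3900 free pages, a watermark of 32 and lowmem_reserve = {0, 859, 3984}, the check fails for M_HIGHMEM (3900 <= 32 + 3984) but passes for M_DMA, so a kswapd pass with a high classzone_idx keeps seeing an unbalanced DMA zone however much of it is free.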

This is confusing and can potentially cause an oom. Say a low zone has
all_unreclaimable set while the high zone doesn't have enough memory. Then we allocate
some pages in the low zone (for example, reading a block device with highmem support)
and run into direct reclaim. Since the zone has all_unreclaimable set, direct reclaim
might reclaim nothing and an oom is reported. If all_unreclaimable were unset, the zone
could actually reclaim some pages.

With all_unreclaimable left unset, in the inner loop of balance_pgdat we always have
all_zones_ok == 0 when checking a low zone's watermark. If the high zone's watermark isn't
good, there is no problem. Otherwise, we might loop one more time in the outer loop, but
since the high zone's watermark is OK, end_zone will be lower, so the low zone's watermark
check will pass and the outer loop will break. So this doesn't appear to cause any problem.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

---
 mm/vmscan.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c	2011-09-27 13:46:31.000000000 +0800
+++ linux/mm/vmscan.c	2011-09-27 15:09:29.000000000 +0800
@@ -2565,7 +2565,9 @@ loop_again:
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
 
-				if (nr_slab == 0 && !zone_reclaimable(zone))
+				if (nr_slab == 0 && !zone_reclaimable(zone) &&
+				    !zone_watermark_ok_safe(zone, order,
+				    high_wmark_pages(zone) + balance_gap, 0, 0))
 					zone->all_unreclaimable = 1;
 			}
 

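The change to the marking condition can be sketched as a pair of predicates. This is a hypothetical model: struct zone_pass and its fields stand in for the values kswapd has at that point in balance_pgdat(), and the function names are illustrative.

```c
#include <stdbool.h>

/* Hypothetical summary of one kswapd pass over a zone in balance_pgdat(). */
struct zone_pass {
	unsigned long nr_slab;   /* slab pages freed during the pass */
	bool lru_reclaimable;    /* what zone_reclaimable(zone) returned */
	bool own_wmark_ok;       /* stand-in for zone_watermark_ok_safe(zone, order,
	                            high_wmark_pages(zone) + balance_gap, 0, 0) */
};

/* Condition before the patch. */
static bool marks_unreclaimable_before(const struct zone_pass *p)
{
	return p->nr_slab == 0 && !p->lru_reclaimable;
}

/*
 * Condition after the patch: the zone must also fail its own watermark
 * with classzone_idx 0, where its lowmem_reserve entry is 0.
 */
static bool marks_unreclaimable_after(const struct zone_pass *p)
{
	return marks_unreclaimable_before(p) && !p->own_wmark_ok;
}
```

A DMA zone with no LRU pages makes zone_reclaimable() return false (0 < 0 * 6 is false), so before the patch it is marked even when it is mostly free; after the patch only a zone that also fails its own watermark is marked.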

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-27  7:23 [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages Shaohua Li
@ 2011-09-27  9:28 ` Michal Hocko
  2011-09-28  0:46   ` Shaohua Li
  2011-09-28  6:57 ` Minchan Kim
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2011-09-27  9:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrew Morton, mel, Rik van Riel, linux-mm

On Tue 27-09-11 15:23:04, Shaohua Li wrote:
[...]
> Index: linux/mm/vmscan.c
> ===================================================================
> --- linux.orig/mm/vmscan.c	2011-09-27 13:46:31.000000000 +0800
> +++ linux/mm/vmscan.c	2011-09-27 15:09:29.000000000 +0800
> @@ -2565,7 +2565,9 @@ loop_again:
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>  
> -				if (nr_slab == 0 && !zone_reclaimable(zone))
> +				if (nr_slab == 0 && !zone_reclaimable(zone) &&
> +				    !zone_watermark_ok_safe(zone, order,
> +				    high_wmark_pages(zone) + balance_gap, 0, 0))

Hardcoded ZONE_DMA for zone_watermark_ok_safe? Shouldn't this be i for
classzone_idx?

>  					zone->all_unreclaimable = 1;
>  			}
>  

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-27  9:28 ` Michal Hocko
@ 2011-09-28  0:46   ` Shaohua Li
  0 siblings, 0 replies; 20+ messages in thread
From: Shaohua Li @ 2011-09-28  0:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, mel, Rik van Riel, linux-mm

On Tue, 2011-09-27 at 17:28 +0800, Michal Hocko wrote:
> On Tue 27-09-11 15:23:04, Shaohua Li wrote:
> [...]
> > Index: linux/mm/vmscan.c
> > ===================================================================
> > --- linux.orig/mm/vmscan.c	2011-09-27 13:46:31.000000000 +0800
> > +++ linux/mm/vmscan.c	2011-09-27 15:09:29.000000000 +0800
> > @@ -2565,7 +2565,9 @@ loop_again:
> >  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> >  				total_scanned += sc.nr_scanned;
> >  
> > -				if (nr_slab == 0 && !zone_reclaimable(zone))
> > +				if (nr_slab == 0 && !zone_reclaimable(zone) &&
> > +				    !zone_watermark_ok_safe(zone, order,
> > +				    high_wmark_pages(zone) + balance_gap, 0, 0))
> 
> Hardcoded ZONE_DMA for zone_watermark_ok_safe? Shouldn't this be i for
> classzone_idx?
i and 0 are equivalent here: the zone's lowmem_reserve entry is 0 for both.
Actually, a lot of code passes 0 to zone_watermark_ok.
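Why i and 0 give the same reserve can be seen from how the table is filled. The sketch below follows the loop shape of setup_per_zone_lowmem_reserve() in simplified form; NR_MODEL_ZONES, the function name, and the present[]/ratio[] inputs are hypothetical, and only entries for higher classzones are ever written.

```c
#define NR_MODEL_ZONES 3   /* hypothetical: DMA, NORMAL, HIGHMEM */

/*
 * reserve[i][j]: pages zone i holds back from allocations whose
 * classzone is j. present[] and ratio[] are supplied by the caller.
 */
static void compute_lowmem_reserve(const unsigned long present[NR_MODEL_ZONES],
				   const unsigned long ratio[NR_MODEL_ZONES],
				   unsigned long reserve[NR_MODEL_ZONES][NR_MODEL_ZONES])
{
	/* Entries with j <= i are never written below and stay 0. */
	for (int i = 0; i < NR_MODEL_ZONES; i++)
		for (int j = 0; j < NR_MODEL_ZONES; j++)
			reserve[i][j] = 0;

	for (int j = 0; j < NR_MODEL_ZONES; j++) {
		unsigned long pages = present[j];
		int idx = j;

		/* Walk down from zone j, accumulating the sizes of zones above. */
		while (idx) {
			idx--;
			reserve[idx][j] = pages / ratio[idx];
			pages += present[idx];
		}
	}
}
```

With illustrative sizes {4000, 220000, 800000} and ratios {256, 32, 1}, the DMA zone gets reserve[0][2] = 1020000 / 256 = 3984, almost its whole size, while reserve[1][0] and reserve[1][1] are both 0.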

Thanks,
Shaohua


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-27  7:23 [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages Shaohua Li
  2011-09-27  9:28 ` Michal Hocko
@ 2011-09-28  6:57 ` Minchan Kim
  2011-09-28  7:08   ` Shaohua Li
  1 sibling, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-09-28  6:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm

On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> I saw DMA zone always has ->all_unreclaimable set. The reason is the high zones
> are big, so zone_watermark_ok/_safe() will always return false with a high
> classzone_idx for DMA zone, because DMA zone's lowmem_reserve is big for a high
> classzone_idx. When kswapd runs into DMA zone, it doesn't scan/reclaim any
> pages(no pages in lru), but mark the zone as all_unreclaimable. This can
> happen in other low zones too.

Good catch!

> This is confusing and can potentially cause oom. Say a low zone has
> all_unreclaimable when high zone hasn't enough memory. Then allocating
> some pages in low zone(for example reading blkdev with highmem support),
> then run into direct reclaim. Since the zone has all_unreclaimable set,
> direct reclaim might reclaim nothing and an oom reported. If
> all_unreclaimable is unset, the zone can actually reclaim some pages.
> If all_unreclaimable is unset, in the inner loop of balance_pgdat we always have
> all_zones_ok 0 when checking a low zone's watermark. If high zone watermark isn't
> good, there is no problem. Otherwise, we might loop one more time in the outer
> loop, but since high zone watermark is ok, the end_zone will be lower, then low
> zone's watermark check will be ok and the outer loop will break. So looks this
> doesn't bring any problem.

I think it would be better to fix zone_reclaimable.
My point is that zone_reclaimable should take the case zone->pages_scanned == 0 into account.
The point of the function is to compare how many pages were scanned with how many pages remain on the LRU.
If the reclaimer doesn't scan the zone at all because it has no LRU pages, it shouldn't declare
the zone all_unreclaimable.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4480f67..0749b6e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2150,7 +2150,18 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 
 static bool zone_reclaimable(struct zone *zone)
 {
-       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+       bool reclaimable = true;
+       /*
+        * Sometimes a lower zone (e.g. DMA) has no LRU pages
+        * while it has a big lowmem_reserve for a higher zone.
+        * In that case the zone may get all_unreclaimable set
+        * when it is used as a fallback for a higher zone, and
+        * it is never reset because it has no freeable/scannable
+        * pages. So return *true* if nothing has been scanned.
+        */
+       if (zone->pages_scanned)
+               reclaimable = zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+       return reclaimable;
 }
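The proposal can be sketched as two predicates side by side, with zone->pages_scanned and zone_reclaimable_pages(zone) passed in as plain numbers. The function names are hypothetical.

```c
#include <stdbool.h>

/* Existing test: reclaimable while fewer than 6x the reclaimable LRU
 * pages have been scanned. */
static bool zone_reclaimable_old(unsigned long pages_scanned,
				 unsigned long reclaimable_pages)
{
	return pages_scanned < reclaimable_pages * 6;
}

/* Proposed test: a zone that has not been scanned counts as reclaimable. */
static bool zone_reclaimable_proposed(unsigned long pages_scanned,
				      unsigned long reclaimable_pages)
{
	if (pages_scanned == 0)
		return true;
	return zone_reclaimable_old(pages_scanned, reclaimable_pages);
}
```

The two differ only for pages_scanned == 0: a zone with no LRU pages is "unreclaimable" under the old test (0 < 0 is false) and "reclaimable" under the proposed one.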

-- 
Kind regards,
Minchan Kim


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-28  6:57 ` Minchan Kim
@ 2011-09-28  7:08   ` Shaohua Li
  2011-09-28 17:57     ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2011-09-28  7:08 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm

On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > [...]
> 
> I think it would be better to correct zone_reclaimable.
> My point is zone_reclaimable should consider zone->pages_scanned.
> The point of the function is how many pages scanned VS how many pages remained in LRU.
> If reclaimer doesn't scan the zone at all because of no lru pages, it shouldn't tell
> the zone is all_unreclaimable.
Actually, this is exactly my first version of the patch. The problem is that if
a zone is truly unreclaimable (used by kernel pages or whatever), zone->pages_scanned
is 0 too. I thought we should set all_unreclaimable in that case.
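The objection can be sketched with two hypothetical zone snapshots that look identical to the proposed zone_reclaimable() (no LRU pages, nothing scanned) but differ in free pages. The struct, field names and numbers are illustrative; own_watermark_ok() stands in for the zone_watermark_ok_safe(..., 0, 0) test in the original patch, simplified to order 0 with balance_gap omitted.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the zone state the two tests look at. */
struct zone_snapshot {
	unsigned long free_pages;
	unsigned long lru_pages;      /* stand-in for zone_reclaimable_pages() */
	unsigned long pages_scanned;
	unsigned long high_wmark;
};

/* Proposed zone_reclaimable(): nothing scanned means reclaimable. */
static bool proposed_reclaimable(const struct zone_snapshot *z)
{
	if (z->pages_scanned == 0)
		return true;
	return z->pages_scanned < z->lru_pages * 6;
}

/* Own-zone check (classzone 0, so no lowmem_reserve), order 0. */
static bool own_watermark_ok(const struct zone_snapshot *z)
{
	return z->free_pages > z->high_wmark;
}
```

An idle DMA zone and a zone filled with kernel pages both get "reclaimable" from the proposed test, while the watermark test tells them apart.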

Thanks,
Shaohua


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-28  7:08   ` Shaohua Li
@ 2011-09-28 17:57     ` Minchan Kim
  2011-09-29  1:14       ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-09-28 17:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm

On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > [...]
> > 
> > I think it would be better to correct zone_reclaimable.
> > My point is zone_reclaimable should consider zone->pages_scanned.
> > The point of the function is how many pages scanned VS how many pages remained in LRU.
> > If reclaimer doesn't scan the zone at all because of no lru pages, it shouldn't tell
> > the zone is all_unreclaimable.
> actually this is exact my first version of the patch. The problem is if
> a zone is true unreclaimable (used by kernel pages or whatever), we will
> have zone->pages_scanned 0 too. I thought we should set
> all_unreclaimable in this case.

Let's think about the problem again.
The fundamental problem is why the lower zone's lowmem_reserve for a higher zone is so big
that it might be bigger than the zone's size.
I think we need a bound on lowmem_reserve.
So how about this?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a25213..9267db4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5101,6 +5101,7 @@ static void setup_per_zone_lowmem_reserve(void)
                        idx = j;
                        while (idx) {
                                struct zone *lower_zone;
+                               unsigned long lowmem_reserve;
 
                                idx--;
 
@@ -5108,8 +5109,9 @@ static void setup_per_zone_lowmem_reserve(void)
                                        sysctl_lowmem_reserve_ratio[idx] = 1;
 
                                lower_zone = pgdat->node_zones + idx;
-                               lower_zone->lowmem_reserve[j] = present_pages /
-                                       sysctl_lowmem_reserve_ratio[idx];
+                               lowmem_reserve = present_pages / sysctl_lowmem_reserve_ratio[idx];
+                               lower_zone->lowmem_reserve[j] = min(lowmem_reserve,
+                                               lower_zone->present_pages - high_wmark_pages(zone));
                                present_pages += lower_zone->present_pages;
                        }
                }
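The effect of the clamp, and the fallback it re-enables, can be sketched in user space. The function names are hypothetical and the numbers illustrative; high_wmark stands for the high_wmark_pages(zone) term in the patch. The guard against high_wmark > lower_present is an addition for this sketch, since plain unsigned subtraction would wrap.

```c
#include <stdbool.h>

/* Clamp as proposed: reserve at most lower_present - high_wmark pages. */
static unsigned long clamp_lowmem_reserve(unsigned long raw_reserve,
					  unsigned long lower_present,
					  unsigned long high_wmark)
{
	unsigned long cap = lower_present > high_wmark ?
			    lower_present - high_wmark : 0;

	return raw_reserve < cap ? raw_reserve : cap;
}

/*
 * Simplified order-0 check applied when an allocation with a higher
 * classzone considers the lower zone.
 */
static bool fallback_allowed(unsigned long free_pages, unsigned long mark,
			     unsigned long reserve)
{
	return free_pages > mark + reserve;
}
```

For a 4000-page DMA zone with a raw reserve of 12000 and a high watermark of 32, the clamped reserve is 3968. A nearly free zone (3995 pages, low watermark 24) then passes the check, 3995 > 3992, so highmem allocations may start falling back into it; with the unclamped reserve it never does.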


> 
> Thanks,
> Shaohua
> 

-- 
Kind regards,
Minchan Kim


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-28 17:57     ` Minchan Kim
@ 2011-09-29  1:14       ` Shaohua Li
  2011-09-29  9:18         ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2011-09-29  1:14 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm

On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > [...]
> > > 
> > > I think it would be better to correct zone_reclaimable.
> > > My point is zone_reclaimable should consider zone->pages_scanned.
> > > The point of the function is how many pages scanned VS how many pages remained in LRU.
> > > If reclaimer doesn't scan the zone at all because of no lru pages, it shouldn't tell
> > > the zone is all_unreclaimable.
> > actually this is exact my first version of the patch. The problem is if
> > a zone is true unreclaimable (used by kernel pages or whatever), we will
> > have zone->pages_scanned 0 too. I thought we should set
> > all_unreclaimable in this case.
> 
> Let's think the problem again.
> Fundamental problem is that why the lower zone's lowmem_reserve for higher zone is huge big
> that might be bigger than the zone's size.
> I think we need the boundary for limiting lowmem_reserve.
> So how about this?
I don't see a reason why a high zone allocation should fall back to a low
zone if the high zone is big. Changing lowmem_reserve can cause that
fallback. Is there any rationale here?

Thanks,
Shaohua


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-29  1:14       ` Shaohua Li
@ 2011-09-29  9:18         ` Minchan Kim
  2011-09-30  2:12           ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-09-29  9:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm

On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > [...]
> > > > 
> > > > I think it would be better to correct zone_reclaimable.
> > > > My point is zone_reclaimable should consider zone->pages_scanned.
> > > > The point of the function is how many pages scanned VS how many pages remained in LRU.
> > > > If reclaimer doesn't scan the zone at all because of no lru pages, it shouldn't tell
> > > > the zone is all_unreclaimable.
> > > actually this is exact my first version of the patch. The problem is if
> > > a zone is true unreclaimable (used by kernel pages or whatever), we will
> > > have zone->pages_scanned 0 too. I thought we should set
> > > all_unreclaimable in this case.
> > 
> > Let's think the problem again.
> > Fundamental problem is that why the lower zone's lowmem_reserve for higher zone is huge big
> > that might be bigger than the zone's size.
> > I think we need the boundary for limiting lowmem_reserve.
> > So how about this?
> I didn't see a reason why high zone allocation should fallback to low
> zone if high zone is big. Changing the lowmem_reserve can cause the
> fallback. Has any rationale here?

I tried to think of a better solution than yours, but I failed. :(
The reason I try to avoid your patch is that kswapd is very complicated these days, so
I didn't want to add more logic for handling corner cases if we can solve it in
other ways. But as I said, I failed.

My previous patch that limits lowmem_reserve doesn't seem to make sense, because
a higher zone can be very big, so the low zone's lowmem_reserve[higher_zone] can
legitimately be bigger than the low zone's own size.
That implies the low zone should not be used for higher-zone allocations,
so there is no reason to limit it. My brain was broken. :(

But I still have a question about your patch.
What happens if the DMA zone has zone->all_unreclaimable set to 1?

You said:

> This is confusing and can potentially cause oom. Say a low zone has
> all_unreclaimable when high zone hasn't enough memory. Then allocating
> some pages in low zone(for example reading blkdev with highmem support),
> then run into direct reclaim. Since the zone has all_unreclaimable set,

If the low zone had enough pages for the allocation, it couldn't have entered reclaim.
So the low zone now doesn't have enough free pages for an allocation of that order,
and it's natural to enter the reclaim path.

> direct reclaim might reclaim nothing and an oom reported. If

It's not correct that it does "nothing". At least, it will do something at DEF_PRIORITY.

> all_unreclaimable is unset, the zone can actually reclaim some pages.

You said the cause of this problem is that the zone has no LRU pages.
Then how could we reclaim some pages in the zone even if the zone's all_unreclaimable is unset?
Do you expect slab pages?
But our heuristic for setting all_unreclaimable in kswapd is that we can't reclaim
any more slab pages (i.e., nr_slab == 0) and have done too much LRU scanning.
So I think we should not depend on the luck of reclaiming some slab pages.

If I misunderstood your point, could you elaborate more?
The reason I am very picky about this is that I'd really like to avoid complicating kswapd
without a real problem.

> 
> Thanks,
> Shaohua
> 

-- 
Kind regards,
Minchan Kim


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-29  9:18         ` Minchan Kim
@ 2011-09-30  2:12           ` Shaohua Li
  2011-10-01  6:59             ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2011-09-30  2:12 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm

On Thu, 2011-09-29 at 17:18 +0800, Minchan Kim wrote:
> On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> > On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > > [...]
> > > > > 
> > > > > I think it would be better to correct zone_reclaimable.
> > > > > My point is zone_reclaimable should consider zone->pages_scanned.
> > > > > The point of the function is how many pages scanned VS how many pages remained in LRU.
> > > > > If reclaimer doesn't scan the zone at all because of no lru pages, it shouldn't tell
> > > > > the zone is all_unreclaimable.
> > > > actually this is exact my first version of the patch. The problem is if
> > > > a zone is true unreclaimable (used by kernel pages or whatever), we will
> > > > have zone->pages_scanned 0 too. I thought we should set
> > > > all_unreclaimable in this case.
> > > 
> > > Let's think the problem again.
> > > Fundamental problem is that why the lower zone's lowmem_reserve for higher zone is huge big
> > > that might be bigger than the zone's size.
> > > I think we need the boundary for limiting lowmem_reserve.
> > > So how about this?
> > I didn't see a reason why high zone allocation should fallback to low
> > zone if high zone is big. Changing the lowmem_reserve can cause the
> > fallback. Has any rationale here?
> 
> I try to think better solution than yours but I got failed. :(
> The why I try to avoid your patch is that kswapd is very complicated these days so
> I wanted to not add more logic for handling corner cases if we can solve it
> other ways. But as I said, but I got failed.
> 
> It seems that it doesn't make sense that previous my patch that limit lowmem_reserve.
> Because we can have higher zone which is very big size so that lowmem_zone[higher_zone] of
> low zone could be bigger freely than the lowmem itself size.
> It implies the low zone should be not used for higher allocation.
> It has no reason to limit it. My brain was broken. :(
> 
> But I have a question about your patch still.
> What happens if DMA zone sets zone->all_unreclaimable with 1?
> 
> You said as follows,
> 
> > This is confusing and can potentially cause oom. Say a low zone has
> > all_unreclaimable when high zone hasn't enough memory. Then allocating
> > some pages in low zone(for example reading blkdev with highmem support),
> > then run into direct reclaim. Since the zone has all_unreclaimable set,
> 
> If low zone has enough pages for allocation, it cannot have entered reclaim.
> It means now low zone doesn't have enough free pages for the order allocation.
> So it's natural to enter reclaim path.
> 
> > direct reclaim might reclaim nothing and an oom reported. If
> 
> It's not correct "nothing". At least, it will do something in DEF_PRIORITY.
It does something, but it might not reclaim any pages. For example, at
DEF_PRIORITY it starts page writeback, but the pages aren't on disk yet,
and at !DEF_PRIORITY it skips further reclaim.
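A minimal sketch of why this matters, assuming the rule in shrink_zones() of this era that skips a zone with all_unreclaimable set unless priority == DEF_PRIORITY (12). The loop only counts the priority levels at which direct reclaim would scan the zone; it ignores the early exit once enough pages are reclaimed, and the names are hypothetical.

```c
#include <stdbool.h>

#define DEF_PRIORITY_MODEL 12   /* DEF_PRIORITY is 12 in kernels of this era */

/*
 * Count the priority levels, from DEF_PRIORITY down to 0, at which direct
 * reclaim would scan a zone. A zone flagged all_unreclaimable is only
 * looked at on the first, lightest pass.
 */
static int passes_scanning_zone(bool all_unreclaimable)
{
	int passes = 0;

	for (int priority = DEF_PRIORITY_MODEL; priority >= 0; priority--) {
		if (all_unreclaimable && priority != DEF_PRIORITY_MODEL)
			continue;   /* left for kswapd to poll */
		passes++;
	}
	return passes;
}
```

A flagged zone gets a single DEF_PRIORITY pass instead of 13; if that pass only starts writeback on dirty pages, nothing is freed before the zone is skipped for the rest of the loop.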

> > all_unreclaimable is unset, the zone can actually reclaim some pages.
> 
> The reason of this problem is that the zone has no lru page, you said.
> Then how could we reclaim some pages in the zone even if the zone's all_unreclaimable is unset?
> You expect slab pages?
The zone could have LRU pages. Take an example: an allocation from
ZONE_HIGHMEM, then kswapd runs and ZONE_NORMAL gets all_unreclaimable set
even though it has free pages. Then we write to a block device, which uses
ZONE_NORMAL for the page cache. Some pages in ZONE_NORMAL are on the LRU, and
then we run into direct page reclaim for ZONE_NORMAL. Since all_unreclaimable
is set and the pages on ZONE_NORMAL's LRU are dirty, direct reclaim could
fail. But I agree this is a corner case.
Besides, when I saw that ZONE_DMA had a lot of free pages and
all_unreclaimable set, it was really confusing.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-09-30  2:12           ` Shaohua Li
@ 2011-10-01  6:59             ` Minchan Kim
  2011-10-08  3:09               ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-10-01  6:59 UTC
  To: Shaohua Li
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Fri, Sep 30, 2011 at 10:12:23AM +0800, Shaohua Li wrote:
> The zone could have lru pages. Take an example: an allocation from
> ZONE_HIGHMEM wakes kswapd, and ZONE_NORMAL gets all_unreclaimable set
> even though it has free pages. Then we write to a blkdev device, which
> uses ZONE_NORMAL for its page cache. Some pages in ZONE_NORMAL are now on
> the lru, and we run into direct page reclaim for ZONE_NORMAL. Since
> all_unreclaimable is set and the pages on the ZONE_NORMAL lru are dirty,
> direct reclaim could fail. But I'd agree this is a corner case.
> Besides, seeing ZONE_DMA with a lot of free pages and all_unreclaimable
> set is really confusing.

Hi Shaohua,
Sorry for the late response, and thanks for your explanation.
I think it's worth fixing.
How about this?

I hope others are interested in the problem too.
I've Cc'd them.

* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-01  6:59             ` Minchan Kim
@ 2011-10-08  3:09               ` Shaohua Li
  2011-10-08  4:32                 ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2011-10-08  3:09 UTC
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Sat, 2011-10-01 at 14:59 +0800, Minchan Kim wrote:
> Hi Shaohua,
> Sorry for the late response, and thanks for your explanation.
> I think it's worth fixing.
> How about this?
> 
> I hope others are interested in the problem too.
> I've Cc'd them.
Hi,
it's a long holiday here, so I'm late, sorry.

> From 070d5b1a69921bc71c6aaa5445fb1d29ecb38f74 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Sat, 1 Oct 2011 15:26:08 +0900
> Subject: [RFC] vmscan: set all_unreclaimable of zone carefully
> 
> Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> can be big.
> 
> It could be a problem as follows:
> 
> Assumptions:
> 1. The system has a big HIGH zone, so lowmem_reserve[HIGH] of the DMA zone is big.
> 2. The HIGH/NORMAL zones are full but the DMA zone has enough free pages.
> 
> Scenario:
> 1. A requests allocation of a page in the HIGH zone.
> 2. The HIGH/NORMAL zones have already used up lots of pages, so the allocation falls back to the DMA zone.
> 3. In the DMA zone the allocation fails too, because lowmem_reserve[HIGH] is very big, so the allocator wakes up kswapd.
> 4. kswapd calls shrink_zone when it reaches the DMA zone, because the DMA zone's lowmem_reserve[HIGHMEM]
>    is so big that the zone can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
>    *end_zone*).
> 5. The DMA zone doesn't meet the stop condition (nr_slab != 0, !zone_reclaimable) because the zone has few lru pages
>    and no slab pages, so kswapd easily sets all_unreclaimable of the zone to *1*.
> 6. B requests many pages from the NORMAL zone, but the NORMAL zone has no free pages,
>    so the allocation falls back to the DMA zone.
> 7. The DMA zone allocates many pages for the NORMAL request because lowmem_reserve[NORMAL] is small.
>    These pages are used by the application (i.e., they are LRU pages. Yes, now the DMA zone can have many reclaimable pages).
> 8. C requests a page from the NORMAL zone but fails because the DMA zone doesn't have enough free pages.
>    (Most of the pages in the DMA zone are consumed by B.)
> 9. kswapd tries to reclaim lru pages in the DMA zone but fails because all_unreclaimable of the zone is 1. Otherwise,
>    it could reclaim many of the pages used by B.
> 
> Of course, we can do something at DEF_PRIORITY, but it can't do enough because it can't raise
> synchronous reclaim in the direct reclaim path if the zone has many dirty pages,
> so the process is killed by OOM.
> 
> The principal problem is caused by step 7.
> In step 7, we increased the LRU size a lot, but zone->all_unreclaimable is still 1.
> If the LRU size increases, it is worth trying to reclaim again.
> The rationale is the same as why we reset all_unreclaimable to 0 even if we free just one page.
So this fixes the oom, but the DMA zone still always has all_unreclaimable
set, because all_unreclaimable == zone_reclaimable_pages() + 1. Isn't that
a problem?
What's wrong with my original patch? It seems reasonable not to mark a
zone unreclaimable if it has a lot of free memory.

Thanks,
Shaohua

* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-08  3:09               ` Shaohua Li
@ 2011-10-08  4:32                 ` Minchan Kim
  2011-10-08  5:48                   ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-10-08  4:32 UTC
  To: Shaohua Li
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Sat, Oct 08, 2011 at 11:09:51AM +0800, Shaohua Li wrote:
> Hi,
> it's a long holiday here, so I'm late, sorry.

No problem. I couldn't access the internet freely, either.

> So this fixes the oom, but the DMA zone still always has all_unreclaimable
> set, because all_unreclaimable == zone_reclaimable_pages() + 1. Isn't that
> a problem?

I think we can fix it if it is needed.

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 471b20b..ede852c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1019,7 +1019,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
                   "\n  all_unreclaimable: %u"
                   "\n  start_pfn:         %lu"
                   "\n  inactive_ratio:    %u",
-                  zone->all_unreclaimable,
+                  zone_unreclaimable(zone),
                   zone->zone_start_pfn,
                   zone->inactive_ratio);
        seq_putc(m, '\n');

I don't think it's a big problem in the first place.
all_unreclaimable doesn't mean "we have no free pages in the zone" but "we have no more reclaimable pages in the zone".
A zone can be in that state while still having free pages when the reserve memory for higher zones is set very high.
If you think it's awkward, we could add a description of that to Documentation.

> What's wrong with my original patch? It seems reasonable not to mark a
> zone unreclaimable if it has a lot of free memory.

As I said, all_unreclaimable doesn't mean "no free pages in the zone".
So we shouldn't add a new dependency between all_unreclaimable and the number of free pages.

And what's the purpose of the high_wmark check and its order-specific check?
Does failing it really mean "the zone has no free memory"?

Your description tries to explain that, and it seems to depend on the outer/inner loops in balance_pgdat.
(But it's very hard for dumb me to parse :( )
I don't like such fragile code. What if we change it in the future?
Of course, we can add enough description about it, but that means more complexity.
I don't want to add more complexity to kswapd unless we have a good reason.

The most important thing is that the problem isn't related to the number of free pages.
As I stated in my patch, the problem is caused by not considering changes in LRU size.
I would like to target the principal cause.
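The RFC patch body is not reproduced in this message. Going only by its description and by Shaohua's remark that the flag ends up holding zone_reclaimable_pages() + 1, the sketch below shows one hypothetical way to express "reconsider the zone once its LRU grows". `toy_mark_unreclaimable()`, `toy_zone_unreclaimable()` and `toy_free_page()` are illustrative names, not the functions from the patch. Only the "freeing a page clears the flag" rule is taken from the patch description.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct zone. */
struct toy_zone {
	unsigned long reclaimable_pages; /* current LRU size */
	unsigned long all_unreclaimable; /* 0 = clear, else LRU size when marked, plus 1 */
};

/* Record the LRU size at marking time; the +1 keeps an empty zone nonzero. */
static void toy_mark_unreclaimable(struct toy_zone *z)
{
	assert(z->reclaimable_pages < ULONG_MAX);
	z->all_unreclaimable = z->reclaimable_pages + 1;
}

/* Unreclaimable only until the LRU grows past the recorded size. */
static bool toy_zone_unreclaimable(const struct toy_zone *z)
{
	return z->all_unreclaimable != 0 &&
	       z->reclaimable_pages < z->all_unreclaimable;
}

/* Existing rule kept as-is: freeing a page into the zone clears the flag. */
static void toy_free_page(struct toy_zone *z)
{
	z->all_unreclaimable = 0;
}
```

In this model an empty zone is still flagged after marking (the stored value is 1), which matches Shaohua's observation that the DMA zone keeps all_unreclaimable set. What changes is that growth of the LRU, as in step 7 of the scenario, makes the zone eligible for reclaim again.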

> 
> Thanks,
> Shaohua
> 

-- 
Kind regards,
Minchan Kim

* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-08  4:32                 ` Minchan Kim
@ 2011-10-08  5:48                   ` Shaohua Li
  2011-10-08  9:35                     ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2011-10-08  5:48 UTC
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Sat, 2011-10-08 at 12:32 +0800, Minchan Kim wrote:
> On Sat, Oct 08, 2011 at 11:09:51AM +0800, Shaohua Li wrote:
> > On Sat, 2011-10-01 at 14:59 +0800, Minchan Kim wrote:
> > > On Fri, Sep 30, 2011 at 10:12:23AM +0800, Shaohua Li wrote:
> > > > On Thu, 2011-09-29 at 17:18 +0800, Minchan Kim wrote:
> > > > > On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> > > > > > On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > > > > > > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > > > > > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > > > > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > > > > > > I saw DMA zone always has ->all_unreclaimable set. The reason is the high zones
> > > > > > > > > > are big, so zone_watermark_ok/_safe() will always return false with a high
> > > > > > > > > > classzone_idx for DMA zone, because DMA zone's lowmem_reserve is big for a high
> > > > > > > > > > classzone_idx. When kswapd runs into DMA zone, it doesn't scan/reclaim any
> > > > > > > > > > pages(no pages in lru), but mark the zone as all_unreclaimable. This can
> > > > > > > > > > happen in other low zones too.
> > > > > > > > >
> > > > > > > > > Good catch!
> > > > > > > > >
> > > > > > > > > > This is confusing and can potentially cause oom. Say a low zone has
> > > > > > > > > > all_unreclaimable when high zone hasn't enough memory. Then allocating
> > > > > > > > > > some pages in low zone(for example reading blkdev with highmem support),
> > > > > > > > > > then run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > > > > > direct reclaim might reclaim nothing and an oom reported. If
> > > > > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > > > > > If all_unreclaimable is unset, in the inner loop of balance_pgdat we always have
> > > > > > > > > > all_zones_ok 0 when checking a low zone's watermark. If high zone watermark isn't
> > > > > > > > > > good, there is no problem. Otherwise, we might loop one more time in the outer
> > > > > > > > > > loop, but since high zone watermark is ok, the end_zone will be lower, then low
> > > > > > > > > > zone's watermark check will be ok and the outer loop will break. So looks this
> > > > > > > > > > doesn't bring any problem.
> > > > > > > > >
> > > > > > > > > I think it would be better to correct zone_reclaimable.
> > > > > > > > > My point is zone_reclaimable should consider zone->pages_scanned.
> > > > > > > > > The point of the function is how many pages scanned VS how many pages remained in LRU.
> > > > > > > > > If reclaimer doesn't scan the zone at all because of no lru pages, it shouldn't tell
> > > > > > > > > the zone is all_unreclaimable.
> > > > > > > > actually this is exact my first version of the patch. The problem is if
> > > > > > > > a zone is true unreclaimable (used by kernel pages or whatever), we will
> > > > > > > > have zone->pages_scanned 0 too. I thought we should set
> > > > > > > > all_unreclaimable in this case.
> > > > > > >
> > > > > > > Let's rethink the problem.
> > > > > > > The fundamental problem is why the lower zone's lowmem_reserve for a higher zone is so huge
> > > > > > > that it might be bigger than the zone's size.
> > > > > > > I think we need a bound on lowmem_reserve.
> > > > > > > So how about this?
> > > > > > I don't see a reason why a high zone allocation should fall back to a low
> > > > > > zone just because the high zone is big. Changing the lowmem_reserve can cause that
> > > > > > fallback. Is there any rationale here?
> > > > >
> > > > > I tried to think of a better solution than yours but failed. :(
> > > > > The reason I tried to avoid your patch is that kswapd is very complicated these days, so
> > > > > I didn't want to add more logic for handling corner cases if we could solve it
> > > > > in other ways. But, as I said, I failed.
> > > > >
> > > > > My previous patch that limits lowmem_reserve doesn't make sense,
> > > > > because we can have a very big higher zone, so lowmem_reserve[higher_zone] of a
> > > > > low zone can legitimately be bigger than the low zone's own size.
> > > > > That implies the low zone should not be used for higher-zone allocations.
> > > > > There is no reason to limit it. My brain was broken. :(
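A rough sketch of why lowmem_reserve can exceed the low zone's own size, modelled on how setup_per_zone_lowmem_reserve() derives the reserve: lower zone idx holds back the pages of zones idx+1..j divided by sysctl_lowmem_reserve_ratio[idx]. The layout is an assumption (a 32-bit PAE box with 4096 DMA pages, 225280 NORMAL pages and 2097152 HIGHMEM pages, ratios of 256 and 32), and compute_lowmem_reserve() is a hypothetical helper.

```c
#include <assert.h>

#define NR_ZONES 3 /* assumed layout: DMA, NORMAL, HIGHMEM */

/*
 * reserve[idx][j]: pages zone idx holds back from allocations whose
 * preferred zone is j, i.e. (pages of zones idx+1..j) / ratio[idx].
 */
static void compute_lowmem_reserve(const unsigned long present[NR_ZONES],
				   const unsigned long ratio[NR_ZONES - 1],
				   unsigned long reserve[NR_ZONES][NR_ZONES])
{
	for (int z = 0; z < NR_ZONES; z++)
		for (int j = 0; j < NR_ZONES; j++)
			reserve[z][j] = 0;

	for (int j = 0; j < NR_ZONES; j++) {
		unsigned long pages = present[j];

		/* Walk down from zone j, accumulating the zones above idx. */
		for (int idx = j - 1; idx >= 0; idx--) {
			reserve[idx][j] = pages / ratio[idx];
			pages += present[idx];
		}
	}
}
```

With these inputs reserve[0][2] is 9072 pages, more than twice the DMA zone's 4096 pages, so a HIGHMEM-preferred request can never be served from DMA, while a NORMAL-preferred one only faces reserve[0][1] = 880 pages. Capping the reserve would let HIGHMEM allocations fall back into DMA again.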
> > > > >
> > > > > But I still have a question about your patch.
> > > > > What happens if the DMA zone has zone->all_unreclaimable set to 1?
> > > > >
> > > > > You said as follows,
> > > > >
> > > > > > This is confusing and can potentially cause an oom. Say a low zone has
> > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > some pages in the low zone (for example, reading a blkdev with highmem support)
> > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > >
> > > > > If the low zone had enough pages for the allocation, it could not have entered reclaim.
> > > > > It means the low zone doesn't have enough free pages for an allocation of that order.
> > > > > So it's natural to enter the reclaim path.
> > > > >
> > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > >
> > > > > "Nothing" isn't correct. At least it will do something at DEF_PRIORITY.
> > > > it does something, but might not reclaim any pages. For example, it
> > > > starts page writeback at DEF_PRIORITY, but the page isn't on disk yet, and it
> > > > skips further reclaim at !DEF_PRIORITY.
> > > >
> > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > >
> > > > > You said the cause of this problem is that the zone has no lru pages.
> > > > > Then how could we reclaim any pages in the zone even if the zone's all_unreclaimable is unset?
> > > > > Do you expect slab pages?
> > > > The zone could have lru pages. Let's take an example: an allocation from
> > > > ZONE_HIGHMEM, then kswapd runs, and ZONE_NORMAL gets all_unreclaimable set
> > > > even though it has free pages. Then we write to a block device, which uses
> > > > ZONE_NORMAL for page cache. Some pages in ZONE_NORMAL are on the lru, then
> > > > we run into direct page reclaim for ZONE_NORMAL. Since all_unreclaimable
> > > > is set and the pages on the ZONE_NORMAL lru are dirty, direct reclaim could
> > > > fail. But I'd agree this is a corner case.
> > > > Besides, when I saw ZONE_DMA with a lot of free pages and
> > > > all_unreclaimable set, it was really confusing.
> > >
> > > Hi Shaohua,
> > > Sorry for the late response, and thanks for your explanation.
> > > I think it's worth fixing.
> > > How about this?
> > >
> > > I hope other folks are interested in the problem.
> > > Cced them.
> > Hi,
> > it's a long holiday here, so I'm late, sorry.
> 
> No problem. I couldn't access the internet freely, either.
> 
> >
> > > From 070d5b1a69921bc71c6aaa5445fb1d29ecb38f74 Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <minchan.kim@gmail.com>
> > > Date: Sat, 1 Oct 2011 15:26:08 +0900
> > > Subject: [RFC] vmscan: set all_unreclaimable of zone carefully
> > >
> > > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > > because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> > > could be big.
> > >
> > > It could be a problem as follows
> > >
> > > Assumptions:
> > > 1. The system has a big high memory zone, so lowmem_reserve[HIGH] of the DMA zone would be big.
> > > 2. The HIGH/NORMAL zones are full but the DMA zone has enough free pages.
> > >
> > > Scenario
> > > 1. A requests to allocate a page in the HIGH zone.
> > > 2. The HIGH/NORMAL zones have already consumed lots of pages, so the request falls back to the DMA zone.
> > > 3. In the DMA zone, the allocation fails too, because lowmem_reserve[HIGH] is very big, so it wakes up kswapd.
> > > 4. kswapd calls shrink_zone when it reaches the DMA zone, since the DMA zone's lowmem_reserve[HIGHMEM]
> > >    is so big that it can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> > >    *end_zone*).
> > > 5. The DMA zone doesn't meet the stop condition (nr_slab != 0, !zone_reclaimable) because it has few lru pages
> > >    and no slab pages, so kswapd easily sets all_unreclaimable of the zone to *1*.
> > > 6. B requests to allocate many pages in the NORMAL zone, but the NORMAL zone has no free pages,
> > >    so the request falls back to the DMA zone.
> > > 7. The DMA zone allocates many pages for the NORMAL request because lowmem_reserve[NORMAL] is small.
> > >    These pages are used by the application (i.e., they are LRU pages. Yes, now the DMA zone could have many reclaimable pages.)
> > > 8. C requests to allocate a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages.
> > >    (Most of the pages in the DMA zone were consumed by B.)
> > > 9. kswapd tries to reclaim lru pages in the DMA zone but fails because all_unreclaimable of the zone is 1. Otherwise,
> > >    it could reclaim many of the pages used by B.
> > >
> > > Of course, we can do something at DEF_PRIORITY, but it can't do enough because it can't trigger
> > > synchronous reclaim in the direct reclaim path if the zone has many dirty pages
> > > so that the process is killed by OOM.
> > >
> > > The principal problem is caused by step 8.
> > > In step 8, the lru size has grown a lot, but zone->all_unreclaimable is still 1.
> > > If the lru size grows, it is worth trying to reclaim again.
> > > The rationale is that we reset all_unreclaimable to 0 even if we free just one page.
> > So this fixes the oom, but the DMA zone still always has all_unreclaimable
> > set, because all_unreclaimable == zone_reclaimable_pages() + 1. Isn't that a
> > problem?
> 
> I think we can fix it if it is needed.
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 471b20b..ede852c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1019,7 +1019,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n  all_unreclaimable: %u"
>                    "\n  start_pfn:         %lu"
>                    "\n  inactive_ratio:    %u",
> -                  zone->all_unreclaimable,
> +                  zone_unreclaimable(zone),
>                    zone->zone_start_pfn,
>                    zone->inactive_ratio);
>         seq_putc(m, '\n');
> 
> I don't think it's a big problem in the first place.
> all_unreclaimable doesn't mean "we have no free pages in the zone" but "we have no more reclaimable pages in the zone".
> That can happen when the reserve memory for higher zones is set very high.
> If you think it's awkward, we could add a description of that to Documentation.
this doesn't fix it, because all_unreclaimable ==
zone_reclaimable_pages() + 1, zone_unreclaimable() will still be true.
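The thread only states that, with the RFC applied, a marked zone has all_unreclaimable == zone_reclaimable_pages() + 1. The sketch below is a hypothetical reconstruction of that encoding, not the RFC's code: all_unreclaimable holds 0 for "reclaimable", or the LRU size at the time kswapd gave up plus one. struct zone_model, mark_unreclaimable(), free_page_to_zone() and this zone_unreclaimable() are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

struct zone_model {
	unsigned long lru_pages;
	unsigned long free_pages;        /* never consulted by the predicate */
	unsigned long all_unreclaimable; /* 0, or lru size at marking + 1 */
};

/* kswapd gives up: remember the LRU size it saw (offset by one). */
static void mark_unreclaimable(struct zone_model *z)
{
	z->all_unreclaimable = z->lru_pages + 1;
}

/* Freeing a page back to the zone clears the state, as before. */
static void free_page_to_zone(struct zone_model *z)
{
	z->free_pages++;
	z->all_unreclaimable = 0;
}

/* Unreclaimable only while the LRU size is unchanged since marking. */
static bool zone_unreclaimable(const struct zone_model *z)
{
	return z->all_unreclaimable == z->lru_pages + 1;
}
```

A DMA zone marked while its LRU is empty stays "unreclaimable" regardless of its 3900 free pages, which is the case objected to here; once fallback allocations put 500 pages on its LRU, the stored value no longer matches and reclaim is tried again. Freeing a page clears the mark either way.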

> > What's wrong with my original patch? It seems reasonable not to mark a zone
> > unreclaimable if it has a lot of free memory.
> 
> As I said, all_unreclaimable doesn't mean "no free pages in the zone".
> So we shouldn't add new dependency between all_unreclaimable and # of free pages.
> 
> And what's the purpose of the high_wmark check for a specific order there?
> Does it really mean "the zone has no free memory"?
> 
> Your description tries to explain that, and it seems to depend on the outer/inner loops in balance_pgdat.
> (But it's very hard for dumb me to parse :( )
> I don't like such fragile code. What if we change it in the future?
> Of course, we can add enough description about that, but that means more complexity.
> I don't want to add more complexity to kswapd unless we have a good reason.
> 
> The most important thing is that the problem isn't related to the # of free pages.
> As I stated in my patch, the problem is caused by not considering changes in the lru size.
> I would like to target the principal cause.
ok, if unreclaimable means no reclaimable pages, I'd consider the
original method (regarding a zone without lru pages as reclaimable), which
is simpler. My original objection was that that method would regard a zone which is
full and has no lru pages as reclaimable, but as you said, reclaimability
isn't related to free pages.
I can't get a clear meaning for all_unreclaimable with your new patch.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-08  5:48                   ` Shaohua Li
@ 2011-10-08  9:35                     ` Minchan Kim
  2011-10-09  6:08                       ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-10-08  9:35 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Sat, Oct 08, 2011 at 01:48:21PM +0800, Shaohua Li wrote:
> On Sat, 2011-10-08 at 12:32 +0800, Minchan Kim wrote:
> > On Sat, Oct 08, 2011 at 11:09:51AM +0800, Shaohua Li wrote:
> > > On Sat, 2011-10-01 at 14:59 +0800, Minchan Kim wrote:
> > > > On Fri, Sep 30, 2011 at 10:12:23AM +0800, Shaohua Li wrote:
> > > > > On Thu, 2011-09-29 at 17:18 +0800, Minchan Kim wrote:
> > > > > > On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> > > > > > > On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > > > > > > > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > > > > > > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > > > > > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > > > > > > > I saw DMA zone always has ->all_unreclaimable set. The reason is the high zones
> > > > > > > > > > > are big, so zone_watermark_ok/_safe() will always return false with a high
> > > > > > > > > > > classzone_idx for DMA zone, because DMA zone's lowmem_reserve is big for a high
> > > > > > > > > > > classzone_idx. When kswapd runs into DMA zone, it doesn't scan/reclaim any
> > > > > > > > > > > pages (no pages on the LRU), but marks the zone as all_unreclaimable. This can
> > > > > > > > > > > happen in other low zones too.
> > > > > > > > > >
> > > > > > > > > > Good catch!
> > > > > > > > > >
> > > > > > > > > > > This is confusing and can potentially cause an oom. Say a low zone has
> > > > > > > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > > > > > > some pages in the low zone (for example, reading a blkdev with highmem support)
> > > > > > > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > > > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > > > > > > If all_unreclaimable is unset, the inner loop of balance_pgdat always ends up with
> > > > > > > > > > > all_zones_ok == 0 when checking a low zone's watermark. If the high zone's watermark
> > > > > > > > > > > isn't good, there is no problem. Otherwise, we might loop one more time in the outer
> > > > > > > > > > > loop, but since the high zone's watermark is ok, end_zone will be lower, so the low
> > > > > > > > > > > zone's watermark check will pass and the outer loop will break. So it looks like this
> > > > > > > > > > > doesn't cause any problem.
> > > > > > > > > >
> > > > > > > > > > I think it would be better to correct zone_reclaimable().
> > > > > > > > > > My point is that zone_reclaimable() should consider zone->pages_scanned.
> > > > > > > > > > The point of the function is to compare how many pages were scanned with how many pages remain on the LRU.
> > > > > > > > > > If the reclaimer doesn't scan the zone at all because it has no lru pages, it shouldn't declare
> > > > > > > > > > the zone all_unreclaimable.
> > > > > > > > > actually this is exactly my first version of the patch. The problem is that if
> > > > > > > > > a zone is truly unreclaimable (used by kernel pages or whatever), we will
> > > > > > > > > have zone->pages_scanned == 0 too. I thought we should set
> > > > > > > > > all_unreclaimable in this case.
> > > > > > > >
> > > > > > > > Let's rethink the problem.
> > > > > > > > The fundamental problem is why the lower zone's lowmem_reserve for a higher zone is so huge
> > > > > > > > that it might be bigger than the zone's size.
> > > > > > > > I think we need a bound on lowmem_reserve.
> > > > > > > > So how about this?
> > > > > > > I don't see a reason why a high zone allocation should fall back to a low
> > > > > > > zone just because the high zone is big. Changing the lowmem_reserve can cause that
> > > > > > > fallback. Is there any rationale here?
> > > > > >
> > > > > > I tried to think of a better solution than yours but failed. :(
> > > > > > The reason I tried to avoid your patch is that kswapd is very complicated these days, so
> > > > > > I didn't want to add more logic for handling corner cases if we could solve it
> > > > > > in other ways. But, as I said, I failed.
> > > > > >
> > > > > > My previous patch that limits lowmem_reserve doesn't make sense,
> > > > > > because we can have a very big higher zone, so lowmem_reserve[higher_zone] of a
> > > > > > low zone can legitimately be bigger than the low zone's own size.
> > > > > > That implies the low zone should not be used for higher-zone allocations.
> > > > > > There is no reason to limit it. My brain was broken. :(
> > > > > >
> > > > > > But I still have a question about your patch.
> > > > > > What happens if the DMA zone has zone->all_unreclaimable set to 1?
> > > > > >
> > > > > > You said as follows,
> > > > > >
> > > > > > > This is confusing and can potentially cause an oom. Say a low zone has
> > > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > > some pages in the low zone (for example, reading a blkdev with highmem support)
> > > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > >
> > > > > > If the low zone had enough pages for the allocation, it could not have entered reclaim.
> > > > > > It means the low zone doesn't have enough free pages for an allocation of that order.
> > > > > > So it's natural to enter the reclaim path.
> > > > > >
> > > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > > >
> > > > > > "Nothing" isn't correct. At least it will do something at DEF_PRIORITY.
> > > > > it does something, but might not reclaim any pages. For example, it
> > > > > starts page writeback at DEF_PRIORITY, but the page isn't on disk yet, and it
> > > > > skips further reclaim at !DEF_PRIORITY.
> > > > >
> > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > >
> > > > > > You said the cause of this problem is that the zone has no lru pages.
> > > > > > Then how could we reclaim any pages in the zone even if the zone's all_unreclaimable is unset?
> > > > > > Do you expect slab pages?
> > > > > The zone could have lru pages. Let's take an example: an allocation from
> > > > > ZONE_HIGHMEM, then kswapd runs, and ZONE_NORMAL gets all_unreclaimable set
> > > > > even though it has free pages. Then we write to a block device, which uses
> > > > > ZONE_NORMAL for page cache. Some pages in ZONE_NORMAL are on the lru, then
> > > > > we run into direct page reclaim for ZONE_NORMAL. Since all_unreclaimable
> > > > > is set and the pages on the ZONE_NORMAL lru are dirty, direct reclaim could
> > > > > fail. But I'd agree this is a corner case.
> > > > > Besides, when I saw ZONE_DMA with a lot of free pages and
> > > > > all_unreclaimable set, it was really confusing.
> > > >
> > > > Hi Shaohua,
> > > > Sorry for the late response, and thanks for your explanation.
> > > > I think it's worth fixing.
> > > > How about this?
> > > >
> > > > I hope other folks are interested in the problem.
> > > > Cced them.
> > > Hi,
> > > it's a long holiday here, so I'm late, sorry.
> > 
> > No problem. I couldn't access the internet freely, either.
> > 
> > >
> > > > From 070d5b1a69921bc71c6aaa5445fb1d29ecb38f74 Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim <minchan.kim@gmail.com>
> > > > Date: Sat, 1 Oct 2011 15:26:08 +0900
> > > > Subject: [RFC] vmscan: set all_unreclaimable of zone carefully
> > > >
> > > > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > > > because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> > > > could be big.
> > > >
> > > > It could be a problem as follows
> > > >
> > > > Assumptions:
> > > > 1. The system has a big high memory zone, so lowmem_reserve[HIGH] of the DMA zone would be big.
> > > > 2. The HIGH/NORMAL zones are full but the DMA zone has enough free pages.
> > > >
> > > > Scenario
> > > > 1. A requests to allocate a page in the HIGH zone.
> > > > 2. The HIGH/NORMAL zones have already consumed lots of pages, so the request falls back to the DMA zone.
> > > > 3. In the DMA zone, the allocation fails too, because lowmem_reserve[HIGH] is very big, so it wakes up kswapd.
> > > > 4. kswapd calls shrink_zone when it reaches the DMA zone, since the DMA zone's lowmem_reserve[HIGHMEM]
> > > >    is so big that it can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> > > >    *end_zone*).
> > > > 5. The DMA zone doesn't meet the stop condition (nr_slab != 0, !zone_reclaimable) because it has few lru pages
> > > >    and no slab pages, so kswapd easily sets all_unreclaimable of the zone to *1*.
> > > > 6. B requests to allocate many pages in the NORMAL zone, but the NORMAL zone has no free pages,
> > > >    so the request falls back to the DMA zone.
> > > > 7. The DMA zone allocates many pages for the NORMAL request because lowmem_reserve[NORMAL] is small.
> > > >    These pages are used by the application (i.e., they are LRU pages. Yes, now the DMA zone could have many reclaimable pages.)
> > > > 8. C requests to allocate a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages.
> > > >    (Most of the pages in the DMA zone were consumed by B.)
> > > > 9. kswapd tries to reclaim lru pages in the DMA zone but fails because all_unreclaimable of the zone is 1. Otherwise,
> > > >    it could reclaim many of the pages used by B.
> > > >
> > > > Of course, we can do something at DEF_PRIORITY, but it can't do enough because it can't trigger
> > > > synchronous reclaim in the direct reclaim path if the zone has many dirty pages
> > > > so that the process is killed by OOM.
> > > >
> > > > The principal problem is caused by step 8.
> > > > In step 8, the lru size has grown a lot, but zone->all_unreclaimable is still 1.
> > > > If the lru size grows, it is worth trying to reclaim again.
> > > > The rationale is that we reset all_unreclaimable to 0 even if we free just one page.
> > > So this fixes the oom, but the DMA zone still always has all_unreclaimable
> > > set, because all_unreclaimable == zone_reclaimable_pages() + 1. Isn't that a
> > > problem?
> > 
> > I think we can fix it if it is needed.
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 471b20b..ede852c 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1019,7 +1019,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> >                    "\n  all_unreclaimable: %u"
> >                    "\n  start_pfn:         %lu"
> >                    "\n  inactive_ratio:    %u",
> > -                  zone->all_unreclaimable,
> > +                  zone_unreclaimable(zone),
> >                    zone->zone_start_pfn,
> >                    zone->inactive_ratio);
> >         seq_putc(m, '\n');
> > 
> > I don't think it's a big problem in the first place.
> > all_unreclaimable doesn't mean "we have no free pages in the zone" but "we have no more reclaimable pages in the zone".
> > That can happen when the reserve memory for higher zones is set very high.
> > If you think it's awkward, we could add a description of that to Documentation.
> this doesn't fix it, because all_unreclaimable ==
> zone_reclaimable_pages() + 1, zone_unreclaimable() will still be true.


We mark the zone all_unreclaimable when the VM has scanned the zone heavily without reclaiming a page,
or when it has no lru pages to scan.
So once all_unreclaimable of the zone is set to 1, it is natural for it to remain 1
until a page is freed or another LRU allocation happens.

What should be fixed?

> 
> > > What's wrong with my original patch? It seems reasonable not to mark a zone
> > > unreclaimable if it has a lot of free memory.
> > 
> > As I said, all_unreclaimable doesn't mean "no free pages in the zone".
> > So we shouldn't add new dependency between all_unreclaimable and # of free pages.
> > 
> > And what's the purpose of the high_wmark check for a specific order there?
> > Does it really mean "the zone has no free memory"?
> > 
> > Your description tries to explain that, and it seems to depend on the outer/inner loops in balance_pgdat.
> > (But it's very hard for dumb me to parse :( )
> > I don't like such fragile code. What if we change it in the future?
> > Of course, we can add enough description about that, but that means more complexity.
> > I don't want to add more complexity to kswapd unless we have a good reason.
> > 
> > The most important thing is that the problem isn't related to the # of free pages.
> > As I stated in my patch, the problem is caused by not considering changes in the lru size.
> > I would like to target the principal cause.
> ok, if unreclaimable means no reclaimable pages, I'd consider the
> original method (regarding a zone without lru pages as reclaimable), which

Why do we have to consider a zone which has no lru pages as reclaimable?
It doesn't make sense to me.

> is simpler. My original objection was that that method would regard a zone which is
> full and has no lru pages as reclaimable, but as you said, reclaimability
> isn't related to free pages.

Yes. It's not related to free pages but to lru pages.
If we have no lru pages, we should mark the zone all_unreclaimable.

Of course, we should consider slab pages too, but it's not easy for the VM to reclaim
slab pages compared to LRU pages, so I decided to ignore them in this version,
but if anyone complains about that, I will consider them, too.

> I can't get a clear meaning for all_unreclaimable with your new patch.

What I mean by all_unreclaimable is that the zone has no reclaimable pages, which
could be lru pages or slab pages. I ignore slab pages in this version.
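Read literally, this definition says a zone is unreclaimable when it has nothing that zone_reclaimable_pages() would count. A minimal sketch, assuming the 3.x-era helper counts the file LRU lists always and the anon LRU lists only when swap is available (nr_swap_pages > 0); slab is ignored, as in this version of the patch, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the NR_{ACTIVE,INACTIVE}_{FILE,ANON} per-zone counters. */
struct zone_lru_counts {
	unsigned long active_file, inactive_file;
	unsigned long active_anon, inactive_anon;
};

/*
 * Modelled on zone_reclaimable_pages(): file pages always count, anon
 * pages only when there is swap space to write them to.
 */
static unsigned long reclaimable_lru_pages(const struct zone_lru_counts *c,
					   long nr_swap_pages)
{
	unsigned long nr = c->active_file + c->inactive_file;

	if (nr_swap_pages > 0)
		nr += c->active_anon + c->inactive_anon;
	return nr;
}

/* The proposed definition, ignoring slab: nothing on the LRU to reclaim. */
static bool zone_has_nothing_to_reclaim(const struct zone_lru_counts *c,
					long nr_swap_pages)
{
	return reclaimable_lru_pages(c, nr_swap_pages) == 0;
}
```

A zone whose LRU holds only anonymous pages (100 active, 200 inactive) has nothing to reclaim on a system without swap, but 300 reclaimable pages once swap is configured.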


> 
> Thanks,
> Shaohua
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-08  9:35                     ` Minchan Kim
@ 2011-10-09  6:08                       ` Shaohua Li
  2011-10-09  7:45                         ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2011-10-09  6:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Sat, 2011-10-08 at 17:35 +0800, Minchan Kim wrote:
> On Sat, Oct 08, 2011 at 01:48:21PM +0800, Shaohua Li wrote:
> > On Sat, 2011-10-08 at 12:32 +0800, Minchan Kim wrote:
> > > On Sat, Oct 08, 2011 at 11:09:51AM +0800, Shaohua Li wrote:
> > > > On Sat, 2011-10-01 at 14:59 +0800, Minchan Kim wrote:
> > > > > On Fri, Sep 30, 2011 at 10:12:23AM +0800, Shaohua Li wrote:
> > > > > > On Thu, 2011-09-29 at 17:18 +0800, Minchan Kim wrote:
> > > > > > > On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> > > > > > > > On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > > > > > > > > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > > > > > > > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > > > > > > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > > > > > > > > I saw DMA zone always has ->all_unreclaimable set. The reason is the high zones
> > > > > > > > > > > > are big, so zone_watermark_ok/_safe() will always return false with a high
> > > > > > > > > > > > classzone_idx for DMA zone, because DMA zone's lowmem_reserve is big for a high
> > > > > > > > > > > > classzone_idx. When kswapd runs into DMA zone, it doesn't scan/reclaim any
> > > > > > > > > > > > pages (no pages on the LRU), but marks the zone as all_unreclaimable. This can
> > > > > > > > > > > > happen in other low zones too.
> > > > > > > > > > >
> > > > > > > > > > > Good catch!
> > > > > > > > > > >
> > > > > > > > > > > > This is confusing and can potentially cause an oom. Say a low zone has
> > > > > > > > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > > > > > > > some pages in the low zone (for example, reading a blkdev with highmem support)
> > > > > > > > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > > > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > > > > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > > > > > > > If all_unreclaimable is unset, the inner loop of balance_pgdat always ends up with
> > > > > > > > > > > > all_zones_ok == 0 when checking a low zone's watermark. If the high zone's watermark
> > > > > > > > > > > > isn't good, there is no problem. Otherwise, we might loop one more time in the outer
> > > > > > > > > > > > loop, but since the high zone's watermark is ok, end_zone will be lower, so the low
> > > > > > > > > > > > zone's watermark check will pass and the outer loop will break. So it looks like this
> > > > > > > > > > > > doesn't cause any problem.
> > > > > > > > > > >
> > > > > > > > > > > I think it would be better to correct zone_reclaimable().
> > > > > > > > > > > My point is that zone_reclaimable() should consider zone->pages_scanned.
> > > > > > > > > > > The point of the function is to compare how many pages were scanned with how many pages remain on the LRU.
> > > > > > > > > > > If the reclaimer doesn't scan the zone at all because it has no lru pages, it shouldn't declare
> > > > > > > > > > > the zone all_unreclaimable.
> > > > > > > > > > actually this is exactly my first version of the patch. The problem is that if
> > > > > > > > > > a zone is truly unreclaimable (used by kernel pages or whatever), we will
> > > > > > > > > > have zone->pages_scanned == 0 too. I thought we should set
> > > > > > > > > > all_unreclaimable in this case.
> > > > > > > > >
> > > > > > > > > Let's rethink the problem.
> > > > > > > > > The fundamental problem is why the lower zone's lowmem_reserve for a higher zone is so huge
> > > > > > > > > that it might be bigger than the zone's size.
> > > > > > > > > I think we need a bound on lowmem_reserve.
> > > > > > > > > So how about this?
> > > > > > > > I don't see a reason why a high zone allocation should fall back to a low
> > > > > > > > zone just because the high zone is big. Changing the lowmem_reserve can cause that
> > > > > > > > fallback. Is there any rationale here?
> > > > > > >
> > > > > > > I tried to think of a better solution than yours but failed. :(
> > > > > > > The reason I tried to avoid your patch is that kswapd is very complicated these days, so
> > > > > > > I didn't want to add more logic for handling corner cases if we could solve it
> > > > > > > in other ways. But, as I said, I failed.
> > > > > > >
> > > > > > > My previous patch that limits lowmem_reserve doesn't make sense,
> > > > > > > because we can have a very big higher zone, so lowmem_reserve[higher_zone] of a
> > > > > > > low zone can legitimately be bigger than the low zone's own size.
> > > > > > > That implies the low zone should not be used for higher-zone allocations.
> > > > > > > There is no reason to limit it. My brain was broken. :(
> > > > > > >
> > > > > > > But I still have a question about your patch.
> > > > > > > What happens if the DMA zone has zone->all_unreclaimable set to 1?
> > > > > > >
> > > > > > > You said as follows,
> > > > > > >
> > > > > > > > This is confusing and can potentially cause an oom. Say a low zone has
> > > > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > > > some pages in the low zone (for example, reading a blkdev with highmem support)
> > > > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > >
> > > > > > > If the low zone had enough pages for the allocation, it could not have entered reclaim.
> > > > > > > It means the low zone doesn't have enough free pages for an allocation of that order.
> > > > > > > So it's natural to enter the reclaim path.
> > > > > > >
> > > > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > > > >
> > > > > > > "Nothing" isn't correct. At least it will do something at DEF_PRIORITY.
> > > > > > it does something, but might not reclaim any pages. For example, it
> > > > > > starts page writeback at DEF_PRIORITY, but the page isn't on disk yet, and it
> > > > > > skips further reclaim at !DEF_PRIORITY.
> > > > > >
> > > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > >
> > > > > > > You said the cause of this problem is that the zone has no lru pages.
> > > > > > > Then how could we reclaim any pages in the zone even if the zone's all_unreclaimable is unset?
> > > > > > > Do you expect slab pages?
> > > > > > The zone could have lru pages. Let's take an example: an allocation from
> > > > > > ZONE_HIGHMEM, then kswapd runs, and ZONE_NORMAL gets all_unreclaimable set
> > > > > > even though it has free pages. Then we write to a block device, which uses
> > > > > > ZONE_NORMAL for page cache. Some pages in ZONE_NORMAL are on the lru, then
> > > > > > we run into direct page reclaim for ZONE_NORMAL. Since all_unreclaimable
> > > > > > is set and the pages on the ZONE_NORMAL lru are dirty, direct reclaim could
> > > > > > fail. But I'd agree this is a corner case.
> > > > > > Besides, when I saw ZONE_DMA with a lot of free pages and
> > > > > > all_unreclaimable set, it was really confusing.
> > > > >
> > > > > Hi Shaohua,
> > > > > Sorry for the late response, and thanks for your explanation.
> > > > > I think it's worth fixing.
> > > > > How about this?
> > > > >
> > > > > I hope other folks are interested in the problem.
> > > > > Cced them.
> > > > Hi,
> > > > it's a long holiday here, so I'm late, sorry.
> > >
> > > No problem. I couldn't access the internet freely, either.
> > >
> > > >
> > > > > From 070d5b1a69921bc71c6aaa5445fb1d29ecb38f74 Mon Sep 17 00:00:00 2001
> > > > > From: Minchan Kim <minchan.kim@gmail.com>
> > > > > Date: Sat, 1 Oct 2011 15:26:08 +0900
> > > > > Subject: [RFC] vmscan: set all_unreclaimable of zone carefully
> > > > >
> > > > > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > > > > because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> > > > > could be big.
> > > > >
> > > > > It could be a problem as follows
> > > > >
> > > > > Assumption :
> > > > > 1. The system has a big high memory so that lowmem_reserve[HIGH] of DMA zone would be big.
> > > > > 2. HIGH/NORMAL zone are full but DMA zone has enough free pages.
> > > > >
> > > > > Scenario
> > > > > 1. A request to allocate a page in HIGH zone.
> > > > > 2. HIGH/NORMAL zone already consumes lots of pages so that it would be fall-backed to DMA zone.
> > > > > 3. In DMA zone, allocator got failed, too becuase lowmem_reserve[HIGH] is very big so that it wakes up kswapd
> > > > > 4. kswapd would call shrink_zone while it see DMA zone since DMA zone's lowmem_reserve[HIGHMEM]
> > > > >    would be big so that it couldn't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> > > > >    *end_zone*)
> > > > > 5. DMA zone doesn't meet stop condition(nr_slab != 0, !zone_reclaimable) because the zone has small lru pages
> > > > >    and it doesn't have slab pages so that kswapd would set all_unreclaimable of the zone to *1* easily.
> > > > > 6. B request to allocate many pages in NORMAL zone but NORMAL zone has no free pages
> > > > >    so that it would be fall-backed to DMA zone.
> > > > > 7. DMA zone would allocates many pages for NORMAL zone because lowmem_reserve[NORMAL] is small.
> > > > >    These pages are used by application(ie, it menas LRU pages. Yes. Now DMA zone could have many reclaimable pages)
> > > > > 8. C request to allocate a page in NORMAL zone but he got failed because DMA zone doesn't have enough free pages.
> > > > >    (Most of pages in DMA zone are consumed by B)
> > > > > 9. Kswapd try to reclaim lru pages in DMA zone but got failed because all_unreclaimable of the zone is 1. Otherwise,
> > > > >    it could reclaim many pages which are used by B.
> > > > >
> > > > > Of coures, we can do something in DEF_PRIORITY but it couldn't do enough because it can't raise
> > > > > synchronus reclaim in direct reclaim path if the zone has many dirty pages
> > > > > so that the process is killed by OOM.
> > > > >
> > > > > The principal problem is caused by step 8.
> > > > > In step 8, we increased # of lru size very much but still the zone->all_unreclaimable is 1.
> > > > > If we increase lru size, it is valuable to try reclaiming again.
> > > > > The rationale is that we reset all_unreclaimable to 0 even if we free just a one page.
> > > > So this fixes the oom, but the DMA zone still has all_unreclaimable set
> > > > always, because all_unreclaimable == zone_reclaimable_pages() + 1. Isn't that a
> > > > problem?
> > >
> > > I think we can fix it if it is needed.
> > >
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 471b20b..ede852c 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1019,7 +1019,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > >                    "\n  all_unreclaimable: %u"
> > >                    "\n  start_pfn:         %lu"
> > >                    "\n  inactive_ratio:    %u",
> > > -                  zone->all_unreclaimable,
> > > +                  zone_unreclaimable(zone),
> > >                    zone->zone_start_pfn,
> > >                    zone->inactive_ratio);
> > >         seq_putc(m, '\n');
> > >
> > > I don't think it's a big problem in the first place.
> > > all_unreclaimable doesn't mean "we have no free pages in the zone" but "we have no reclaimable pages left in the zone".
> > > That is possible when the reserve memory for higher zones is set very high.
> > > If you think it's awkward, we could add a description of that in Documentation.
> > This doesn't fix it: because all_unreclaimable ==
> > zone_reclaimable_pages() + 1, zone_unreclaimable() will still be true.
> 
> 
> We set the zone to all_unreclaimable when the VM finds a lot of scanning in the zone without reclaiming a page,
> or when the zone has no lru pages/scanning.
> So if all_unreclaimable of the zone is set to 1, it is natural for it to remain 1
> until a page is freed or another LRU allocation happens.
> 
> What should be fixed?
I mean, if the DMA zone should be all_unreclaimable, i.e. if a zone without
lru pages should be all_unreclaimable, then this isn't a problem.

> >
> > > > What's wrong with my original patch? It seems reasonable that if a zone has
> > > > a lot of free memory, we don't mark it unreclaimable.
> > >
> > > As I said, all_unreclaimable doesn't mean "no free pages in the zone".
> > > So we shouldn't add a new dependency between all_unreclaimable and the # of free pages.
> > >
> > > And what's the purpose of the high_wmark and order-specific check for it?
> > > Does it really mean "the zone has no free memory"?
> > >
> > > Your description tries to explain that, and it seems to depend on the outer/inner loop in balance_pgdat.
> > > (But it's very hard for dumb me to parse :( )
> > > I don't like such fragile code. What if we change it in the future?
> > > Of course, we can add enough description about that, but it means more complexity.
> > > I don't want to add more complexity to kswapd unless we have a good reason.
> > >
> > > The most important thing is that the problem isn't related to the # of free pages.
> > > As I stated in my patch, the problem is caused by not considering changes in LRU size.
> > > I would like to target the principal cause.
> > OK, if unreclaimable means no reclaimable pages, I'd consider the
> > original method (which regards a zone without lru pages as reclaimable), which
> 
> Why do we have to consider a zone which has no lru pages as reclaimable?
> It doesn't make sense to me.
My head is muddled about which zone should be considered reclaimable.

> > is simpler. My original objection was that this method would regard a zone which is
> > full and has no lru pages as reclaimable, but as you said, reclaimable
> > isn't related to free pages.
> 
> Yes. It's not related to free pages but to lru pages.
> If we have no lru pages, we should mark the zone all_unreclaimable.
> 
> Of course, we should consider slab pages, too, but it's not easy for the VM to reclaim
> slab pages compared to LRU pages, so I decided to ignore them in this version,
> but if anyone complains about that, I will consider that, too.
> 
> > I can't get a clear meaning for all_unreclaimable with your new patch.
> 
> What I mean by all_unreclaimable is that the zone has no reclaimable pages, which
> could be lru pages/slab pages. I ignore slab pages in this version.
I now tend to agree with using reclaimable pages to determine if a zone is
reclaimable. Please fix the zoneinfo_show_print() issue and add documentation
about the meaning of all_unreclaimable, then feel free to add my
Reviewed-by: Shaohua Li <shaohua.li@intel.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-09  6:08                       ` Shaohua Li
@ 2011-10-09  7:45                         ` Minchan Kim
  2011-10-11  8:09                           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-10-09  7:45 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, Michal Hocko, mel, Rik van Riel, linux-mm,
	Johannes Weiner, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Sun, Oct 09, 2011 at 02:08:08PM +0800, Shaohua Li wrote:
> On Sat, 2011-10-08 at 17:35 +0800, Minchan Kim wrote:
> > On Sat, Oct 08, 2011 at 01:48:21PM +0800, Shaohua Li wrote:
> > > On Sat, 2011-10-08 at 12:32 +0800, Minchan Kim wrote:
> > > > On Sat, Oct 08, 2011 at 11:09:51AM +0800, Shaohua Li wrote:
> > > > > On Sat, 2011-10-01 at 14:59 +0800, Minchan Kim wrote:
> > > > > > On Fri, Sep 30, 2011 at 10:12:23AM +0800, Shaohua Li wrote:
> > > > > > > On Thu, 2011-09-29 at 17:18 +0800, Minchan Kim wrote:
> > > > > > > > On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> > > > > > > > > On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > > > > > > > > > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > > > > > > > > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > > > > > > > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > > > > > > > > > I saw that the DMA zone always has ->all_unreclaimable set. The reason is that the high zones
> > > > > > > > > > > > > are big, so zone_watermark_ok/_safe() will always return false with a high
> > > > > > > > > > > > > classzone_idx for the DMA zone, because the DMA zone's lowmem_reserve is big for a high
> > > > > > > > > > > > > classzone_idx. When kswapd runs into the DMA zone, it doesn't scan/reclaim any
> > > > > > > > > > > > > pages (no pages in lru), but marks the zone as all_unreclaimable. This can
> > > > > > > > > > > > > happen in other low zones too.
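The failing check described here can be sketched as an order-0 watermark test against `lowmem_reserve[classzone_idx]`. This is a simplified model, not the kernel's code: `struct model_zone`, `model_watermark_ok()` and the zone enum are hypothetical, and the alloc-flag and higher-order adjustments of the real zone_watermark_ok() are left out. As I understand it, the real test likewise fails when free pages do not exceed the mark plus that reserve.

```c
#include <stdbool.h>

/* Toy zone indices; real configurations may also have DMA32 or MOVABLE. */
enum { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_ZONE_HIGHMEM, MODEL_NR_ZONES };

struct model_zone {
	long free_pages;
	/* Pages this zone holds back from allocations whose class zone is [i]. */
	long lowmem_reserve[MODEL_NR_ZONES];
};

/*
 * Order-0 core of the watermark test: the zone passes only if its free
 * pages exceed the watermark plus the reserve it keeps for requests whose
 * preferred (class) zone is classzone_idx.
 */
static bool model_watermark_ok(const struct model_zone *z, long mark,
                               int classzone_idx)
{
	return z->free_pages > mark + z->lowmem_reserve[classzone_idx];
}
```

If the DMA zone's reserve against HIGHMEM allocations is larger than the whole DMA zone, this test can never pass for classzone_idx = HIGHMEM, however many DMA pages are free, so kswapd keeps visiting the zone.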
> > > > > > > > > > > >
> > > > > > > > > > > > Good catch!
> > > > > > > > > > > >
> > > > > > > > > > > > > This is confusing and can potentially cause oom. Say a low zone has
> > > > > > > > > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > > > > > > > > some pages in the low zone (for example, reading a blkdev with highmem support),
> > > > > > > > > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > > > > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > > > > > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > > > > > > > > If all_unreclaimable is unset, in the inner loop of balance_pgdat we always have
> > > > > > > > > > > > > all_zones_ok 0 when checking a low zone's watermark. If the high zone watermark isn't
> > > > > > > > > > > > > good, there is no problem. Otherwise, we might loop one more time in the outer
> > > > > > > > > > > > > loop, but since the high zone watermark is ok, end_zone will be lower, then the low
> > > > > > > > > > > > > zone's watermark check will be ok and the outer loop will break. So it looks like this
> > > > > > > > > > > > > doesn't cause any problem.
> > > > > > > > > > > >
> > > > > > > > > > > > I think it would be better to correct zone_reclaimable.
> > > > > > > > > > > > My point is that zone_reclaimable should consider zone->pages_scanned.
> > > > > > > > > > > > The point of the function is how many pages were scanned vs. how many pages remain in the LRU.
> > > > > > > > > > > > If the reclaimer doesn't scan the zone at all because it has no lru pages, it shouldn't declare
> > > > > > > > > > > > the zone all_unreclaimable.
> > > > > > > > > > > Actually, this is exactly my first version of the patch. The problem is that if
> > > > > > > > > > > a zone is truly unreclaimable (used by kernel pages or whatever), we will
> > > > > > > > > > > have zone->pages_scanned 0 too. I thought we should set
> > > > > > > > > > > all_unreclaimable in this case.
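The two positions above can be compared with a sketch of the scan-ratio heuristic. As I understand kernels of this era, zone_reclaimable() is `pages_scanned < zone_reclaimable_pages() * 6` and kswapd marks a zone when no slab was freed and that test fails; treat the factor of 6 as an assumption. All function names here are hypothetical, and `model_zone_reclaimable_if_scanned()` is only one possible rendering of Minchan's suggestion.

```c
#include <stdbool.h>

/* Scan-ratio heuristic: the zone is still worth scanning while fewer than
 * six times its reclaimable pages have been scanned. */
static bool model_zone_reclaimable(unsigned long pages_scanned,
                                   unsigned long reclaimable_pages)
{
	return pages_scanned < reclaimable_pages * 6;
}

/* kswapd's give-up test: no slab was freed and the zone looks unreclaimable.
 * A zone with no LRU pages fails the heuristic at once, since 0 < 0 is false. */
static bool model_kswapd_marks_unreclaimable(unsigned long nr_slab,
                                             unsigned long pages_scanned,
                                             unsigned long reclaimable_pages)
{
	return nr_slab == 0 &&
	       !model_zone_reclaimable(pages_scanned, reclaimable_pages);
}

/* Hypothetical variant: a zone that was never scanned is not judged
 * unreclaimable. Shaohua's objection is that a zone full of kernel pages
 * also has pages_scanned == 0, so it would never be marked. */
static bool model_zone_reclaimable_if_scanned(unsigned long pages_scanned,
                                              unsigned long reclaimable_pages)
{
	return pages_scanned == 0 ||
	       pages_scanned < reclaimable_pages * 6;
}
```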
> > > > > > > > > >
> > > > > > > > > > Let's think about the problem again.
> > > > > > > > > > The fundamental problem is why the lower zone's lowmem_reserve for a higher zone is so huge
> > > > > > > > > > that it might be bigger than the zone's size.
> > > > > > > > > > I think we need a bound for limiting lowmem_reserve.
> > > > > > > > > > So how about this?
> > > > > > > > > I don't see a reason why a high zone allocation should fall back to a low
> > > > > > > > > zone if the high zone is big. Changing the lowmem_reserve can cause the
> > > > > > > > > fallback. Is there any rationale here?
> > > > > > > >
> > > > > > > > I tried to think of a better solution than yours, but I failed. :(
> > > > > > > > The reason I tried to avoid your patch is that kswapd is very complicated these days, so
> > > > > > > > I didn't want to add more logic for handling corner cases if we could solve it
> > > > > > > > other ways. But as I said, I failed.
> > > > > > > >
> > > > > > > > It seems that my previous patch that limits lowmem_reserve doesn't make sense,
> > > > > > > > because we can have a very big higher zone, so lowmem_reserve[higher_zone] of a
> > > > > > > > low zone can legitimately be bigger than the size of the low zone itself.
> > > > > > > > It implies the low zone should not be used for higher allocations.
> > > > > > > > There is no reason to limit it. My brain was broken. :(
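The claim that a low zone's reserve against a big higher zone can exceed the low zone's own size can be checked with arithmetic. This is a sketch modelled on my understanding of how the kernel derives lowmem_reserve (each lower zone reserves the pages of the zones above it, up to the class zone, divided by a per-zone ratio). `model_lowmem_reserve()` is a hypothetical name, and the zone sizes and ratios used below are assumed example values, not figures from this thread.

```c
#define MODEL_NR_ZONES 3 /* toy layout: 0 = DMA, 1 = NORMAL, 2 = HIGHMEM */

/*
 * For each class zone j, every lower zone idx holds back
 * (pages in zones idx+1 .. j) / ratio[idx] pages.
 * reserve[i][j] is zone i's lowmem_reserve[j].
 */
static void model_lowmem_reserve(const unsigned long present[MODEL_NR_ZONES],
                                 const unsigned long ratio[MODEL_NR_ZONES - 1],
                                 unsigned long reserve[MODEL_NR_ZONES][MODEL_NR_ZONES])
{
	for (int i = 0; i < MODEL_NR_ZONES; i++)
		for (int j = 0; j < MODEL_NR_ZONES; j++)
			reserve[i][j] = 0;

	for (int j = 1; j < MODEL_NR_ZONES; j++) {
		unsigned long pages_above = present[j];

		for (int idx = j - 1; idx >= 0; idx--) {
			/* Guard against a zero ratio. */
			unsigned long r = ratio[idx] ? ratio[idx] : 1;

			reserve[idx][j] = pages_above / r;
			pages_above += present[idx];
		}
	}
}
```

With 4 KiB pages, a 16 MiB DMA zone (4096 pages), an 864 MiB NORMAL zone (221184 pages), a 15 GiB HIGHMEM zone (3932160 pages) and assumed ratios of 256 (DMA) and 32 (NORMAL), the DMA zone's reserve against HIGHMEM allocations comes out at 16224 pages, about four times the size of the DMA zone itself.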
> > > > > > > >
> > > > > > > > But I still have a question about your patch.
> > > > > > > > What happens if the DMA zone has zone->all_unreclaimable set to 1?
> > > > > > > >
> > > > > > > > You said the following:
> > > > > > > >
> > > > > > > > > This is confusing and can potentially cause oom. Say a low zone has
> > > > > > > > > all_unreclaimable set when the high zone doesn't have enough memory. Then we allocate
> > > > > > > > > some pages in the low zone (for example, reading a blkdev with highmem support),
> > > > > > > > > and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > > >
> > > > > > > > If the low zone had enough pages for the allocation, it could not have entered reclaim.
> > > > > > > > That means the low zone now doesn't have enough free pages for an allocation of that order.
> > > > > > > > So it's natural to enter the reclaim path.
> > > > > > > >
> > > > > > > > > direct reclaim might reclaim nothing and an oom is reported. If
> > > > > > > >
> > > > > > > > "Nothing" isn't correct. At least, it will do something at DEF_PRIORITY.
> > > > > > > it does something, but it might not reclaim any pages. For example, it
> > > > > > > starts page writeback at DEF_PRIORITY, but the page isn't on disk yet, and it
> > > > > > > skips further reclaiming at !DEF_PRIORITY.
> > > > > > >
> > > > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > > >
> > > > > > > > You said the reason for this problem is that the zone has no lru pages.
> > > > > > > > Then how could we reclaim any pages in the zone even if the zone's all_unreclaimable is unset?
> > > > > > > > Do you expect slab pages?
> > > > > > > The zone could have lru pages. Take an example: an allocation from
> > > > > > > ZONE_HIGHMEM, then kswapd runs, and ZONE_NORMAL gets all_unreclaimable set
> > > > > > > even though it has free pages. Then we write to a blkdev device, which uses
> > > > > > > ZONE_NORMAL for page cache. Some pages in ZONE_NORMAL are in the lru, then
> > > > > > > we run into direct page reclaim for ZONE_NORMAL. Since all_unreclaimable
> > > > > > > is set and the pages in the ZONE_NORMAL lru are dirty, direct reclaim could
> > > > > > > fail. But I agree this is a corner case.
> > > > > > > Besides, when I see ZONE_DMA with a lot of free pages and
> > > > > > > all_unreclaimable set, it's really confusing.
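The dirty-page corner case can be illustrated with a toy priority loop. I believe direct reclaim in kernels of this era skips a zone with all_unreclaimable set at every priority except DEF_PRIORITY (12); the model below assumes that, and also assumes writeback started on one pass has completed by the next pass. The early exit once enough pages are reclaimed is left out, and all names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_DEF_PRIORITY 12 /* assumed value of DEF_PRIORITY */

struct dirty_zone {
	bool all_unreclaimable;
	unsigned long dirty;     /* dirty LRU pages, not yet under writeback    */
	unsigned long writeback; /* pages whose writeback an earlier pass began */
};

/* One reclaim pass: pages queued for writeback on the previous pass are
 * assumed clean now and are freed; dirty pages only get writeback started. */
static unsigned long model_shrink_zone(struct dirty_zone *z)
{
	unsigned long freed = z->writeback;

	z->writeback = z->dirty;
	z->dirty = 0;
	return freed;
}

/* Direct reclaim's priority loop, reduced to a single zone. */
static unsigned long model_direct_reclaim(struct dirty_zone *z)
{
	unsigned long reclaimed = 0;

	for (int priority = MODEL_DEF_PRIORITY; priority >= 0; priority--) {
		/* Flagged zones are only scanned at DEF_PRIORITY. */
		if (z->all_unreclaimable && priority != MODEL_DEF_PRIORITY)
			continue;
		reclaimed += model_shrink_zone(z);
	}
	return reclaimed;
}
```

In this model a flagged zone full of dirty pages only gets its writeback started and never gets a second pass, so direct reclaim returns 0; the same zone without the flag frees those pages on the next priority.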
> > > > > >
> > > > > > Hi Shaohua,
> > > > > > Sorry for the late response, and thanks for your explanation.
> > > > > > It's valuable to fix, I think.
> > > > > > How about this?
> > > > > >
> > > > > > I hope others are interested in this problem.
> > > > > > CCed them.
> > > > > Hi,
> > > > > It's a long holiday here, so I'm late, sorry.
> > > >
> > > > No problem. I couldn't access the internet freely, either.
> > > >
> > > > >
> > > > > > From 070d5b1a69921bc71c6aaa5445fb1d29ecb38f74 Mon Sep 17 00:00:00 2001
> > > > > > From: Minchan Kim <minchan.kim@gmail.com>
> > > > > > Date: Sat, 1 Oct 2011 15:26:08 +0900
> > > > > > Subject: [RFC] vmscan: set all_unreclaimable of zone carefully
> > > > > >
> > > > > > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > > > > > because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> > > > > > can be big.
> > > > > >
> > > > > > It could be a problem as follows.
> > > > > >
> > > > > > Assumptions:
> > > > > > 1. The system has a lot of high memory, so lowmem_reserve[HIGH] of the DMA zone is big.
> > > > > > 2. The HIGH/NORMAL zones are full but the DMA zone has enough free pages.
> > > > > >
> > > > > > Scenario:
> > > > > > 1. A requests a page in the HIGH zone.
> > > > > > 2. The HIGH/NORMAL zones have already consumed lots of pages, so the allocation falls back to the DMA zone.
> > > > > > 3. The allocation fails in the DMA zone, too, because lowmem_reserve[HIGH] is very big, so it wakes up kswapd.
> > > > > > 4. kswapd calls shrink_zone when it reaches the DMA zone, since the DMA zone's lowmem_reserve[HIGHMEM]
> > > > > >    is so big that the zone can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> > > > > >    *end_zone*).
> > > > > > 5. The DMA zone doesn't meet the stop condition (nr_slab != 0, !zone_reclaimable) because it has few lru pages
> > > > > >    and no slab pages, so kswapd easily sets all_unreclaimable of the zone to *1*.
> > > > > > 6. B requests many pages in the NORMAL zone, but the NORMAL zone has no free pages,
> > > > > >    so the allocation falls back to the DMA zone.
> > > > > > 7. The DMA zone allocates many pages for the NORMAL zone request because lowmem_reserve[NORMAL] is small.
> > > > > >    These pages are used by the application (i.e., they are LRU pages. Yes, now the DMA zone can have many reclaimable pages.)
> > > > > > 8. C requests a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages.
> > > > > >    (Most of the pages in the DMA zone were consumed by B.)
> > > > > > 9. kswapd tries to reclaim lru pages in the DMA zone but fails because all_unreclaimable of the zone is 1. Otherwise,
> > > > > >    it could reclaim many of the pages used by B.
> > > > > >
> > > > > > Of course, we can do something at DEF_PRIORITY, but it can't do enough because it can't raise
> > > > > > synchronous reclaim in the direct reclaim path if the zone has many dirty pages,
> > > > > > so the process is killed by OOM.
> > > > > >
> > > > > > The principal problem is caused by step 8.
> > > > > > In step 8, the lru size has increased very much, but zone->all_unreclaimable is still 1.
> > > > > > If the lru size increases, it is worth trying to reclaim again.
> > > > > > The rationale is that we already reset all_unreclaimable to 0 even if we free just one page.
> > > > > So this fixes the oom, but the DMA zone still has all_unreclaimable set
> > > > > always, because all_unreclaimable == zone_reclaimable_pages() + 1. Isn't that a
> > > > > problem?
> > > >
> > > > I think we can fix it if it is needed.
> > > >
> > > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > > index 471b20b..ede852c 100644
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1019,7 +1019,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > > >                    "\n  all_unreclaimable: %u"
> > > >                    "\n  start_pfn:         %lu"
> > > >                    "\n  inactive_ratio:    %u",
> > > > -                  zone->all_unreclaimable,
> > > > +                  zone_unreclaimable(zone),
> > > >                    zone->zone_start_pfn,
> > > >                    zone->inactive_ratio);
> > > >         seq_putc(m, '\n');
> > > >
> > > > I don't think it's a big problem in the first place.
> > > > all_unreclaimable doesn't mean "we have no free pages in the zone" but "we have no reclaimable pages left in the zone".
> > > > That is possible when the reserve memory for higher zones is set very high.
> > > > If you think it's awkward, we could add a description of that in Documentation.
> > > This doesn't fix it: because all_unreclaimable ==
> > > zone_reclaimable_pages() + 1, zone_unreclaimable() will still be true.
> > 
> > 
> > We set the zone to all_unreclaimable when the VM finds a lot of scanning in the zone without reclaiming a page,
> > or when the zone has no lru pages/scanning.
> > So if all_unreclaimable of the zone is set to 1, it is natural for it to remain 1
> > until a page is freed or another LRU allocation happens.
> > 
> > What should be fixed?
> I mean, if the DMA zone should be all_unreclaimable, i.e. if a zone without
> lru pages should be all_unreclaimable, then this isn't a problem.
> 
> > >
> > > > > What's wrong with my original patch? It seems reasonable that if a zone has
> > > > > a lot of free memory, we don't mark it unreclaimable.
> > > >
> > > > As I said, all_unreclaimable doesn't mean "no free pages in the zone".
> > > > So we shouldn't add a new dependency between all_unreclaimable and the # of free pages.
> > > >
> > > > And what's the purpose of the high_wmark and order-specific check for it?
> > > > Does it really mean "the zone has no free memory"?
> > > >
> > > > Your description tries to explain that, and it seems to depend on the outer/inner loop in balance_pgdat.
> > > > (But it's very hard for dumb me to parse :( )
> > > > I don't like such fragile code. What if we change it in the future?
> > > > Of course, we can add enough description about that, but it means more complexity.
> > > > I don't want to add more complexity to kswapd unless we have a good reason.
> > > >
> > > > The most important thing is that the problem isn't related to the # of free pages.
> > > > As I stated in my patch, the problem is caused by not considering changes in LRU size.
> > > > I would like to target the principal cause.
> > > OK, if unreclaimable means no reclaimable pages, I'd consider the
> > > original method (which regards a zone without lru pages as reclaimable), which
> > 
> > Why do we have to consider a zone which has no lru pages as reclaimable?
> > It doesn't make sense to me.
> My head is muddled about which zone should be considered reclaimable.
> 
> > > is simpler. My original objection was that this method would regard a zone which is
> > > full and has no lru pages as reclaimable, but as you said, reclaimable
> > > isn't related to free pages.
> > 
> > Yes. It's not related to free pages but to lru pages.
> > If we have no lru pages, we should mark the zone all_unreclaimable.
> > 
> > Of course, we should consider slab pages, too, but it's not easy for the VM to reclaim
> > slab pages compared to LRU pages, so I decided to ignore them in this version,
> > but if anyone complains about that, I will consider that, too.
> > 
> > > I can't get a clear meaning for all_unreclaimable with your new patch.
> > 
> > What I mean by all_unreclaimable is that the zone has no reclaimable pages, which
> > could be lru pages/slab pages. I ignore slab pages in this version.
> I now tend to agree with using reclaimable pages to determine if a zone is
> reclaimable. Please fix the zoneinfo_show_print() issue and add documentation
> about the meaning of all_unreclaimable, then feel free to add my
> Reviewed-by: Shaohua Li <shaohua.li@intel.com>

Thanks for your careful review.
I will send a formal version.


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-09  7:45                         ` Minchan Kim
@ 2011-10-11  8:09                           ` KAMEZAWA Hiroyuki
  2011-10-11  9:07                             ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-10-11  8:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Shaohua Li, Andrew Morton, Michal Hocko, mel, Rik van Riel,
	linux-mm, Johannes Weiner, KOSAKI Motohiro

On Sun, 9 Oct 2011 16:45:58 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> Thanks for your careful review.
> I will send a formal version.
> 
> From 49078e0ebccae371b04930ae76dfd5ba158032ca Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Sun, 9 Oct 2011 16:38:40 +0900
> Subject: [PATCH] vmscan: judge zone's all_unreclaimable carefully
> 
> Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> can be big.
>
> It could be a problem as follows.
>
> Assumptions:
> 1. The system has a lot of high memory, so lowmem_reserve[HIGH] of the DMA zone is big.
> 2. The HIGH/NORMAL zones are full but the DMA zone has enough free pages.
>
> Scenario:
> 1. A requests a page in the HIGH zone.
> 2. The HIGH/NORMAL zones have already consumed lots of pages, so the allocation falls back to the DMA zone.
> 3. The allocation fails in the DMA zone, too, because lowmem_reserve[HIGH] is very big, so it wakes up kswapd.
> 4. kswapd calls shrink_zone when it reaches the DMA zone, since the DMA zone's lowmem_reserve[HIGHMEM]
>    is so big that the zone can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
>    *end_zone*).
> 5. The DMA zone doesn't meet the stop condition (nr_slab != 0, !zone_reclaimable) because it has few lru pages
>    and no slab pages, so kswapd easily sets all_unreclaimable of the zone to *1*.
> 6. B requests many pages in the NORMAL zone, but the NORMAL zone has no free pages,
>    so the allocation falls back to the DMA zone.
> 7. The DMA zone allocates many pages for the NORMAL zone request because lowmem_reserve[NORMAL] is small.
>    These pages are used by the application (i.e., they are LRU pages. Yes, now the DMA zone can have many reclaimable pages.)
> 8. C requests a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages.
>    (Most of the pages in the DMA zone were consumed by B.)
> 9. kswapd tries to reclaim lru pages in the DMA zone but fails because all_unreclaimable of the zone is 1. Otherwise,
>    it could reclaim many of the pages used by B.
>
> Of course, we can do something at DEF_PRIORITY, but it can't do enough because it can't raise
> synchronous reclaim in the direct reclaim path if the zone has many dirty pages,
> so the process is killed by OOM.
>
> The principal problem is caused by step 8.
> In step 8, the lru size has increased very much, but zone->all_unreclaimable is still 1.
> If the lru size increases, it is worth trying to reclaim again.
> The rationale is that we already reset all_unreclaimable to 0 even if we free just one page.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reported-by: Shaohua Li <shaohua.li@intel.com>
> Reviewed-by: Shaohua Li <shaohua.li@intel.com>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>

Hmm, so this catches changes of page usage in a zone?
And this will allow catching swap_on() and making a zone reclaimable
even if no page usage changes, right?

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-11  8:09                           ` KAMEZAWA Hiroyuki
@ 2011-10-11  9:07                             ` Minchan Kim
  2011-10-11  9:29                               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2011-10-11  9:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Shaohua Li, Andrew Morton, Michal Hocko, mel, Rik van Riel,
	linux-mm, Johannes Weiner, KOSAKI Motohiro

Hi Kame,

On Tue, Oct 11, 2011 at 05:09:41PM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 9 Oct 2011 16:45:58 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> > Thanks for your careful review.
> > I will send a formal version.
> > 
> > From 49078e0ebccae371b04930ae76dfd5ba158032ca Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan.kim@gmail.com>
> > Date: Sun, 9 Oct 2011 16:38:40 +0900
> > Subject: [PATCH] vmscan: judge zone's all_unreclaimable carefully
> > 
> > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > because the system has a big HIGH zone, so lowmem_reserve[HIGH]
> > can be big.
> >
> > It could be a problem as follows.
> >
> > Assumptions:
> > 1. The system has a lot of high memory, so lowmem_reserve[HIGH] of the DMA zone is big.
> > 2. The HIGH/NORMAL zones are full but the DMA zone has enough free pages.
> >
> > Scenario:
> > 1. A requests a page in the HIGH zone.
> > 2. The HIGH/NORMAL zones have already consumed lots of pages, so the allocation falls back to the DMA zone.
> > 3. The allocation fails in the DMA zone, too, because lowmem_reserve[HIGH] is very big, so it wakes up kswapd.
> > 4. kswapd calls shrink_zone when it reaches the DMA zone, since the DMA zone's lowmem_reserve[HIGHMEM]
> >    is so big that the zone can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> >    *end_zone*).
> > 5. The DMA zone doesn't meet the stop condition (nr_slab != 0, !zone_reclaimable) because it has few lru pages
> >    and no slab pages, so kswapd easily sets all_unreclaimable of the zone to *1*.
> > 6. B requests many pages in the NORMAL zone, but the NORMAL zone has no free pages,
> >    so the allocation falls back to the DMA zone.
> > 7. The DMA zone allocates many pages for the NORMAL zone request because lowmem_reserve[NORMAL] is small.
> >    These pages are used by the application (i.e., they are LRU pages. Yes, now the DMA zone can have many reclaimable pages.)
> > 8. C requests a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages.
> >    (Most of the pages in the DMA zone were consumed by B.)
> > 9. kswapd tries to reclaim lru pages in the DMA zone but fails because all_unreclaimable of the zone is 1. Otherwise,
> >    it could reclaim many of the pages used by B.
> >
> > Of course, we can do something at DEF_PRIORITY, but it can't do enough because it can't raise
> > synchronous reclaim in the direct reclaim path if the zone has many dirty pages,
> > so the process is killed by OOM.
> >
> > The principal problem is caused by step 8.
> > In step 8, the lru size has increased very much, but zone->all_unreclaimable is still 1.
> > If the lru size increases, it is worth trying to reclaim again.
> > The rationale is that we already reset all_unreclaimable to 0 even if we free just one page.
> > 
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Michal Hocko <mhocko@suse.cz>
> > Cc: Johannes Weiner <jweiner@redhat.com>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Reported-by: Shaohua Li <shaohua.li@intel.com>
> > Reviewed-by: Shaohua Li <shaohua.li@intel.com>
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> 
> Hmm, so this catches changes of page usage in a zone?

Not exactly.
It only catches an increase in the zone's lru pages.

> And this will allow catching swap_on() and making a zone reclaimable
> even if no page usage changes, right?

It's not in this patch, but I think it could be another patch.
Could you post it if you really need it?
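Kame's swapon point can be illustrated with a sketch of the reclaimable-page count. As far as I know, zone_reclaimable_pages() in kernels of this era counts anon LRU pages only while swap space is available (nr_swap_pages > 0), so swapon can raise the count without any new allocation. `struct lru_counts` and `model_reclaimable_pages()` are hypothetical names. Whether a given patch reacts to this depends on how it detects LRU growth; per Minchan, this one only tracks additions to the LRU.

```c
/* Per-zone LRU list sizes, in pages (hypothetical container). */
struct lru_counts {
	unsigned long active_file, inactive_file;
	unsigned long active_anon, inactive_anon;
};

/* File pages can always be reclaimed; anon pages only if there is swap
 * space to write them to. */
static unsigned long model_reclaimable_pages(const struct lru_counts *c,
                                             long nr_swap_pages)
{
	unsigned long nr = c->active_file + c->inactive_file;

	if (nr_swap_pages > 0)
		nr += c->active_anon + c->inactive_anon;
	return nr;
}
```

With the same LRU contents, the count jumps from the file pages alone to file plus anon pages the moment swap becomes available.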

> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks, Kame.

> 
> 

-- 
Kind regards,
Minchan Kim


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-11  9:07                             ` Minchan Kim
@ 2011-10-11  9:29                               ` KAMEZAWA Hiroyuki
  2011-10-11  9:36                                 ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-10-11  9:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Shaohua Li, Andrew Morton, Michal Hocko, mel, Rik van Riel,
	linux-mm, Johannes Weiner, KOSAKI Motohiro

On Tue, 11 Oct 2011 18:07:56 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Kame,
> 
> On Tue, Oct 11, 2011 at 05:09:41PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Sun, 9 Oct 2011 16:45:58 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> > > Thanks for your careful review.
> > > I will send a formal version.
> > > 
> > > From 49078e0ebccae371b04930ae76dfd5ba158032ca Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <minchan.kim@gmail.com>
> > > Date: Sun, 9 Oct 2011 16:38:40 +0900
> > > Subject: [PATCH] vmscan: judge zone's all_unreclaimable carefully
> > > 
> > > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > > because the system has a big HIGH zone, so lowmem_reserve[HIGH] of the DMA
> > > zone can be big.
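The effect of lowmem_reserve on the watermark check can be modelled in a few lines. The sketch below is a simplified, user-space stand-in for the order-0 part of the kernel's watermark test (the real __zone_watermark_ok() in mm/page_alloc.c also handles ALLOC_HIGH/ALLOC_HARDER and higher orders). struct toy_zone, the TOY_* zone indices and all page counts are illustrative assumptions, not values from the report.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative zone indices for a 32-bit machine with highmem. */
enum toy_zone_idx { TOY_DMA, TOY_NORMAL, TOY_HIGHMEM, TOY_NR_ZONES };

/* Hypothetical stripped-down zone: only what the order-0 check reads. */
struct toy_zone {
	long free_pages;
	/* Pages held back from allocations whose preferred zone is idx. */
	long lowmem_reserve[TOY_NR_ZONES];
};

/*
 * Order-0 core of the watermark test: the zone is "ok" for an allocation
 * whose preferred (class) zone is classzone_idx only if its free pages
 * exceed the watermark plus the reserve kept back for that class zone.
 */
static bool toy_watermark_ok(const struct toy_zone *z, long mark,
			     enum toy_zone_idx classzone_idx)
{
	assert(classzone_idx < TOY_NR_ZONES);
	return z->free_pages > mark + z->lowmem_reserve[classzone_idx];
}
```

With a DMA zone holding 3000 free pages and a watermark of 32, a reserve of 900 pages for NORMAL lets the check pass, while a reserve of 60000 pages for HIGHMEM (standing in for a large highmem zone) makes it fail; since such a reserve is bigger than the whole DMA zone, no amount of reclaim inside the DMA zone can satisfy a HIGHMEM-class check.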
> > > 
> > > This can cause a problem as follows.
> > > 
> > > Assumptions:
> > > 1. The system has a lot of high memory, so lowmem_reserve[HIGH] of the DMA zone is big.
> > > 2. The HIGH/NORMAL zones are full, but the DMA zone has enough free pages.
> > > 
> > > Scenario:
> > > 1. A requests a page in the HIGH zone.
> > > 2. The HIGH/NORMAL zones have already consumed lots of pages, so the allocation falls back to the DMA zone.
> > > 3. The allocation fails in the DMA zone too, because lowmem_reserve[HIGH] is very big, so the allocator wakes up kswapd.
> > > 4. kswapd calls shrink_zone when it reaches the DMA zone, because the DMA zone's lowmem_reserve[HIGHMEM]
> > >    is so big that the zone can't satisfy zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> > >    *end_zone*).
> > > 5. The DMA zone doesn't meet the condition that keeps it reclaimable (nr_slab != 0 || zone_reclaimable(zone)),
> > >    because it has few LRU pages and no slab pages, so kswapd easily sets its all_unreclaimable to *1*.
> > > 6. B requests many pages in the NORMAL zone, but the NORMAL zone has no free pages,
> > >    so the allocation falls back to the DMA zone.
> > > 7. The DMA zone allocates many pages for the NORMAL request because lowmem_reserve[NORMAL] is small.
> > >    These pages are used by the application (i.e., they are LRU pages, so the DMA zone now has many reclaimable pages).
> > > 8. C requests a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages
> > >    (most of the DMA zone's pages were consumed by B).
> > > 9. kswapd tries to reclaim LRU pages in the DMA zone but fails because the zone's all_unreclaimable is 1.
> > >    Otherwise, it could reclaim many of the pages used by B.
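Steps 5, 7 and 9 hinge on how kswapd decides to set all_unreclaimable. The sketch below models the check that balance_pgdat() used in kernels of this period: a zone counts as reclaimable while pages_scanned < zone_reclaimable_pages(zone) * 6, and the flag is set when that fails and shrink_slab() freed nothing. The toy_* names are hypothetical, and the model ignores priorities (the real kswapd still looks at a marked zone at DEF_PRIORITY). Note that with an empty LRU, 0 < 0 * 6 is false, so the zone is marked after a single pass.

```c
#include <stdbool.h>

/* Hypothetical toy zone carrying only the reclaim bookkeeping. */
struct toy_zone {
	unsigned long lru_pages;     /* reclaimable LRU pages */
	unsigned long pages_scanned; /* pages scanned since the last free */
	bool all_unreclaimable;
};

/* Modelled on the 3.x heuristic: still worth scanning while fewer than
 * six times the reclaimable LRU pages have been scanned. */
static bool toy_zone_reclaimable(const struct toy_zone *z)
{
	return z->pages_scanned < z->lru_pages * 6;
}

/* One kswapd pass over the zone (steps 5 and 9). An empty LRU yields
 * nr_scanned == 0. */
static void toy_kswapd_pass(struct toy_zone *z, unsigned long nr_scanned,
			    unsigned long nr_slab)
{
	if (z->all_unreclaimable)
		return; /* step 9: this toy model always skips a marked zone */
	z->pages_scanned += nr_scanned;
	if (nr_slab == 0 && !toy_zone_reclaimable(z))
		z->all_unreclaimable = true; /* step 5 */
}

/* Step 7: B's pages land on the DMA zone's LRU. Nothing is freed here,
 * so nothing clears all_unreclaimable. */
static void toy_add_lru_pages(struct toy_zone *z, unsigned long nr_pages)
{
	z->lru_pages += nr_pages;
}
```

After B's pages arrive, toy_zone_reclaimable() would say the zone is worth scanning again, but the stale flag keeps it out of reclaim, which is exactly the gap described in step 9.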
> > > 
> > > Of course, we can do something at DEF_PRIORITY, but it isn't enough, because the direct reclaim path
> > > can't trigger synchronous reclaim if the zone has many dirty pages,
> > > so the process is killed by the OOM killer.
> > > 
> > > The principal problem is caused by step 8.
> > > By step 8, the LRU size has increased a lot, but zone->all_unreclaimable is still 1.
> > > If the LRU size increases, it is worth trying to reclaim again.
> > > The rationale: we already reset all_unreclaimable to 0 when we free even just one page.
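The proposed rule can be sketched as follows. In kernels of this era, freeing pages back to a zone (free_pcppages_bulk() and free_one_page() in mm/page_alloc.c) already clears all_unreclaimable and pages_scanned; the patch's idea is to treat LRU growth the same way. The hook toy_lru_add() is hypothetical and only illustrates that rule, not the patch's actual code.

```c
#include <stdbool.h>

/* Hypothetical toy zone carrying only the reclaim bookkeeping. */
struct toy_zone {
	unsigned long lru_pages;
	unsigned long pages_scanned;
	bool all_unreclaimable;
};

/* The same reset the page allocator performs when pages are freed. */
static void toy_zone_reset_reclaim_state(struct toy_zone *z)
{
	z->all_unreclaimable = false;
	z->pages_scanned = 0;
}

/* Hypothetical hook: growth of the LRU is treated like a free, because
 * the newly added pages are fresh candidates for reclaim. */
static void toy_lru_add(struct toy_zone *z, unsigned long nr_pages)
{
	z->lru_pages += nr_pages;
	if (nr_pages > 0 && z->all_unreclaimable)
		toy_zone_reset_reclaim_state(z);
}
```

Resetting pages_scanned together with the flag mirrors the free path, so the zone gets a full new budget of scans (six times its LRU size in the 3.x heuristic) before it can be marked again.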
> > > 
> > > Cc: Mel Gorman <mel@csn.ul.ie>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Michal Hocko <mhocko@suse.cz>
> > > Cc: Johannes Weiner <jweiner@redhat.com>
> > > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Reported-by: Shaohua Li <shaohua.li@intel.com>
> > > Reviewed-by: Shaohua Li <shaohua.li@intel.com>
> > > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > 
> > Hmm, so this catches changes of page usage in a zone?
> 
> Not exactly.
> It only catches increases in the zone's LRU page count.
> 
Sure.

> > And this will allow catching swap_on() and making a zone reclaimable
> > even if page usage doesn't change, right?
> 
> It's not in this patch, but I think it could be another patch.
> Could you post it if you really need it?
> 
What I mean is: "zone_reclaimable_pages() takes into account whether anon
pages are swappable, so it's already covered."
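Kame's point is that anon LRU pages only count toward a zone's reclaimable pages when swap is available. Below is a simplified model of zone_reclaimable_pages() as it looked in 3.x mm/vmscan.c; the hypothetical struct toy_lru_counts replaces the zone_page_state() counters, and the global nr_swap_pages is passed in as a parameter.

```c
/* Hypothetical per-zone LRU counters (the kernel reads these via
 * zone_page_state()). */
struct toy_lru_counts {
	unsigned long active_file, inactive_file;
	unsigned long active_anon, inactive_anon;
};

/* File pages can always be written back or dropped; anon pages can
 * only be reclaimed if there is swap space to put them in. */
static unsigned long toy_zone_reclaimable_pages(const struct toy_lru_counts *c,
						long nr_swap_pages)
{
	unsigned long nr = c->active_file + c->inactive_file;

	if (nr_swap_pages > 0)
		nr += c->active_anon + c->inactive_anon;
	return nr;
}
```

Because zone_reclaimable() compares pages_scanned against this count, adding swap raises the count seen by the next check for anon-heavy zones even though page usage itself has not changed.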

I have no requirements.

Thanks,
-Kame


* Re: [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages
  2011-10-11  9:29                               ` KAMEZAWA Hiroyuki
@ 2011-10-11  9:36                                 ` Minchan Kim
  0 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2011-10-11  9:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Shaohua Li, Andrew Morton, Michal Hocko, mel, Rik van Riel,
	linux-mm, Johannes Weiner, KOSAKI Motohiro

On Tue, Oct 11, 2011 at 06:29:48PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 11 Oct 2011 18:07:56 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > Hi Kame,
> > 
> > On Tue, Oct 11, 2011 at 05:09:41PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Sun, 9 Oct 2011 16:45:58 +0900
> > > Minchan Kim <minchan.kim@gmail.com> wrote:
> > > > Thanks for your careful review.
> > > > I will send a formal version.
> > > > 
> > > > From 49078e0ebccae371b04930ae76dfd5ba158032ca Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim <minchan.kim@gmail.com>
> > > > Date: Sun, 9 Oct 2011 16:38:40 +0900
> > > > Subject: [PATCH] vmscan: judge zone's all_unreclaimable carefully
> > > > 
> > > > Shaohua Li reported that all_unreclaimable of the DMA zone is always set
> > > > because the system has a big HIGH zone, so lowmem_reserve[HIGH] of the DMA
> > > > zone can be big.
> > > > 
> > > > This can cause a problem as follows.
> > > > 
> > > > Assumptions:
> > > > 1. The system has a lot of high memory, so lowmem_reserve[HIGH] of the DMA zone is big.
> > > > 2. The HIGH/NORMAL zones are full, but the DMA zone has enough free pages.
> > > > 
> > > > Scenario:
> > > > 1. A requests a page in the HIGH zone.
> > > > 2. The HIGH/NORMAL zones have already consumed lots of pages, so the allocation falls back to the DMA zone.
> > > > 3. The allocation fails in the DMA zone too, because lowmem_reserve[HIGH] is very big, so the allocator wakes up kswapd.
> > > > 4. kswapd calls shrink_zone when it reaches the DMA zone, because the DMA zone's lowmem_reserve[HIGHMEM]
> > > >    is so big that the zone can't satisfy zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
> > > >    *end_zone*).
> > > > 5. The DMA zone doesn't meet the condition that keeps it reclaimable (nr_slab != 0 || zone_reclaimable(zone)),
> > > >    because it has few LRU pages and no slab pages, so kswapd easily sets its all_unreclaimable to *1*.
> > > > 6. B requests many pages in the NORMAL zone, but the NORMAL zone has no free pages,
> > > >    so the allocation falls back to the DMA zone.
> > > > 7. The DMA zone allocates many pages for the NORMAL request because lowmem_reserve[NORMAL] is small.
> > > >    These pages are used by the application (i.e., they are LRU pages, so the DMA zone now has many reclaimable pages).
> > > > 8. C requests a page in the NORMAL zone but fails because the DMA zone doesn't have enough free pages
> > > >    (most of the DMA zone's pages were consumed by B).
> > > > 9. kswapd tries to reclaim LRU pages in the DMA zone but fails because the zone's all_unreclaimable is 1.
> > > >    Otherwise, it could reclaim many of the pages used by B.
> > > > 
> > > > Of course, we can do something at DEF_PRIORITY, but it isn't enough, because the direct reclaim path
> > > > can't trigger synchronous reclaim if the zone has many dirty pages,
> > > > so the process is killed by the OOM killer.
> > > > 
> > > > The principal problem is caused by step 8.
> > > > By step 8, the LRU size has increased a lot, but zone->all_unreclaimable is still 1.
> > > > If the LRU size increases, it is worth trying to reclaim again.
> > > > The rationale: we already reset all_unreclaimable to 0 when we free even just one page.
> > > > 
> > > > Cc: Mel Gorman <mel@csn.ul.ie>
> > > > Cc: Rik van Riel <riel@redhat.com>
> > > > Cc: Michal Hocko <mhocko@suse.cz>
> > > > Cc: Johannes Weiner <jweiner@redhat.com>
> > > > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Reported-by: Shaohua Li <shaohua.li@intel.com>
> > > > Reviewed-by: Shaohua Li <shaohua.li@intel.com>
> > > > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > > 
> > > Hmm, so this catches changes of page usage in a zone?
> > 
> > Not exactly.
> > It only catches increases in the zone's LRU page count.
> > 
> Sure.
> 
> > > And this will allow catching swap_on() and making a zone reclaimable
> > > even if page usage doesn't change, right?
> > 
> > It's not in this patch, but I think it could be another patch.
> > Could you post it if you really need it?
> > 
> What I mean is: "zone_reclaimable_pages() takes into account whether anon
> pages are swappable, so it's already covered."

Got it. I thought you were describing a swapon race like this:
when the VM decides the zone is all_unreclaimable, a user could suddenly
run swapon. From then on, we could reclaim anon pages, so we would have to
reset all_unreclaimable.

Anyway, it's an idea. If anyone thinks we should handle it, feel free to post it.
But I am sure.
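The swapon race described above could, in principle, be closed by dropping the flag when swap is added. This is only a sketch of that idea (no such patch is part of this thread): toy_swapon() and the other toy_* names are hypothetical, and the global toy_nr_swap_pages stands in for the kernel's swap accounting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy zone: file vs anon LRU pages plus reclaim state. */
struct toy_zone {
	unsigned long file_pages;
	unsigned long anon_pages;
	unsigned long pages_scanned;
	bool all_unreclaimable;
};

/* Stand-in for the kernel's count of free swap pages; 0 means no swap. */
static long toy_nr_swap_pages;

/* Anon pages only count as reclaimable once swap exists. */
static unsigned long toy_reclaimable_pages(const struct toy_zone *z)
{
	unsigned long nr = z->file_pages;

	if (toy_nr_swap_pages > 0)
		nr += z->anon_pages;
	return nr;
}

/*
 * Hypothetical swapon hook: anon pages may just have become reclaimable,
 * so a verdict reached while there was no swap is dropped on every zone.
 */
static void toy_swapon(struct toy_zone *zones, size_t nr_zones,
		       long nr_new_swap_pages)
{
	toy_nr_swap_pages += nr_new_swap_pages;
	for (size_t i = 0; i < nr_zones; i++) {
		zones[i].all_unreclaimable = false;
		zones[i].pages_scanned = 0;
	}
}
```

Clearing every zone is the simplest choice; a more selective variant could reset only zones with anon pages, at the cost of an extra check per zone.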

> 
> I have no requirements.
> 
> Thanks,
> -Kame
> 

-- 
Kind regards,
Minchan Kim


end of thread

Thread overview: 20+ messages
2011-09-27  7:23 [patch 1/2]vmscan: correct all_unreclaimable for zone without lru pages Shaohua Li
2011-09-27  9:28 ` Michal Hocko
2011-09-28  0:46   ` Shaohua Li
2011-09-28  6:57 ` Minchan Kim
2011-09-28  7:08   ` Shaohua Li
2011-09-28 17:57     ` Minchan Kim
2011-09-29  1:14       ` Shaohua Li
2011-09-29  9:18         ` Minchan Kim
2011-09-30  2:12           ` Shaohua Li
2011-10-01  6:59             ` Minchan Kim
2011-10-08  3:09               ` Shaohua Li
2011-10-08  4:32                 ` Minchan Kim
2011-10-08  5:48                   ` Shaohua Li
2011-10-08  9:35                     ` Minchan Kim
2011-10-09  6:08                       ` Shaohua Li
2011-10-09  7:45                         ` Minchan Kim
2011-10-11  8:09                           ` KAMEZAWA Hiroyuki
2011-10-11  9:07                             ` Minchan Kim
2011-10-11  9:29                               ` KAMEZAWA Hiroyuki
2011-10-11  9:36                                 ` Minchan Kim
