* [PATCH 0/4] various zone_reclaim cleanup
@ 2009-05-13 3:06 KOSAKI Motohiro
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
` (3 more replies)
0 siblings, 4 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13 3:06 UTC (permalink / raw)
To: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter
Cc: kosaki.motohiro
Here are various zone_reclaim-related cleanups.
[1/4] vmscan: change the number of the unmapped files in zone reclaim
[2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
[3/4] vmscan: zone_reclaim use may_swap
[4/4] zone_reclaim_mode is always 0 by default
Please comment.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-13 3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
@ 2009-05-13 3:06 ` KOSAKI Motohiro
2009-05-13 13:31 ` Rik van Riel
` (2 more replies)
2009-05-13 3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
` (2 subsequent siblings)
3 siblings, 3 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13 3:06 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

Subject: [PATCH] vmscan: change the number of the unmapped files in zone reclaim

Documentation/sysctl/vm.txt says:

  A percentage of the total pages in each zone. Zone reclaim will only
  occur if more than this percentage of pages are file backed and unmapped.
  This is to insure that a minimal amount of local pages is still available for
  file I/O even if the node is overallocated.

However, zone_page_state(zone, NR_FILE_PAGES) also counts some pages that are
not file backed (e.g. swapcache, buffer-head).

The right calculation is to use NR_{IN}ACTIVE_FILE.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
 		.isolate_pages = isolate_pages_global,
 	};
 	unsigned long slab_reclaimable;
+	long nr_unmapped_file_pages;
 
 	disable_swap_token();
 	cond_resched();
@@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-		zone_page_state(zone, NR_FILE_MAPPED) >
-		zone->min_unmapped_pages) {
+	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+				 zone_page_state(zone, NR_ACTIVE_FILE) -
+				 zone_page_state(zone, NR_FILE_MAPPED);
+
+	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
@@ -2458,6 +2461,8 @@ int zone_reclaim(struct zone *zone, gfp_
 {
 	int node_id;
 	int ret;
+	long nr_unmapped_file_pages;
+	long nr_slab_reclaimable;
 
 	/*
 	 * Zone reclaim reclaims unmapped file backed pages and
@@ -2469,10 +2474,12 @@ int zone_reclaim(struct zone *zone, gfp_
 	 * if less than a specified percentage of the zone is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
-	    && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
-			<= zone->min_slab_pages)
+	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+				 zone_page_state(zone, NR_ACTIVE_FILE) -
+				 zone_page_state(zone, NR_FILE_MAPPED);
+	nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+	if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
+	    nr_slab_reclaimable <= zone->min_slab_pages)
 		return 0;
 
 	if (zone_is_all_unreclaimable(zone))
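For readers following the arithmetic in patch 1/4, here is a minimal user-space sketch, not kernel code. It assumes a hypothetical struct zone_counters that stands in for the zone_page_state() lookups, and the two helper names are made up for illustration. It only shows how the old and the new estimate of "unmapped file pages" diverge when NR_FILE_PAGES includes pages that are not on the file LRU lists.

```c
/* Hypothetical per-zone counters, standing in for zone_page_state(). */
struct zone_counters {
	unsigned long nr_file_pages;    /* NR_FILE_PAGES: also counts e.g. swapcache */
	unsigned long nr_inactive_file; /* NR_INACTIVE_FILE */
	unsigned long nr_active_file;   /* NR_ACTIVE_FILE */
	unsigned long nr_file_mapped;   /* NR_FILE_MAPPED */
};

/* Old estimate: NR_FILE_PAGES - NR_FILE_MAPPED. */
static long unmapped_old(const struct zone_counters *z)
{
	return (long)z->nr_file_pages - (long)z->nr_file_mapped;
}

/* New estimate from patch 1/4: file LRU pages minus mapped pages. */
static long unmapped_new(const struct zone_counters *z)
{
	return (long)(z->nr_inactive_file + z->nr_active_file) -
	       (long)z->nr_file_mapped;
}
```

With made-up counters of 1000 file pages, 300 inactive and 400 active file LRU pages, and 200 mapped pages, the old estimate is 800 and the new one is 500. The difference is the 300 pages that NR_FILE_PAGES counts but the file LRU lists do not.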
* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
@ 2009-05-13 13:31 ` Rik van Riel
2009-05-14 19:52 ` Christoph Lameter
2009-05-18 3:15 ` Wu Fengguang
2 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 13:31 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: change the number of the unmapped files in zone reclaim

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed.
* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
2009-05-13 13:31 ` Rik van Riel
@ 2009-05-14 19:52 ` Christoph Lameter
2009-05-18 3:15 ` Wu Fengguang
2 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:52 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel

Yup, the use of NR_FILE_PAGES there predates the INACTIVE/ACTIVE stats.

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
2009-05-13 13:31 ` Rik van Riel
2009-05-14 19:52 ` Christoph Lameter
@ 2009-05-18 3:15 ` Wu Fengguang
2009-05-18 3:35 ` KOSAKI Motohiro
2 siblings, 1 reply; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18 3:15 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:06:28PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: change the number of the unmapped files in zone reclaim
>
> Documentation/sysctl/vm.txt says
>
> A percentage of the total pages in each zone. Zone reclaim will only
> occur if more than this percentage of pages are file backed and unmapped.
> This is to insure that a minimal amount of local pages is still available for
> file I/O even if the node is overallocated.
>
> However, zone_page_state(zone, NR_FILE_PAGES) contain some non file backed pages
> (e.g. swapcache, buffer-head)
>
> The right calculation is to use NR_{IN}ACTIVE_FILE.
>
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
>  		.isolate_pages = isolate_pages_global,
>  	};
>  	unsigned long slab_reclaimable;
> +	long nr_unmapped_file_pages;
>
>  	disable_swap_token();
>  	cond_resched();
> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
>  	reclaim_state.reclaimed_slab = 0;
>  	p->reclaim_state = &reclaim_state;
>
> -	if (zone_page_state(zone, NR_FILE_PAGES) -
> -		zone_page_state(zone, NR_FILE_MAPPED) >
> -		zone->min_unmapped_pages) {
> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
> +				 zone_page_state(zone, NR_FILE_MAPPED);

This can possibly go negative.

> +	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
>  		/*
>  		 * Free memory by calling shrink zone with increasing
>  		 * priorities until we have enough memory freed.
> @@ -2458,6 +2461,8 @@ int zone_reclaim(struct zone *zone, gfp_
>  {
>  	int node_id;
>  	int ret;
> +	long nr_unmapped_file_pages;
> +	long nr_slab_reclaimable;
>
>  	/*
>  	 * Zone reclaim reclaims unmapped file backed pages and
> @@ -2469,10 +2474,12 @@ int zone_reclaim(struct zone *zone, gfp_
>  	 * if less than a specified percentage of the zone is used by
>  	 * unmapped file backed pages.
>  	 */
> -	if (zone_page_state(zone, NR_FILE_PAGES) -
> -	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> -	    && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> -			<= zone->min_slab_pages)
> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
> +				 zone_page_state(zone, NR_FILE_MAPPED);

Ditto.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

> +	nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> +	if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
> +	    nr_slab_reclaimable <= zone->min_slab_pages)
>  		return 0;
>
>  	if (zone_is_all_unreclaimable(zone))
* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-18 3:15 ` Wu Fengguang
@ 2009-05-18 3:35 ` KOSAKI Motohiro
2009-05-18 3:53 ` Wu Fengguang
0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-18 3:35 UTC (permalink / raw)
To: Wu Fengguang
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
>>  		.isolate_pages = isolate_pages_global,
>>  	};
>>  	unsigned long slab_reclaimable;
>> +	long nr_unmapped_file_pages;
>>
>>  	disable_swap_token();
>>  	cond_resched();
>> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
>>  	reclaim_state.reclaimed_slab = 0;
>>  	p->reclaim_state = &reclaim_state;
>>
>> -	if (zone_page_state(zone, NR_FILE_PAGES) -
>> -		zone_page_state(zone, NR_FILE_MAPPED) >
>> -		zone->min_unmapped_pages) {
>> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
>> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
>> +				 zone_page_state(zone, NR_FILE_MAPPED);
>
> This can possibly go negative.

Is this a problem?
A negative value means that almost all pages are mapped. Thus

	(nr_unmapped_file_pages > zone->min_unmapped_pages) => 0

is OK, I think.

>
>> +	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
>>  		/*
>>  		 * Free memory by calling shrink zone with increasing
>>  		 * priorities until we have enough memory freed.
* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-18 3:35 ` KOSAKI Motohiro
@ 2009-05-18 3:53 ` Wu Fengguang
2009-05-19 1:11 ` KOSAKI Motohiro
0 siblings, 1 reply; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18 3:53 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Mon, May 18, 2009 at 11:35:31AM +0800, KOSAKI Motohiro wrote:
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
> >>  		.isolate_pages = isolate_pages_global,
> >>  	};
> >>  	unsigned long slab_reclaimable;
> >> +	long nr_unmapped_file_pages;
> >>
> >>  	disable_swap_token();
> >>  	cond_resched();
> >> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
> >>  	reclaim_state.reclaimed_slab = 0;
> >>  	p->reclaim_state = &reclaim_state;
> >>
> >> -	if (zone_page_state(zone, NR_FILE_PAGES) -
> >> -		zone_page_state(zone, NR_FILE_MAPPED) >
> >> -		zone->min_unmapped_pages) {
> >> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> >> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
> >> +				 zone_page_state(zone, NR_FILE_MAPPED);
> >
> > This can possibly go negative.
>
> Is this a problem?
> negative value mean almost pages are mapped. Thus
>
> (nr_unmapped_file_pages > zone->min_unmapped_pages) => 0
>
> is ok, I think.

I wonder why you didn't get a gcc warning, because zone->min_unmapped_pages
is a "unsigned long".

Anyway, add a simple note to the code if it works *implicitly*?

Thanks,
Fengguang

> >
> >> +	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> >>  		/*
> >>  		 * Free memory by calling shrink zone with increasing
> >>  		 * priorities until we have enough memory freed.
* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
2009-05-18 3:53 ` Wu Fengguang
@ 2009-05-19 1:11 ` KOSAKI Motohiro
0 siblings, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19 1:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

> On Mon, May 18, 2009 at 11:35:31AM +0800, KOSAKI Motohiro wrote:
> > >> --- a/mm/vmscan.c
> > >> +++ b/mm/vmscan.c
> > >> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
> > >>  		.isolate_pages = isolate_pages_global,
> > >>  	};
> > >>  	unsigned long slab_reclaimable;
> > >> +	long nr_unmapped_file_pages;
> > >>
> > >>  	disable_swap_token();
> > >>  	cond_resched();
> > >> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
> > >>  	reclaim_state.reclaimed_slab = 0;
> > >>  	p->reclaim_state = &reclaim_state;
> > >>
> > >> -	if (zone_page_state(zone, NR_FILE_PAGES) -
> > >> -		zone_page_state(zone, NR_FILE_MAPPED) >
> > >> -		zone->min_unmapped_pages) {
> > >> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > >> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
> > >> +				 zone_page_state(zone, NR_FILE_MAPPED);
> > >
> > > This can possibly go negative.
> >
> > Is this a problem?
> > negative value mean almost pages are mapped. Thus
> >
> > (nr_unmapped_file_pages > zone->min_unmapped_pages) => 0
> >
> > is ok, I think.
>
> I wonder why you didn't get a gcc warning, because zone->min_unmapped_pages
> is a "unsigned long".
>
> Anyway, add a simple note to the code if it works *implicitly*?

Hm, is my gcc the wrong version?
(gcc version 4.1.2 20070626 (Red Hat 4.1.2-14))

Anyway, you are right. Thanks for the good catch :)
An incremental fix patch is below.


Patch name: vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim-fix.patch
Applied after: vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
---
 mm/vmscan.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2397,7 +2397,9 @@ static int __zone_reclaim(struct zone *z
 		.isolate_pages = isolate_pages_global,
 	};
 	unsigned long slab_reclaimable;
-	long nr_unmapped_file_pages;
+	unsigned long nr_file_pages;
+	unsigned long nr_mapped;
+	unsigned long nr_unmapped_file_pages = 0;
 
 	disable_swap_token();
 	cond_resched();
@@ -2410,9 +2412,11 @@ static int __zone_reclaim(struct zone *z
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
-				 zone_page_state(zone, NR_ACTIVE_FILE) -
-				 zone_page_state(zone, NR_FILE_MAPPED);
+	nr_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+			zone_page_state(zone, NR_ACTIVE_FILE);
+	nr_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	if (likely(nr_file_pages >= nr_mapped))
+		nr_unmapped_file_pages = nr_file_pages - nr_mapped;
 
 	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
 		/*
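The signedness problem raised in this sub-thread can be reproduced outside the kernel. The following is a minimal user-space sketch under stated assumptions: both function names are hypothetical, and the cast in the first function only spells out the conversion that C applies implicitly when a long is compared with an unsigned long. The second function mirrors the clamping idea of the incremental fix.

```c
/*
 * Comparison as in the first version of the patch. When a long is
 * compared with an unsigned long, C converts the long to unsigned long,
 * so a negative value becomes a very large positive one.
 */
static int exceeds_threshold_signed(long nr_unmapped, unsigned long min_unmapped)
{
	/* Explicit cast: the same conversion the compiler performs silently. */
	return (unsigned long)nr_unmapped > min_unmapped;
}

/* Clamped subtraction in the spirit of the incremental fix: never wraps. */
static unsigned long unmapped_file_pages(unsigned long nr_file_pages,
					 unsigned long nr_mapped)
{
	if (nr_file_pages >= nr_mapped)
		return nr_file_pages - nr_mapped;
	return 0;
}
```

With nr_unmapped = -1 and a threshold of 100, the first function reports that the threshold is exceeded, which is the opposite of the intended "=> 0" result. The clamped helper returns 0 for the same situation. As far as I know, gcc reports such comparisons under -Wsign-compare, which for C is part of -Wextra rather than -Wall; that may be why no warning showed up.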
* [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
2009-05-13 3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
@ 2009-05-13 3:06 ` KOSAKI Motohiro
2009-05-13 13:35 ` Rik van Riel
` (2 more replies)
2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
3 siblings, 3 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13 3:06 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim

PF_SWAPWRITE means "ignore write congestion" (see may_write_to_queue()).

Foreground reclaim shouldn't ignore congestion, because writing to a
congested device causes large I/O latency.
That is no better than allocating from a remote node.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2406,7 +2406,7 @@ static int __zone_reclaim(struct zone *z
 	 * and we also need to be able to write out pages for RECLAIM_WRITE
 	 * and RECLAIM_SWAP.
 	 */
-	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	p->flags |= PF_MEMALLOC;
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
@@ -2453,7 +2453,7 @@ static int __zone_reclaim(struct zone *z
 	}
 
 	p->reclaim_state = NULL;
-	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+	current->flags &= ~PF_MEMALLOC;
 	return sc.nr_reclaimed >= nr_pages;
 }
 
* Re: [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
2009-05-13 3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
@ 2009-05-13 13:35 ` Rik van Riel
2009-05-14 19:57 ` Christoph Lameter
2009-05-18 3:33 ` Wu Fengguang
2 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 13:35 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim
>
> PF_SWAPWRITE mean ignore write congestion. (see may_write_to_queue())
>
> foreground reclaim shouldn't ignore it because to write congested device cause
> large IO lantency.
> it isn't better than remote node allocation.

It might be on NUMAQ (which is no longer manufactured), but
your change looks right for every other vaguely modern NUMA
architecture that I know of.

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed.
* Re: [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
2009-05-13 3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
2009-05-13 13:35 ` Rik van Riel
@ 2009-05-14 19:57 ` Christoph Lameter
2009-05-18 3:33 ` Wu Fengguang
2 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:57 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel

On Wed, 13 May 2009, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim
>
> PF_SWAPWRITE mean ignore write congestion. (see may_write_to_queue())
>
> foreground reclaim shouldn't ignore it because to write congested device cause
> large IO lantency.
> it isn't better than remote node allocation.

Zone reclaim by default does not perform writes. RECLAIM_WRITE must be set
for that to be effective.

Acked-by: Christoph Lameter <cl@linux-foundation.org>
* Re: [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
2009-05-13 3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
2009-05-13 13:35 ` Rik van Riel
2009-05-14 19:57 ` Christoph Lameter
@ 2009-05-18 3:33 ` Wu Fengguang
2 siblings, 0 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18 3:33 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:06:51PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim
>
> PF_SWAPWRITE mean ignore write congestion. (see may_write_to_queue())
>
> foreground reclaim shouldn't ignore it because to write congested device cause
> large IO lantency.
> it isn't better than remote node allocation.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>
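To illustrate what "ignore write congestion" means in the changelog of patch 2/4, here is a simplified user-space sketch. It is only loosely modeled on the check the changelog points to, and the real may_write_to_queue() has further cases that are left out. The flag value, the struct and the function name are all hypothetical.

```c
#include <stdbool.h>

#define SKETCH_PF_SWAPWRITE 0x1u /* hypothetical flag value, not the kernel's */

struct sketch_bdi {
	bool write_congested; /* stand-in for a congestion query on the device */
};

/* Return true if reclaim may start writeback to this device. */
static bool sketch_may_write_to_queue(unsigned int task_flags,
				      const struct sketch_bdi *bdi)
{
	if (task_flags & SKETCH_PF_SWAPWRITE)
		return true;          /* flag set: congestion is ignored */
	return !bdi->write_congested; /* otherwise back off when congested */
}
```

In this model, a task with the flag set writes to a congested device anyway, while a task without it backs off. Dropping PF_SWAPWRITE from zone reclaim moves the foreground allocator into the second group.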
* [PATCH 3/4] vmscan: zone_reclaim use may_swap
2009-05-13 3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
2009-05-13 3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
@ 2009-05-13 3:07 ` KOSAKI Motohiro
2009-05-13 11:26 ` Johannes Weiner
` (3 more replies)
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
3 siblings, 4 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13 3:07 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

Subject: [PATCH] vmscan: zone_reclaim use may_swap

Documentation/sysctl/vm.txt says:

  zone_reclaim_mode:

  Zone_reclaim_mode allows someone to set more or less aggressive approaches to
  reclaim memory when a zone runs out of memory. If it is set to zero then no
  zone reclaim occurs. Allocations will be satisfied from other zones / nodes
  in the system.

  This is value ORed together of

  1 = Zone reclaim on
  2 = Zone reclaim writes dirty pages out
  4 = Zone reclaim swaps pages


So "(zone_reclaim_mode & RECLAIM_SWAP) == 0" means that we don't want to
reclaim swap-backed pages. It says nothing about mapped file pages.

Thus, may_swap is a better fit than may_unmap.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2387,8 +2387,8 @@ static int __zone_reclaim(struct zone *z
 	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
-		.may_swap = 1,
+		.may_unmap = 1,
+		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.swap_cluster_max = max_t(unsigned long, nr_pages,
 					SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
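The mapping from zone_reclaim_mode bits to scan flags after patch 3/4 can be sketched in isolation. This is a user-space illustration only: the bit values 1, 2 and 4 come from the quoted documentation, while struct sketch_scan_control and sketch_sc_from_mode() are hypothetical stand-ins for the three scan_control fields the patch touches.

```c
#define RECLAIM_ZONE  (1 << 0) /* 1 = zone reclaim on */
#define RECLAIM_WRITE (1 << 1) /* 2 = zone reclaim writes dirty pages out */
#define RECLAIM_SWAP  (1 << 2) /* 4 = zone reclaim swaps pages */

/* Hypothetical subset of the scan flags set up in __zone_reclaim(). */
struct sketch_scan_control {
	int may_writepage;
	int may_unmap;
	int may_swap;
};

/* Flag selection as it looks after patch 3/4. */
static struct sketch_scan_control sketch_sc_from_mode(int zone_reclaim_mode)
{
	struct sketch_scan_control sc = {
		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
		.may_unmap = 1, /* mapped file pages may always be unmapped */
		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
	};
	return sc;
}
```

With mode 1 (reclaim on, no swap bit) the sketch yields may_unmap = 1 and may_swap = 0, so mapped file pages stay reclaimable while swap-backed pages are left alone. Before the patch the same mode gave may_unmap = 0 and may_swap = 1.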
* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
@ 2009-05-13 11:26 ` Johannes Weiner
2009-05-13 14:43 ` Rik van Riel
` (2 subsequent siblings)
3 siblings, 0 replies; 45+ messages in thread
From: Johannes Weiner @ 2009-05-13 11:26 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:07:30PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: zone_reclaim use may_swap
>
> Documentation/sysctl/vm.txt says
>
> zone_reclaim_mode:
>
> Zone_reclaim_mode allows someone to set more or less aggressive approaches to
> reclaim memory when a zone runs out of memory. If it is set to zero then no
> zone reclaim occurs. Allocations will be satisfied from other zones / nodes
> in the system.
>
> This is value ORed together of
>
> 1 = Zone reclaim on
> 2 = Zone reclaim writes dirty pages out
> 4 = Zone reclaim swaps pages
>
>
> So, "(zone_reclaim_mode & RECLAIM_SWAP) == 0" mean we don't want to reclaim
> swap-backed pages. not mapped file.
>
> Thus, may_swap is better than may_unmap.
>
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
2009-05-13 11:26 ` Johannes Weiner
@ 2009-05-13 14:43 ` Rik van Riel
2009-05-14 19:59 ` Christoph Lameter
2009-05-18 3:35 ` Wu Fengguang
3 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 14:43 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: zone_reclaim use may_swap

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed.
* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
2009-05-13 11:26 ` Johannes Weiner
2009-05-13 14:43 ` Rik van Riel
@ 2009-05-14 19:59 ` Christoph Lameter
2009-05-18 3:35 ` Wu Fengguang
3 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:59 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel

Acked-by: Christoph Lameter <cl@linux-foundation.org>
* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
` (2 preceding siblings ...)
2009-05-14 19:59 ` Christoph Lameter
@ 2009-05-18 3:35 ` Wu Fengguang
3 siblings, 0 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18 3:35 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:07:30PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: zone_reclaim use may_swap
>
> Documentation/sysctl/vm.txt says
>
> zone_reclaim_mode:
>
> Zone_reclaim_mode allows someone to set more or less aggressive approaches to
> reclaim memory when a zone runs out of memory. If it is set to zero then no
> zone reclaim occurs. Allocations will be satisfied from other zones / nodes
> in the system.
>
> This is value ORed together of
>
> 1 = Zone reclaim on
> 2 = Zone reclaim writes dirty pages out
> 4 = Zone reclaim swaps pages
>
>
> So, "(zone_reclaim_mode & RECLAIM_SWAP) == 0" mean we don't want to reclaim
> swap-backed pages. not mapped file.
>
> Thus, may_swap is better than may_unmap.
>
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2387,8 +2387,8 @@ static int __zone_reclaim(struct zone *z
>  	int priority;
>  	struct scan_control sc = {
>  		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> -		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> -		.may_swap = 1,
> +		.may_unmap = 1,
> +		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.swap_cluster_max = max_t(unsigned long, nr_pages,
>  					SWAP_CLUSTER_MAX),
>  		.gfp_mask = gfp_mask,
>

Acked-by: Wu Fengguang <fengguang.wu@intel.com>
* [PATCH 4/4] zone_reclaim_mode is always 0 by default 2009-05-13 3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro ` (2 preceding siblings ...) 2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro @ 2009-05-13 3:08 ` KOSAKI Motohiro 2009-05-13 14:47 ` Rik van Riel ` (3 more replies) 3 siblings, 4 replies; 45+ messages in thread From: KOSAKI Motohiro @ 2009-05-13 3:08 UTC (permalink / raw) To: KOSAKI Motohiro Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter Subject: [PATCH] zone_reclaim_mode is always 0 by default Current linux policy is, if the machine has large remote node distance, zone_reclaim_mode is enabled by default because we've be able to assume to large distance mean large server until recently. Unfrotunately, recent modern x86 CPU (e.g. Core i7, Opeteron) have P2P transport memory controller. IOW it's NUMA from software view. Some Core i7 machine has large remote node distance and zone_reclaim don't fit desktop and small file server. it cause performance degression. Thus, zone_reclaim == 0 is better by default. sorry, HPC gusy. you need to turn zone_reclaim_mode on manually now. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> --- mm/page_alloc.c | 7 ------- 1 file changed, 7 deletions(-) Index: b/mm/page_alloc.c =================================================================== --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p int distance = node_distance(local_node, node); /* - * If another node is sufficiently far away then it is better - * to reclaim pages in a zone before going off node. - */ - if (distance > RECLAIM_DISTANCE) - zone_reclaim_mode = 1; - - /* * We don't want to pressure a particular node. * So adding penalty to the first node in same * distance group to make it round-robin. 
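The changelog above ends with "you need to turn zone_reclaim_mode on manually now". As a rough illustration of what that could look like from userspace, here is a minimal sketch. It assumes the knob is exposed as /proc/sys/vm/zone_reclaim_mode (the vm sysctl documentation describes it under that name) and that the caller has sufficient privileges. write_sysctl_int() and read_sysctl_int() are hypothetical helper names, not kernel or libc API.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write an integer to a sysctl-style text file.
 * Returns 0 on success, -1 on failure.
 */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: read an integer back from the same kind of file.
 * Returns 0 on success, -1 on failure.
 */
static int read_sysctl_int(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Under those assumptions, write_sysctl_int("/proc/sys/vm/zone_reclaim_mode", 1) would have the same effect as "echo 1 > /proc/sys/vm/zone_reclaim_mode" run as root.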
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
@ 2009-05-13 14:47 ` Rik van Riel
2009-05-14 8:20 ` KOSAKI Motohiro
2009-05-13 15:22 ` Robin Holt
` (2 subsequent siblings)
3 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 14:47 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] zone_reclaim_mode is always 0 by default
>
> Current linux policy is, if the machine has large remote node distance,
> zone_reclaim_mode is enabled by default because we've be able to assume to
> large distance mean large server until recently.
>
> Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> memory controller. IOW it's NUMA from software view.
>
> Some Core i7 machine has large remote node distance and zone_reclaim don't
> fit desktop and small file server. it cause performance regression.
>
> Thus, zone_reclaim == 0 is better by default. sorry, HPC guys.
> you need to turn zone_reclaim_mode on manually now.

I'll believe that it causes a performance regression with the
old zone_reclaim behaviour, however the way you tweaked
zone_reclaim should make it behave a lot better, no?

--
All rights reversed.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-13 14:47 ` Rik van Riel
@ 2009-05-14 8:20 ` KOSAKI Motohiro
2009-05-14 11:48 ` Robin Holt
0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-14 8:20 UTC (permalink / raw)
To: Rik van Riel
Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Christoph Lameter, Robin Holt

(cc to Robin)

> KOSAKI Motohiro wrote:
> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
> >
> > Current linux policy is, if the machine has large remote node distance,
> > zone_reclaim_mode is enabled by default because we've be able to assume to
> > large distance mean large server until recently.
> >
> > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> > memory controller. IOW it's NUMA from software view.
> >
> > Some Core i7 machine has large remote node distance and zone_reclaim don't
> > fit desktop and small file server. it cause performance regression.
> >
> > Thus, zone_reclaim == 0 is better by default. sorry, HPC guys.
> > you need to turn zone_reclaim_mode on manually now.
>
> I'll believe that it causes a performance regression with the
> old zone_reclaim behaviour, however the way you tweaked
> zone_reclaim should make it behave a lot better, no?

Unfortunately no.
Zone reclaim has two weaknesses by design.

1.
Zone reclaim doesn't work well when working set size > local node size,
and that can happen easily on a small machine.
If it happens, zone reclaim drops the process's own memory.

Plus, zone reclaim also doesn't fit a DB server. Its processes have a large
working set.

2.
Zone reclaim has an inter-zone balancing issue.

Example: an x86_64 2-node 8G machine has the following zone assignment:

   zone 0 (DMA32):  3GB
   zone 0 (Normal): 1GB
   zone 1 (Normal): 4GB

If the page is allocated from DMA32, you are lucky. DMA32 isn't reclaimed
so frequently. But if it comes from zone0 Normal, you are unlucky:
it is reclaimed very frequently although it is smaller than the other zones.

I know my patch changes the large-server default, but I believe the Linux
default kernel parameters should be adapted to desktop and entry-level
machines.
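To make the zone layout in the message above concrete, the following is a small arithmetic sketch. It reads "zone 0" and "zone 1" in the listing as node 0 and node 1, which is an interpretation rather than something the message states. struct zone_desc and the helper names are illustrative only and are not kernel data structures.

```c
#include <stddef.h>

/* Illustrative description of one zone; not a kernel structure. */
struct zone_desc {
	int node;		/* NUMA node the zone belongs to */
	const char *name;	/* "DMA32", "Normal", ... */
	unsigned long gb;	/* zone size in GB */
};

/* Layout from the example above, read as node 0 and node 1. */
static const struct zone_desc example_zones[] = {
	{ 0, "DMA32",  3 },
	{ 0, "Normal", 1 },
	{ 1, "Normal", 4 },
};

/* Sum of all zone sizes that belong to the given node. */
static unsigned long node_total_gb(const struct zone_desc *z, size_t n, int node)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (z[i].node == node)
			total += z[i].gb;
	return total;
}

/* Integer percentage of its node's memory that zone i holds. */
static unsigned long zone_share_pct(const struct zone_desc *z, size_t n, size_t i)
{
	unsigned long total = node_total_gb(z, n, z[i].node);

	return total ? z[i].gb * 100 / total : 0;
}
```

With this layout the Normal zone on node 0 holds a quarter of that node's 4GB, while the Normal zone on node 1 is four times its size. That small zone is the one the message describes as being reclaimed very frequently.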
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-14 8:20 ` KOSAKI Motohiro
@ 2009-05-14 11:48 ` Robin Holt
2009-05-14 12:02 ` KOSAKI Motohiro
0 siblings, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-14 11:48 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, LKML, linux-mm, Andrew Morton, Christoph Lameter, Robin Holt

> Unfortunately no.
> zone reclaim has two weakness by design.
>
> 1.
> zone reclaim don't works well when workingset size > local node size.
> but it can happen easily on small machine.
> if it happen, zone reclaim drop own process's memory.
>
> Plus, zone reclaim also doesn't fit DB server. its process has large
> workingset.

Large DB server is not your typical desktop application either.

> 2.
> zone reclaim have inter zone balancing issue.
>
> example: x86_64 2node 8G machine has following zone assignment
>
>    zone 0 (DMA32):  3GB
>    zone 0 (Normal): 1GB
>    zone 1 (Normal): 4GB
>
> if the page is allocated from DMA32, you are lucky. DMA32 isn't reclaimed
> so frequently. but if from zone0 Normal, you are unlucky.
> it is very frequent reclaimed although it is small than other zone.

I have seen that behavior on some of our mismatched large systems as well,
although never had one so imbalanced because ia64 only has Normal.

> I know my patch change large server default. but I believe linux
> default kernel parameter adapt to desktop and entry machine.

If this imbalance is an x86_64 only problem, then we could do something
simple like the following untested patch. This leaves the default
for everyone except x86_64.

Robin

------------------------------------------------------------------------

Even if there is a great node distance on x86_64, disable zone reclaim
by default. This was done to handle the imbalanced zone sizes where a
majority of the memory in zone 0 is DMA32 with a small remaining Normal
which will be aggressively reclaimed.

For other architectures, we leave the default behavior.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>

---

 arch/x86/include/asm/topology.h |    2 ++
 include/linux/topology.h        |    5 +++++
 mm/page_alloc.c                 |    2 +-
 3 files changed, 8 insertions(+), 1 deletion(-)
Index: page_reclaim_mode/arch/x86/include/asm/topology.h
===================================================================
--- page_reclaim_mode.orig/arch/x86/include/asm/topology.h	2009-05-14 06:44:20.118925713 -0500
+++ page_reclaim_mode/arch/x86/include/asm/topology.h	2009-05-14 06:44:21.251067716 -0500
@@ -128,6 +128,8 @@ extern unsigned long node_remap_size[];
 
 #endif
 
+#define DEFAULT_ZONE_RECLAIM_MODE	0
+
 /* sched_domains SD_NODE_INIT for NUMA machines */
 #define SD_NODE_INIT (struct sched_domain) {		\
 	.min_interval		= 8,			\
Index: page_reclaim_mode/include/linux/topology.h
===================================================================
--- page_reclaim_mode.orig/include/linux/topology.h	2009-05-14 06:44:20.070919619 -0500
+++ page_reclaim_mode/include/linux/topology.h	2009-05-14 06:44:21.279071382 -0500
@@ -61,6 +61,11 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 20
 #endif
+
+#ifndef DEFAULT_ZONE_RECLAIM_MODE
+#define DEFAULT_ZONE_RECLAIM_MODE 1
+#endif
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
Index: page_reclaim_mode/mm/page_alloc.c
===================================================================
--- page_reclaim_mode.orig/mm/page_alloc.c	2009-05-14 06:44:20.138928363 -0500
+++ page_reclaim_mode/mm/page_alloc.c	2009-05-14 06:44:21.311075244 -0500
@@ -2331,7 +2331,7 @@ static void build_zonelists(pg_data_t *p
 		 * to reclaim pages in a zone before going off node.
 		 */
 		if (distance > RECLAIM_DISTANCE)
-			zone_reclaim_mode = 1;
+			zone_reclaim_mode = DEFAULT_ZONE_RECLAIM_MODE;
 
 		/*
 		 * We don't want to pressure a particular node.
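The untested patch above relies on a common preprocessor pattern: an architecture header may define a default first, and the generic header supplies a fallback under #ifndef. Here is a standalone sketch of that pattern, compiled as an ordinary userspace file. PRETEND_X86 is a made-up macro standing in for "the x86 header was included", and default_reclaim_mode() is an illustrative function, not code from the patch.

```c
/*
 * Stand-in for the arch header: when building "for x86", the default is
 * defined before the generic fallback is seen. PRETEND_X86 is hypothetical.
 */
#ifdef PRETEND_X86
#define DEFAULT_ZONE_RECLAIM_MODE 0
#endif

/* Generic fallbacks, mirroring the include/linux/topology.h hunk. */
#ifndef RECLAIM_DISTANCE
#define RECLAIM_DISTANCE 20
#endif

#ifndef DEFAULT_ZONE_RECLAIM_MODE
#define DEFAULT_ZONE_RECLAIM_MODE 1
#endif

/*
 * Illustrative model of the patched assignment in build_zonelists():
 * a sufficiently distant node selects the (possibly arch-specific) default,
 * otherwise the current mode is left alone.
 */
static int default_reclaim_mode(int distance, int current_mode)
{
	if (distance > RECLAIM_DISTANCE)
		return DEFAULT_ZONE_RECLAIM_MODE;
	return current_mode;
}
```

Built without PRETEND_X86, a distance above 20 yields mode 1 as before. Built with -DPRETEND_X86, the same distance yields 0, which is the behaviour the patch description asks for on x86_64.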
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-14 11:48 ` Robin Holt
@ 2009-05-14 12:02 ` KOSAKI Motohiro
0 siblings, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-14 12:02 UTC (permalink / raw)
To: Robin Holt
Cc: kosaki.motohiro, Rik van Riel, LKML, linux-mm, Andrew Morton, Christoph Lameter

> > Unfortunately no.
> > zone reclaim has two weakness by design.
> >
> > 1.
> > zone reclaim don't works well when workingset size > local node size.
> > but it can happen easily on small machine.
> > if it happen, zone reclaim drop own process's memory.
> >
> > Plus, zone reclaim also doesn't fit DB server. its process has large
> > workingset.
>
> Large DB server is not your typical desktop application either.

Ack.

> > 2.
> > zone reclaim have inter zone balancing issue.
> >
> > example: x86_64 2node 8G machine has following zone assignment
> >
> >    zone 0 (DMA32):  3GB
> >    zone 0 (Normal): 1GB
> >    zone 1 (Normal): 4GB
> >
> > if the page is allocated from DMA32, you are lucky. DMA32 isn't reclaimed
> > so frequently. but if from zone0 Normal, you are unlucky.
> > it is very frequent reclaimed although it is small than other zone.
>
> I have seen that behavior on some of our mismatched large systems as well,
> although never had one so imbalanced because ia64 only has Normal.

Not true. Some ia64 servers have a DMA zone of about 2GB.
SGI ia64 is a special case.

> > I know my patch change large server default. but I believe linux
> > default kernel parameter adapt to desktop and entry machine.
>
> If this imbalance is an x86_64 only problem, then we could do something
> simple like the following untested patch. This leaves the default
> for everyone except x86_64.

It is not x86_64 only. Many 64-bit architectures have a 2 or 4GB DMA zone.
Even so, your patch seems interesting. At least it solves the desktop
users' issue, and we don't need to worry about users in other areas.
Embedded and high-end server users are typically skillful.
They can change kernel parameters by themselves.

>
> Robin
>
> ------------------------------------------------------------------------
>
> Even if there is a great node distance on x86_64, disable zone reclaim
> by default. This was done to handle the imbalanced zone sizes where a
> majority of the memory in zone 0 is DMA32 with a small remaining Normal
> which will be aggressively reclaimed.
>
> For other architectures, we leave the default behavior.
>
> Signed-off-by: Robin Holt <holt@sgi.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
>
> ---
>
>  arch/x86/include/asm/topology.h |    2 ++
>  include/linux/topology.h        |    5 +++++
>  mm/page_alloc.c                 |    2 +-
>  3 files changed, 8 insertions(+), 1 deletion(-)
> Index: page_reclaim_mode/arch/x86/include/asm/topology.h
> ===================================================================
> --- page_reclaim_mode.orig/arch/x86/include/asm/topology.h	2009-05-14 06:44:20.118925713 -0500
> +++ page_reclaim_mode/arch/x86/include/asm/topology.h	2009-05-14 06:44:21.251067716 -0500
> @@ -128,6 +128,8 @@ extern unsigned long node_remap_size[];
>
>  #endif
>
> +#define DEFAULT_ZONE_RECLAIM_MODE	0
> +
>  /* sched_domains SD_NODE_INIT for NUMA machines */
>  #define SD_NODE_INIT (struct sched_domain) {		\
>  	.min_interval		= 8,			\
> Index: page_reclaim_mode/include/linux/topology.h
> ===================================================================
> --- page_reclaim_mode.orig/include/linux/topology.h	2009-05-14 06:44:20.070919619 -0500
> +++ page_reclaim_mode/include/linux/topology.h	2009-05-14 06:44:21.279071382 -0500
> @@ -61,6 +61,11 @@ int arch_update_cpu_topology(void);
>   */
>  #define RECLAIM_DISTANCE 20
>  #endif
> +
> +#ifndef DEFAULT_ZONE_RECLAIM_MODE
> +#define DEFAULT_ZONE_RECLAIM_MODE 1
> +#endif
> +
>  #ifndef PENALTY_FOR_NODE_WITH_CPUS
>  #define PENALTY_FOR_NODE_WITH_CPUS	(1)
>  #endif
> Index: page_reclaim_mode/mm/page_alloc.c
> ===================================================================
> --- page_reclaim_mode.orig/mm/page_alloc.c	2009-05-14 06:44:20.138928363 -0500
> +++ page_reclaim_mode/mm/page_alloc.c	2009-05-14 06:44:21.311075244 -0500
> @@ -2331,7 +2331,7 @@ static void build_zonelists(pg_data_t *p
>  		 * to reclaim pages in a zone before going off node.
>  		 */
>  		if (distance > RECLAIM_DISTANCE)
> -			zone_reclaim_mode = 1;
> +			zone_reclaim_mode = DEFAULT_ZONE_RECLAIM_MODE;
>
>  		/*
>  		 * We don't want to pressure a particular node.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
2009-05-13 14:47 ` Rik van Riel
@ 2009-05-13 15:22 ` Robin Holt
2009-05-14 20:05 ` Christoph Lameter
2009-05-18 3:49 ` Wu Fengguang
2009-05-18 9:09 ` Wu Fengguang
3 siblings, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-13 15:22 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] zone_reclaim_mode is always 0 by default
>
> Current linux policy is, if the machine has large remote node distance,
> zone_reclaim_mode is enabled by default because we've be able to assume to
> large distance mean large server until recently.
>
> Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> memory controller. IOW it's NUMA from software view.
>
> Some Core i7 machine has large remote node distance and zone_reclaim don't
> fit desktop and small file server. it cause performance regression.
>
> Thus, zone_reclaim == 0 is better by default. sorry, HPC guys.
> you need to turn zone_reclaim_mode on manually now.

I am _VERY_ concerned about this change in behavior as it has been the
default for a considerable period of time. I realize it is an easily
changed setting, but it is churn in the default behavior.

Are there any benefits for these small servers to have zone_reclaim
turned on? If you have a large node distance, I would expect they
should benefit _MORE_ than those with small or no node distances.

Are you seeing an impact of the load not distributing pages evenly across
processors instead of a reclaim effect (ie, a single threaded process
faulting in more memory than is node local and expecting those pages to
come from the other node first before doing reclaim)?

Maybe there is a different issue than the ones I am used to thinking
about and I am completely missing the point, please enlighten me.

If this proceeds forward, I would like to propose we at least leave
it on for SGI SN and UV hardware. I can provide a quick patch that
may be a bit ugly because it will depend upon arch specific #defines.
I have not investigated this, but any alternative suggestions are
certainly welcome. Currently, I am envisioning bringing something like
ia64_platform_is("sn2") and is_uv_system into page_alloc.c.

>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Please add me:
Cc: Robin Holt <holt@sgi.com>
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-13 15:22 ` Robin Holt
@ 2009-05-14 20:05 ` Christoph Lameter
2009-05-14 20:23 ` Rik van Riel
2009-05-15 1:02 ` KOSAKI Motohiro
0 siblings, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 20:05 UTC (permalink / raw)
To: Robin Holt; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel

Not having zone reclaim on a NUMA system often means that per node
allocations will fall back. Optimized node local allocations become very
difficult for the page allocator. If the latency penalties are not
significant then this may not matter. The larger the system, the larger
the NUMA latencies become.

One possibility would be to disable zone reclaim for low node numbers.
Enable it only if more than 4 nodes exist?
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-14 20:05 ` Christoph Lameter
@ 2009-05-14 20:23 ` Rik van Riel
2009-05-14 20:31 ` Christoph Lameter
2009-05-15 1:02 ` KOSAKI Motohiro
1 sibling, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2009-05-14 20:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, KOSAKI Motohiro, LKML, linux-mm, Andrew Morton

Christoph Lameter wrote:
> Not having zone reclaim on a NUMA system often means that per node
> allocations will fall back. Optimized node local allocations become very
> difficult for the page allocator. If the latency penalties are not
> significant then this may not matter. The larger the system, the larger
> the NUMA latencies become.
>
> One possibility would be to disable zone reclaim for low node numbers.
> Enable it only if more than 4 nodes exist?

I suspect that patches 1/4 through 3/4 will cause the
system to behave better already, by only reclaiming
the easiest to reclaim pages from zone reclaim and
falling back after that - or am I overlooking something?
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-14 20:23 ` Rik van Riel
@ 2009-05-14 20:31 ` Christoph Lameter
0 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 20:31 UTC (permalink / raw)
To: Rik van Riel; +Cc: Robin Holt, KOSAKI Motohiro, LKML, linux-mm, Andrew Morton

On Thu, 14 May 2009, Rik van Riel wrote:

> I suspect that patches 1/4 through 3/4 will cause the
> system to behave better already, by only reclaiming
> the easiest to reclaim pages from zone reclaim and
> falling back after that - or am overlooking something?

Zone reclaim's default config has always only reclaimed the easiest
reclaimable pages. Manual configuration is necessary to reclaim other pages.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-14 20:05 ` Christoph Lameter
2009-05-14 20:23 ` Rik van Riel
@ 2009-05-15 1:02 ` KOSAKI Motohiro
2009-05-15 10:51 ` Robin Holt
2009-05-15 18:01 ` Christoph Lameter
1 sibling, 2 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-15 1:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

> Not having zone reclaim on a NUMA system often means that per node
> allocations will fall back. Optimized node local allocations become very
> difficult for the page allocator. If the latency penalties are not
> significant then this may not matter. The larger the system, the larger
> the NUMA latencies become.
>
> One possibility would be to disable zone reclaim for low node numbers.
> Enable it only if more than 4 nodes exist?

I think this idea works well on every machine and doesn't cause confusion
for HPC users.
How about this?

==============================
Subject: [PATCH] zone_reclaim is always 0 by default on small machine

Current Linux policy is that zone_reclaim_mode is enabled by default if the
machine has a large remote node distance, because until recently we could
assume that a large distance meant a large server.

Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
memory controller. IOW, they are seen as NUMA from the software's point of
view.

Some Core i7 machines have a large remote node distance, but zone_reclaim
doesn't fit desktops and small file servers. It causes a performance
regression.

Thus, zone_reclaim == 0 is better by default if the machine is small.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <holt@sgi.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2497,7 +2497,7 @@ static void build_zonelists(pg_data_t *p
 		 * If another node is sufficiently far away then it is better
 		 * to reclaim pages in a zone before going off node.
 		 */
-		if (distance > RECLAIM_DISTANCE)
+		if (nr_online_nodes >= 4 && distance > RECLAIM_DISTANCE)
 			zone_reclaim_mode = 1;
 
 		/*
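For comparison, the three default policies discussed so far can be written as plain predicates. This is a userspace sketch only: the function names are made up for illustration, and RECLAIM_DISTANCE is set to 20 because that is the generic value visible in the include/linux/topology.h context quoted earlier in the thread. Note that the suggestion was worded "more than 4 nodes" while the patch above tests nr_online_nodes >= 4; the sketch follows the patch.

```c
#include <stdbool.h>

/* Generic value shown in the topology.h context quoted in this thread. */
#define RECLAIM_DISTANCE 20

/* Behaviour before the series: any sufficiently distant node enables it. */
static bool reclaim_default_original(int nr_online_nodes, int distance)
{
	(void)nr_online_nodes;	/* node count is not considered */
	return distance > RECLAIM_DISTANCE;
}

/* Patch 4/4 as first posted: never enabled by default. */
static bool reclaim_default_always_off(int nr_online_nodes, int distance)
{
	(void)nr_online_nodes;
	(void)distance;
	return false;
}

/* Revised proposal: additionally require at least four online nodes. */
static bool reclaim_default_revised(int nr_online_nodes, int distance)
{
	return nr_online_nodes >= 4 && distance > RECLAIM_DISTANCE;
}
```

Under these assumptions a two-node machine with a remote distance of, say, 21 would get zone reclaim under the original policy but not under the revised one, while a four-node machine with the same distance would get it under both.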
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-15 1:02 ` KOSAKI Motohiro
@ 2009-05-15 10:51 ` Robin Holt
2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-15 18:01 ` Christoph Lameter
1 sibling, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-15 10:51 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Christoph Lameter, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

> Current linux policy is, zone_reclaim_mode is enabled by default if the machine
> has large remote node distance. it's because we could assume that large distance
> mean large server until recently.
>
> Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> memory controller. IOW it's seen as NUMA from software view.
>
> Some Core i7 machine has large remote node distance, but zone_reclaim don't
> fit desktop and small file server. it cause performance regression.
>
> Thus, zone_reclaim == 0 is better by default if the machine is small.

What if I had a node 0 with 32GB or 128GB of memory. In that case,
we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
128GB. I would suggest that zone reclaim would perform normally and
be beneficial.

You are unfairly classifying this as a size of machine problem when it is
really a problem with the underlying zone reclaim code being triggered
due to imbalanced node/zones, part of which is due to a single node
having multiple zones and those multiple zones setting up the conditions
for extremely aggressive reclaim. In other words, you are putting a
bandage in place to hide a problem on your particular hardware.

Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
Aren't 4 node Ci7 boxes soon to be readily available? How are your apps
different from my apps in that you are not impacted by node locality?
Are you being too insensitive to node locality? Conversely am I being
too sensitive?

All that said, I would not stop this from going in. I just think the
selection criteria is rather random. I think we know the condition we
are trying to avoid which is a small Normal zone on one node and a larger
Normal zone on another causing zone reclaim to be overly aggressive.
I don't know how to quantify "small" versus "large". I would suggest
that a node 0 with 16 or more GB should have zone reclaim on by default
as well. Can that be expressed in the selection criteria?

Thanks,
Robin Holt
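One way to read the selection criteria suggested above is sketched below. It is purely hypothetical: the message itself says it is unclear how to quantify "small" versus "large", so the max_ratio parameter is an assumption introduced here, and only the 16GB figure for node 0 comes from the message. The function names are illustrative, and the node distance test is left out to keep the size part visible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Threshold taken from the message: node 0 with 16 or more GB. */
#define NODE0_LARGE_GB 16

/*
 * Hypothetical check: is the largest Normal zone more than max_ratio times
 * the smallest one? normal_gb[i] is the Normal zone size of node i in GB.
 * max_ratio is an assumption; the thread does not propose a value.
 */
static bool normal_zones_imbalanced(const unsigned long *normal_gb,
				    size_t nr_nodes, unsigned long max_ratio)
{
	unsigned long min, max;
	size_t i;

	if (nr_nodes == 0)
		return false;
	min = max = normal_gb[0];
	for (i = 1; i < nr_nodes; i++) {
		if (normal_gb[i] < min)
			min = normal_gb[i];
		if (normal_gb[i] > max)
			max = normal_gb[i];
	}
	return max > min * max_ratio;
}

/*
 * Hypothetical size-based default: on when node 0 is large, otherwise on
 * only if the Normal zones are not badly imbalanced.
 */
static bool reclaim_default_by_size(unsigned long node0_total_gb,
				    const unsigned long *normal_gb,
				    size_t nr_nodes, unsigned long max_ratio)
{
	if (node0_total_gb >= NODE0_LARGE_GB)
		return true;
	return !normal_zones_imbalanced(normal_gb, nr_nodes, max_ratio);
}
```

With an assumed max_ratio of 2, the earlier 8G example (Normal zones of 1GB and 4GB, node 0 total 4GB) would leave zone reclaim off, while the 128GB-per-node case described above would turn it on.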
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-15 10:51 ` Robin Holt
@ 2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-20 14:00 ` Robin Holt
0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19 2:53 UTC (permalink / raw)
To: Robin Holt
Cc: kosaki.motohiro, Christoph Lameter, LKML, linux-mm, Andrew Morton, Rik van Riel

Hi

> > Current linux policy is, zone_reclaim_mode is enabled by default if the machine
> > has large remote node distance. it's because we could assume that large distance
> > mean large server until recently.
> >
> > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> > memory controller. IOW it's seen as NUMA from software view.
> >
> > Some Core i7 machine has large remote node distance, but zone_reclaim don't
> > fit desktop and small file server. it cause performance regression.
> >
> > Thus, zone_reclaim == 0 is better by default if the machine is small.
>
> What if I had a node 0 with 32GB or 128GB of memory. In that case,
> we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
> 128GB. I would suggest that zone reclaim would perform normally and
> be beneficial.
>
> You are unfairly classifying this as a size of machine problem when it is
> really a problem with the underlying zone reclaim code being triggered
> due to imbalanced node/zones, part of which is due to a single node
> having multiple zones and those multiple zones setting up the conditions
> for extremely aggressive reclaim. In other words, you are putting a
> bandage in place to hide a problem on your particular hardware.
>
> Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
> Aren't 4 node Ci7 boxes soon to be readily available? How are your apps
> different from my apps in that you are not impacted by node locality?
> Are you being too insensitive to node locality? Conversely am I being
> too sensitive?
>
> All that said, I would not stop this from going in. I just think the
> selection criteria is rather random. I think we know the condition we
> are trying to avoid which is a small Normal zone on one node and a larger
> Normal zone on another causing zone reclaim to be overly aggressive.
> I don't know how to quantify "small" versus "large". I would suggest
> that a node 0 with 16 or more GB should have zone reclaim on by default
> as well. Can that be expressed in the selection criteria.

I posted my opinion in a separate mail. Please see it.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 2:53 ` KOSAKI Motohiro
@ 2009-05-20 14:00 ` Robin Holt
2009-05-21 2:44 ` KOSAKI Motohiro
0 siblings, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-20 14:00 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Robin Holt, Christoph Lameter, LKML, linux-mm, Andrew Morton, Rik van Riel

On Tue, May 19, 2009 at 11:53:44AM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > > Current linux policy is, zone_reclaim_mode is enabled by default if the machine
> > > has large remote node distance. it's because we could assume that large distance
> > > mean large server until recently.
> > >
> > > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> > > memory controller. IOW it's seen as NUMA from software view.
> > >
> > > Some Core i7 machine has large remote node distance, but zone_reclaim don't
> > > fit desktop and small file server. it cause performance regression.
> > >
> > > Thus, zone_reclaim == 0 is better by default if the machine is small.
> >
> > What if I had a node 0 with 32GB or 128GB of memory. In that case,
> > we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
> > 128GB. I would suggest that zone reclaim would perform normally and
> > be beneficial.
> >
> > You are unfairly classifying this as a size of machine problem when it is
> > really a problem with the underlying zone reclaim code being triggered
> > due to imbalanced node/zones, part of which is due to a single node
> > having multiple zones and those multiple zones setting up the conditions
> > for extremely aggressive reclaim. In other words, you are putting a
> > bandage in place to hide a problem on your particular hardware.
> >
> > Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
> > Aren't 4 node Ci7 boxes soon to be readily available? How are your apps
> > different from my apps in that you are not impacted by node locality?
> > Are you being too insensitive to node locality? Conversely am I being
> > too sensitive?
> >
> > All that said, I would not stop this from going in. I just think the
> > selection criteria is rather random. I think we know the condition we
> > are trying to avoid which is a small Normal zone on one node and a larger
> > Normal zone on another causing zone reclaim to be overly aggressive.
> > I don't know how to quantify "small" versus "large". I would suggest
> > that a node 0 with 16 or more GB should have zone reclaim on by default
> > as well. Can that be expressed in the selection criteria.
>
> I post my opinion as another mail. please see it.

I don't think you addressed my actual question. How much of this is
a result of having a node where 1/4 of the memory is in the 'Normal'
zone and 3/4 is in the DMA32 zone? How much is due to the imbalance
between Node 0 'Normal' and Node 1 'Normal'? Shouldn't that type of
sanity check be used for turning on zone reclaim instead of some random
number of nodes. Even with 128 nodes and 256 cpus, I _NEVER_ see the
system swapping out before allocating off node so I can certainly not
reproduce the situation you are seeing.

The imbalance I have seen was when I had two small memory nodes and two
large memory nodes and then oversubscribed memory. In that situation,
I noticed that the apps on the small memory nodes were more frequently
impacted. This unfairness made sense to me and seemed perfectly
reasonable.

Thanks,
Robin
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default 2009-05-20 14:00 ` Robin Holt @ 2009-05-21 2:44 ` KOSAKI Motohiro 2009-05-21 13:31 ` Christoph Lameter 0 siblings, 1 reply; 45+ messages in thread From: KOSAKI Motohiro @ 2009-05-21 2:44 UTC (permalink / raw) To: Robin Holt Cc: kosaki.motohiro, Christoph Lameter, LKML, linux-mm, Andrew Morton, Rik van Riel > On Tue, May 19, 2009 at 11:53:44AM +0900, KOSAKI Motohiro wrote: > > Hi > > > > > > Current linux policy is, zone_reclaim_mode is enabled by default if the machine > > > > has large remote node distance. it's because we could assume that large distance > > > > mean large server until recently. > > > > > > > > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opeteron) have P2P transport > > > > memory controller. IOW it's seen as NUMA from software view. > > > > > > > > Some Core i7 machine has large remote node distance, but zone_reclaim don't > > > > fit desktop and small file server. it cause performance degression. > > > > > > > > Thus, zone_reclaim == 0 is better by default if the machine is small. > > > > > > What if I had a node 0 with 32GB or 128GB of memory. In that case, > > > we would have 3GB for DMA32, 125GB for Normal and then a node 1 with > > > 128GB. I would suggest that zone reclaim would perform normally and > > > be beneficial. > > > > > > You are unfairly classifying this as a size of machine problem when it is > > > really a problem with the underlying zone reclaim code being triggered > > > due to imbalanced node/zones, part of which is due to a single node > > > having multiple zones and those multiple zones setting up the conditions > > > for extremely agressive reclaim. In other words, you are putting a > > > bandage in place to hide a problem on your particular hardware. > > > > > > Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught? > > > Aren't 4 node Ci7 boxes soon to be readily available? 
> > > How are your apps
> > > different from my apps in that you are not impacted by node locality?
> > > Are you being too insensitive to node locality? Conversely am I being
> > > too sensitive?
> > >
> > > All that said, I would not stop this from going in. I just think the
> > > selection criteria is rather random. I think we know the condition we
> > > are trying to avoid which is a small Normal zone on one node and a larger
> > > Normal zone on another causing zone reclaim to be overly aggressive.
> > > I don't know how to quantify "small" versus "large". I would suggest
> > > that a node 0 with 16 or more GB should have zone reclaim on by default
> > > as well. Can that be expressed in the selection criteria.
> >
> > I post my opinion as another mail. please see it.
>
> I don't think you addressed my actual question. How much of this is
> a result of having a node where 1/4 of the memory is in the 'Normal'
> zone and 3/4 is in the DMA32 zone? How much is due to the imbalance
> between Node 0 'Normal' and Node 1 'Normal'? Shouldn't that type of
> sanity check be used for turning on zone reclaim instead of some random
> number of nodes.

I can't catch up your message. Can you post your patch?
Can you explain your sanity check?

Now, I decide to remove "nr_online_nodes >= 4" condition.
Apache regression is really non-sense.

> Even with 128 nodes and 256 cpus, I _NEVER_ see the
> system swapping out before allocating off node so I can certainly not
> reproduce the situation you are seeing.

hmhm. but I don't think we can assume hpc workload.

>
> The imbalance I have seen was when I had two small memory nodes and two
> large memory nodes and then oversubscribed memory. In that situation,
> I noticed that the apps on the small memory nodes were more frequently
> impacted. This unfairness made sense to me and seemed perfectly
> reasonable.

The node imbalancing is ok.
For example, typical linux init script makes many daemon process to node0,
we can't avoid it and it don't make strange behavior.

but zone imbalancing is bad. I don't want discuss all item again. but you can
google about inter zone reclaim issue instead.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-21 2:44 ` KOSAKI Motohiro
@ 2009-05-21 13:31 ` Christoph Lameter
2009-05-21 13:57 ` Robin Holt
2009-05-24 13:44 ` KOSAKI Motohiro
0 siblings, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-21 13:31 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

On Thu, 21 May 2009, KOSAKI Motohiro wrote:

> I can't catch up your message. Can you post your patch?
> Can you explain your sanity check?
>
> Now, I decide to remove "nr_online_nodes >= 4" condition.
> Apache regression is really non-sense.

Not sure what that means? Apache regresses with zone reclaim? My
measurements when we introduced zone reclaim showed just the opposite
because Apache would get node local memory and thus run faster. You can
screw this up of course if you load the system so high that the apache
processes are tossed around by the scheduler. Then the node local
allocation may be worse than round robin because all the pages allocated
by a process are now on one node if the scheduler moves the
process to a remote node then all accesses are penalized.

> > Even with 128 nodes and 256 cpus, I _NEVER_ see the
> > system swapping out before allocating off node so I can certainly not
> > reproduce the situation you are seeing.
>
> hmhm. but I don't think we can assume hpc workload.

System swapping due to zone reclaim? zone reclaim only reclaims unmapped
pages so it will not swap. Maybe some bug crept in in the recent changes?
Or you overrode the defaults for zone reclaim?
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-21 13:31 ` Christoph Lameter
@ 2009-05-21 13:57 ` Robin Holt
2009-05-24 13:44 ` KOSAKI Motohiro
1 sibling, 0 replies; 45+ messages in thread
From: Robin Holt @ 2009-05-21 13:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

On Thu, May 21, 2009 at 09:31:08AM -0400, Christoph Lameter wrote:
> On Thu, 21 May 2009, KOSAKI Motohiro wrote:
>
> > I can't catch up your message. Can you post your patch?
> > Can you explain your sanity check?
> >
> > Now, I decide to remove "nr_online_nodes >= 4" condition.
> > Apache regression is really non-sense.
>
> Not sure what that means? Apache regresses with zone reclaim? My
> measurements when we introduced zone reclaim showed just the opposite
> because Apache would get node local memory and thus run faster. You can
> screw this up of course if you load the system so high that the apache
> processes are tossed around by the scheduler. Then the node local
> allocation may be worse than round robin because all the pages allocated
> by a process are now on one node if the scheduler moves the
> process to a remote node then all accesses are penalized.

I think the point Kosaki is trying to make is that reclaim happens
really aggressively for processes on node 0 versus node 1. Maybe I am
clinging too strongly to one of the earlier posts, but that is what I
read between the lines. That frequent reclaim is impacting allocations
when he would rather they skip the reclaim and go off node.

Again, it sounds like he prefers tuning the default to what works best
for him. I don't too strongly disagree, as long as the default isn't
being changed capriciously. I have always expected that NUMA boxes had
reasons for preferring node locality. Maybe I misunderstand. Maybe Ci7
is special and does not have any impact for off socket references.
I would be surprised by that after reading the literature, but I have
not tested latency or bandwidth on one so I can not say.

Personally, it sounds like if I had a box configured as his is, I would
use a cpuset to restrict most memory hungry things from using cpus on
node 0 and leave that as the small 'junk processes' cpu. Maybe even
restrict things like cron etc to that corner of the system.

Thanks,
Robin
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-21 13:31 ` Christoph Lameter
2009-05-21 13:57 ` Robin Holt
@ 2009-05-24 13:44 ` KOSAKI Motohiro
1 sibling, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-24 13:44 UTC (permalink / raw)
To: Christoph Lameter
Cc: kosaki.motohiro, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

sorry I missed this mail

> > > Even with 128 nodes and 256 cpus, I _NEVER_ see the
> > > system swapping out before allocating off node so I can certainly not
> > > reproduce the situation you are seeing.
> >
> > hmhm. but I don't think we can assume hpc workload.
>
> System swapping due to zone reclaim? zone reclaim only reclaims unmapped
> pages so it will not swap. Maybe some bug crept in in the recent changes?
> Or you overrode the defaults for zone reclaim?

I guess he use zone_reclaim_mode=7 or similar.

However, I have to explain recent zone reclaim change. current zone reclaim is

1. zone reclaim can make high order reclaim (by hanns)
2. determine file-backed page by get_scan_ratio

it mean, high order allocation makes lumpy zone reclaim. and shrink_inactive_list()
don't care may_swap. then, zone_reclaim_mode=1 can makes swap-out if your driver
makes high order allocation request.
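
The values 1 and 7 discussed in this subthread are bitmasks. A minimal
userspace sketch of the intended meaning of each bit follows, assuming the
flag values described in Documentation/sysctl/vm.txt of that era (1 = zone
reclaim on, 2 = write out dirty pages, 4 = swap pages). The helper names are
hypothetical and are not kernel functions; the sketch shows only the
documented intent, not the may_swap behaviour reported above.

```c
#include <assert.h>

/* Assumed bit values for vm.zone_reclaim_mode (per vm.txt of that era). */
#define RECLAIM_ZONE  (1 << 0)  /* reclaim in the local zone before going off node */
#define RECLAIM_WRITE (1 << 1)  /* allow writing out dirty pages during reclaim */
#define RECLAIM_SWAP  (1 << 2)  /* allow swapping pages out during reclaim */

/* Hypothetical helper: test one flag in a mode value. */
static int mode_has(int mode, int flag)
{
    assert(mode >= 0);  /* a mode is a non-negative bitmask */
    return (mode & flag) != 0;
}

static int mode_reclaims_locally(int mode) { return mode_has(mode, RECLAIM_ZONE); }
static int mode_may_writepage(int mode)    { return mode_has(mode, RECLAIM_WRITE); }
static int mode_may_swap(int mode)         { return mode_has(mode, RECLAIM_SWAP); }
```

Under these assumptions, mode 1 is documented as neither writing nor swapping,
while mode 7 enables all three behaviours.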
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-15 1:02 ` KOSAKI Motohiro
2009-05-15 10:51 ` Robin Holt
@ 2009-05-15 18:01 ` Christoph Lameter
1 sibling, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-15 18:01 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

On Fri, 15 May 2009, KOSAKI Motohiro wrote:

> How about this?

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
2009-05-13 14:47 ` Rik van Riel
2009-05-13 15:22 ` Robin Holt
@ 2009-05-18 3:49 ` Wu Fengguang
2009-05-19 1:16 ` Zhang, Yanmin
2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-18 9:09 ` Wu Fengguang
3 siblings, 2 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18 3:49 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter, Zhang, Yanmin

On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] zone_reclaim_mode is always 0 by default
>
> Current linux policy is, if the machine has large remote node distance,
> zone_reclaim_mode is enabled by default because we've be able to assume to
> large distance mean large server until recently.
>
> Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> memory controller. IOW it's NUMA from software view.
>
> Some Core i7 machine has large remote node distance and zone_reclaim don't
> fit desktop and small file server. it cause performance degradation.

I can confirm this, Yanmin recently ran into exactly such a
regression, which was fixed by manually disabling the zone reclaim
mode. So I guess you can safely add an

Tested-by: "Zhang, Yanmin" <yanmin.zhang@intel.com>

> Thus, zone_reclaim == 0 is better by default. sorry, HPC guys.
> you need to turn zone_reclaim_mode on manually now.

I guess the borderline will continue to blur up. It will be more
dependent on workloads instead of physical NUMA capabilities.
So

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/page_alloc.c |    7 -------
>  1 file changed, 7 deletions(-)
>
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>  		int distance = node_distance(local_node, node);
>
>  		/*
> -		 * If another node is sufficiently far away then it is better
> -		 * to reclaim pages in a zone before going off node.
> -		 */
> -		if (distance > RECLAIM_DISTANCE)
> -			zone_reclaim_mode = 1;
> -
> -		/*
>  		 * We don't want to pressure a particular node.
>  		 * So adding penalty to the first node in same
>  		 * distance group to make it round-robin.
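
The hunk removed by the quoted patch is the boot-time heuristic that switched
zone reclaim on. A standalone sketch of that decision, outside the kernel, may
help when reading the rest of the thread. The flattened distance table and the
function name `default_zone_reclaim_mode` are hypothetical, and the threshold
of 20 is an assumption about the generic `RECLAIM_DISTANCE` default at the
time (architectures could override it).

```c
#include <assert.h>
#include <stddef.h>

#define RECLAIM_DISTANCE 20  /* assumed generic default; arch code may override */

/*
 * Hypothetical stand-in for the removed check in build_zonelists():
 * returns 1 if any node is farther than RECLAIM_DISTANCE from local_node,
 * otherwise 0. "distance" is a flattened nr_nodes x nr_nodes table.
 */
static int default_zone_reclaim_mode(const int *distance, size_t nr_nodes,
                                     size_t local_node)
{
    assert(local_node < nr_nodes);

    for (size_t node = 0; node < nr_nodes; node++) {
        if (distance[local_node * nr_nodes + node] > RECLAIM_DISTANCE)
            return 1;
    }
    return 0;
}
```

With a two-node table whose remote distance is exactly 20 the sketch leaves
zone reclaim off, and with a remote distance of 21 it turns it on, which is
the kind of small firmware-reported difference the thread is concerned about.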
* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-18 3:49 ` Wu Fengguang
@ 2009-05-19 1:16 ` Zhang, Yanmin
2009-05-19 2:53 ` KOSAKI Motohiro
1 sibling, 0 replies; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19 1:16 UTC (permalink / raw)
To: Wu, Fengguang, KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>>-----Original Message-----
>>From: Wu, Fengguang
>>Sent: May 18, 2009 11:49
>>To: KOSAKI Motohiro
>>Cc: LKML; linux-mm; Andrew Morton; Rik van Riel; Christoph Lameter; Zhang,
>>Yanmin
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>>
>>> Current linux policy is, if the machine has large remote node distance,
>>> zone_reclaim_mode is enabled by default because we've be able to assume to
>>> large distance mean large server until recently.
>>>
>>> Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
>>> memory controller. IOW it's NUMA from software view.
>>>
>>> Some Core i7 machine has large remote node distance and zone_reclaim don't
>>> fit desktop and small file server. it cause performance degradation.
>>
>>I can confirm this, Yanmin recently ran into exactly such a
>>regression, which was fixed by manually disabling the zone reclaim
>>mode. So I guess you can safely add an
[YM] Fengguang told the truth. One Nehalem machine has 12GB memory, but there is
always 2GB free although applications access lots of files. Eventually we
located the root cause as zone_reclaim_mode=1. Acked.

>>
>>Tested-by: "Zhang, Yanmin" <yanmin.zhang@intel.com>
>>
>>> Thus, zone_reclaim == 0 is better by default. sorry, HPC guys.
>>> you need to turn zone_reclaim_mode on manually now.
>>
>>I guess the borderline will continue to blur up. It will be more
>>dependent on workloads instead of physical NUMA capabilities. So
>>
>>Acked-by: Wu Fengguang <fengguang.wu@intel.com>
>>
>>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> Cc: Christoph Lameter <cl@linux-foundation.org>
>>> Cc: Rik van Riel <riel@redhat.com>
>>> ---
>>>  mm/page_alloc.c |    7 -------
>>>  1 file changed, 7 deletions(-)
>>>
>>> Index: b/mm/page_alloc.c
>>> ===================================================================
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>>>  		int distance = node_distance(local_node, node);
>>>
>>>  		/*
>>> -		 * If another node is sufficiently far away then it is better
>>> -		 * to reclaim pages in a zone before going off node.
>>> -		 */
>>> -		if (distance > RECLAIM_DISTANCE)
>>> -			zone_reclaim_mode = 1;
>>> -
>>> -		/*
>>>  		 * We don't want to pressure a particular node.
>>>  		 * So adding penalty to the first node in same
>>>  		 * distance group to make it round-robin.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-18 3:49 ` Wu Fengguang
2009-05-19 1:16 ` Zhang, Yanmin
@ 2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-19 2:57 ` KOSAKI Motohiro
2009-05-19 3:38 ` Zhang, Yanmin
1 sibling, 2 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19 2:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter, Zhang, Yanmin

> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
> >
> > Current linux policy is, if the machine has large remote node distance,
> > zone_reclaim_mode is enabled by default because we've be able to assume to
> > large distance mean large server until recently.
> >
> > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opteron) have P2P transport
> > memory controller. IOW it's NUMA from software view.
> >
> > Some Core i7 machine has large remote node distance and zone_reclaim don't
> > fit desktop and small file server. it cause performance degradation.
>
> I can confirm this, Yanmin recently ran into exactly such a
> regression, which was fixed by manually disabling the zone reclaim
> mode. So I guess you can safely add an
>
> Tested-by: "Zhang, Yanmin" <yanmin.zhang@intel.com>
>
> > Thus, zone_reclaim == 0 is better by default. sorry, HPC guys.
> > you need to turn zone_reclaim_mode on manually now.
>
> I guess the borderline will continue to blur up. It will be more
> dependent on workloads instead of physical NUMA capabilities. So
>
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>

ok, I would explain zone reclaim design and performance tendency.

Firstly, we can make classification of linux eco system, roughly.
 - HPC
 - high-end server
 - volume server
 - desktop
 - embedded

it is separated by typical workload mainly.

Secondly, zone_reclaim mean "I strongly dislike remote node access than disk access".
it is very fitting on HPC workload.
it because
 - HPC workload typically make the number of the same as cpus of processes (or thread).
   IOW, the workload typically use memory equally each node.
 - HPC workload is typically CPU bounded job. CPU migration is rare.
 - HPC workload is typically long lived. (possible >1 year)
   IOW, remote node allocation makes _very_ _very_ much remote node access.

but zone_reclaim don't fit typical server workload.
 - server workload often make thread pool and some thread is sleeping until
   a request received.
   IOW, when thread waking-up, the thread might move another cpu.
   node distance tendency don't make sense on weak cpu locality workload.

Plus, disk-cache is the file-server's identity. we shouldn't think it's not important.
Plus, DB software can consume almost system memory and (In general) RDB data makes
harder to split equally as hpc.

desktop workload is special. desktop people can run various workload beyond
our assumption. So, we shouldn't have any workload assumption to desktop people.
However, AFAIK almost desktop software use memory as UMA.

we don't need to care embedded. it is typically UMA.


IOW, the benefit of zone reclaim depend on "strong cpu locality" and
"workload is cpu bounded" and "thread is long lived".
but many workload don't fill above requirement. IOW, zone reclaim is
workload depended feature (as Wu said).


In general, the feature of workload depended don't fit default option.
we can't know end-user run what workload anyway.

Fortunately (or Unfortunately), typical workload and machine size had
significant mutuality.
Thus, the current default setting calculation had worked well in past days.

Now, it was broken. What should we do?


Yanmin, We know 99% linux people use intel cpu and you are one of
most hard repeated testing guy in lkml and you have much test.
May I ask your tested machine and benchmark?

if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency workload,
we can drop our afraid and we would prioritize your opinion, of course.
thanks.
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 2:53 ` KOSAKI Motohiro
@ 2009-05-19 2:57 ` KOSAKI Motohiro
2009-05-19 3:38 ` Zhang, Yanmin
1 sibling, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19 2:57 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Wu Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter, Zhang, Yanmin

nit fix.

> In general, the feature of workload depended don't fit default option.
> we can't know end-user run what workload anyway.
>
> Fortunately (or Unfortunately), typical workload and machine size had

typical workload and machine size and remote node distance

> significant mutuality.
> Thus, the current default setting calculation had worked well in past days.
>
> Now, it was broken. What should we do?
>
>
>
> Yanmin, We know 99% linux people use intel cpu and you are one of
> most hard repeated testing guy in lkml and you have much test.
> May I ask your tested machine and benchmark?
>
> if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency workload,
> we can drop our afraid and we would prioritize your opinion, of course.
>
> thanks.
>
>
* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-19 2:57 ` KOSAKI Motohiro
@ 2009-05-19 3:38 ` Zhang, Yanmin
2009-05-19 4:30 ` KOSAKI Motohiro
1 sibling, 1 reply; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19 3:38 UTC (permalink / raw)
To: KOSAKI Motohiro, Wu, Fengguang
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>Sent: May 19, 2009 10:54
>>To: Wu, Fengguang
>>Cc: kosaki.motohiro@jp.fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
>>Riel; Christoph Lameter; Zhang, Yanmin
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>> >
>>> > Current linux policy is, if the machine has large remote node distance,
>>> > zone_reclaim_mode is enabled by default because we've be able to assume
>>
>>ok, I would explain zone reclaim design and performance tendency.
>>
>>Firstly, we can make classification of linux eco system, roughly.
>> - HPC
>> - high-end server
>> - volume server
>> - desktop
>> - embedded
>>
>>it is separated by typical workload mainly.
>>
>>Secondly, zone_reclaim mean "I strongly dislike remote node access than
>>disk access".
>>it is very fitting on HPC workload. it because
>> - HPC workload typically make the number of the same as cpus of processes
>>(or thread).
>>   IOW, the workload typically use memory equally each node.
>> - HPC workload is typically CPU bounded job. CPU migration is rare.
>> - HPC workload is typically long lived. (possible >1 year)
>>   IOW, remote node allocation makes _very_ _very_ much remote node access.
>>
>>but zone_reclaim don't fit typical server workload.
>> - server workload often make thread pool and some thread is sleeping until
>>   a request received.
>>   IOW, when thread waking-up, the thread might move another cpu.
>>   node distance tendency don't make sense on weak cpu locality workload.
>>
>>Plus, disk-cache is the file-server's identity. we shouldn't think it's not
>>important.
>>Plus, DB software can consume almost system memory and (In general) RDB data
>>makes
>>harder to split equally as hpc.
>>
>>desktop workload is special. desktop people can run various workload beyond
>>our assumption. So, we shouldn't have any workload assumption to desktop
>>people.
>>However, AFAIK almost desktop software use memory as UMA.
>>
>>we don't need to care embedded. it is typically UMA.
>>
>>
>>IOW, the benefit of zone reclaim depend on "strong cpu locality" and
>>"workload is cpu bounded" and "thread is long lived".
>>but many workload don't fill above requirement. IOW, zone reclaim is
>>workload depended feature (as Wu said).
>>
>>
>>In general, the feature of workload depended don't fit default option.
>>we can't know end-user run what workload anyway.
>>
>>Fortunately (or Unfortunately), typical workload and machine size had
>>significant mutuality.
>>Thus, the current default setting calculation had worked well in past days.
[YM] Your analysis is clear and deep.

>>
>>Now, it was broken. What should we do?
>>Yanmin, We know 99% linux people use intel cpu and you are one of
>>most hard repeated testing
[YM] It's very easy to reproduce them on my machines. :) Sometimes, because the
issues only exist on machines with lots of cpu while other community developers
have no such environments.

>>guy in lkml and you have much test.
>>May I ask your tested machine and benchmark?
[YM] Usually I started lots of benchmark testing against the latest kernel, but
as for this issue, it's reported by a customer firstly. The customer runs apache
on Nehalem machines to access lots of files. So the issue is an example of file
server.
BTW, I found many test cases of fio have big drop after I upgraded BIOS of one
Nehalem machine. By checking vmstat data, I found almost a half memory is always
free. It's also related to zone_reclaim_mode because new BIOS changes the node
distance to a large value. I use numactl --interleave=all to work around the
problem temporarily.

I have no HPC environment.

>>
>>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency
>>workload,
>> we can drop our afraid and we would prioritize your opinion, of course.
So it seems only file servers have the issue currently.

Yanmin
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 3:38 ` Zhang, Yanmin
@ 2009-05-19 4:30 ` KOSAKI Motohiro
2009-05-19 5:06 ` Zhang, Yanmin
0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19 4:30 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: kosaki.motohiro, Wu, Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

> >>-----Original Message-----
> >>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
> >>Sent: May 19, 2009 10:54
> >>To: Wu, Fengguang
> >>Cc: kosaki.motohiro@jp.fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
> >>Riel; Christoph Lameter; Zhang, Yanmin
> >>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
> >>
> >>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> >>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
> >>> >
> >>> > Current linux policy is, if the machine has large remote node distance,
> >>> > zone_reclaim_mode is enabled by default because we've be able to assume
> >>
> >>ok, I would explain zone reclaim design and performance tendency.
> >>
> >>Firstly, we can make classification of linux eco system, roughly.
> >> - HPC
> >> - high-end server
> >> - volume server
> >> - desktop
> >> - embedded
> >>
> >>it is separated by typical workload mainly.
> >>
> >>Secondly, zone_reclaim mean "I strongly dislike remote node access than
> >>disk access".
> >>it is very fitting on HPC workload. it because
> >> - HPC workload typically make the number of the same as cpus of processes
> >>(or thread).
> >>   IOW, the workload typically use memory equally each node.
> >> - HPC workload is typically CPU bounded job. CPU migration is rare.
> >> - HPC workload is typically long lived. (possible >1 year)
> >>   IOW, remote node allocation makes _very_ _very_ much remote node access.
> >>
> >>but zone_reclaim don't fit typical server workload.
> >> - server workload often make thread pool and some thread is sleeping until
> >>   a request received.
> >>   IOW, when thread waking-up, the thread might move another cpu.
> >>   node distance tendency don't make sense on weak cpu locality workload.
> >>
> >>Plus, disk-cache is the file-server's identity. we shouldn't think it's not
> >>important.
> >>Plus, DB software can consume almost system memory and (In general) RDB data
> >>makes
> >>harder to split equally as hpc.
> >>
> >>desktop workload is special. desktop people can run various workload beyond
> >>our assumption. So, we shouldn't have any workload assumption to desktop
> >>people.
> >>However, AFAIK almost desktop software use memory as UMA.
> >>
> >>we don't need to care embedded. it is typically UMA.
> >>
> >>
> >>IOW, the benefit of zone reclaim depend on "strong cpu locality" and
> >>"workload is cpu bounded" and "thread is long lived".
> >>but many workload don't fill above requirement. IOW, zone reclaim is
> >>workload depended feature (as Wu said).
> >>
> >>
> >>In general, the feature of workload depended don't fit default option.
> >>we can't know end-user run what workload anyway.
> >>
> >>Fortunately (or Unfortunately), typical workload and machine size had
> >>significant mutuality.
> >>Thus, the current default setting calculation had worked well in past days.
> [YM] Your analysis is clear and deep.

Thanks!


> >>Now, it was broken. What should we do?
> >>Yanmin, We know 99% linux people use intel cpu and you are one of
> >>most hard repeated testing
> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because the
> issues only exist on machines with lots of cpu while other community developers
> have no such environments.
>
>
> guy in lkml and you have much test.
> >>May I ask your tested machine and benchmark?
> [YM] Usually I started lots of benchmark testing against the latest kernel, but
> as for this issue, it's reported by a customer firstly.
> The customer runs apache
> on Nehalem machines to access lots of files. So the issue is an example of file
> server.

hmmm.
I'm surprised this report. I didn't know this problem. oh..

Actually, I don't think apache is only file server.
apache is one of killer application in linux. it run on very widely organization.
you think large machine don't run apache? I don't think so.



> BTW, I found many test cases of fio have big drop after I upgraded BIOS of one
> Nehalem machine. By checking vmstat data, I found almost a half memory is always
> free. It's also related to zone_reclaim_mode because new BIOS changes the node
> distance to a large value. I use numactl --interleave=all to work around the
> problem temporarily.
>
> I have no HPC environment.

Yeah, that's ok. I and Christoph have. My worries is my unknown workload become regression.
so, May I assume you run your benchmark both zone reclaim 0 and 1 and you
haven't seen regression by non-zone reclaim mode?
if so, it encourage very much to me.

if zone reclaim mode disabling don't have regression, I'll pushing to
remove default zone reclaim mode completely again.


> >>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency
> >>workload,
> >> we can drop our afraid and we would prioritize your opinion, of course.
> So it seems only file servers have the issue currently.
* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 4:30 ` KOSAKI Motohiro
@ 2009-05-19 5:06 ` Zhang, Yanmin
2009-05-19 7:09 ` KOSAKI Motohiro
0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19 5:06 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Wu, Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>Sent: May 19, 2009 12:31
>>To: Zhang, Yanmin
>>Cc: kosaki.motohiro@jp.fujitsu.com; Wu, Fengguang; LKML; linux-mm; Andrew
>>Morton; Rik van Riel; Christoph Lameter
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>> >>-----Original Message-----
>>> >>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>> >>Sent: May 19, 2009 10:54
>>> >>To: Wu, Fengguang
>>> >>Cc: kosaki.motohiro@jp.fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
>>> >>Riel; Christoph Lameter; Zhang, Yanmin
>>> >>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>> >>
>>> >>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> >>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>> >>> >
>>> >>> > Current linux policy is, if the machine has large remote node distance,
>>> >>> > zone_reclaim_mode is enabled by default because we've be able to assume
>>> >>Fortunately (or Unfortunately), typical workload and machine size had
>>> >>significant mutuality.
>>> >>Thus, the current default setting calculation had worked well in past days.
>>> [YM] Your analysis is clear and deep.
>>
>>Thanks!
>>
>>
>>> >>Now, it was broken. What should we do?
>>> >>Yanmin, We know 99% linux people use intel cpu and you are one of
>>> >>most hard repeated testing
>>> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because the
>>> issues only exist on machines with lots of cpu while other community developers
>>> have no such environments.
>>> >>> >>> guy in lkml and you have much test. >>> >>May I ask your tested machine and benchmark? >>> [YM] Usually I started lots of benchmark testing against the latest kernel, >>but >>> as for this issue, it's reported by a customer firstly. The customer runs >>apache >>> on Nehalem machines to access lots of files. So the issue is an example of >>file >>> server. >> >>hmmm. >>I'm surprised this report. I didn't know this problem. oh.. [YM] Did you run file server workload on such NUMA machine with zone_reclaim_mode=1? If all nodes have the same memory, the behavior is obvious. >> >>Actually, I don't think apache is only file server. >>apache is one of killer application in linux. it run on very widely >>organization. [YM] I know that. Apache could support document, ecommerce, and lots of other usage models. What I mean is one of customers hit it with their workload. >>you think large machine don't run apache? I don't think so. >> >> >> >>> BTW, I found many test cases of fio have big drop after I upgraded BIOS of >>one >>> Nehalem machine. By checking vmstat data, I found almost a half memory is >>always free. It's also related to zone_reclaim_mode because new BIOS changes >>the node >>> distance to a large value. I use numactl --interleave=all to walkaround the >>problem temporarily. >>> >>> I have no HPC environment. >> >>Yeah, that's ok. I and cristoph have. My worries is my unknown workload become >>regression. >>so, May I assume you run your benchmark both zonre reclaim 0 and 1 and you >>haven't seen regression by non-zone reclaim mode? [YM] what is non-zone reclaim mode? When zone_reclaim_mode=0? I didn't do that intentionally. Currently I just make sure FIO has a big drop when zone_reclaim_mode=1. I might test it with other benchmarks on 2 Nehalem machines. >>if so, it encourage very much to me. >> >>if zone reclaim mode disabling don't have regression, I'll pushing to >>remove default zone reclaim mode completely again. 
[YM] I run lots of benchmarks, but it doesn't mean I run all benchmarks, especially no HPC. >> >> >>> >>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency >>> >>workload, >>> >> we can drop our afraid and we would prioritize your opinion, of cource. >>> So it seems only file servers have the issue currently. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 5:06 ` Zhang, Yanmin
@ 2009-05-19 7:09 ` KOSAKI Motohiro
2009-05-19 7:15 ` Zhang, Yanmin
0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19 7:09 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: kosaki.motohiro, Wu, Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

Hi

> >>> >>Now, it is broken. What should we do?
> >>> >>Yanmin, We know 99% linux people use intel cpu and you are one of
> >>> >>most hard repeated testing
> >>> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because the
> >>> issues only exist on machines with lots of cpu while other community developers
> >>> have no such environments.
> >>>
> >>>
> >>> guy in lkml and you have much test.
> >>> >>May I ask which machines and benchmarks you tested?
> >>> [YM] Usually I start lots of benchmark tests against the latest kernel, but
> >>> this issue was first reported by a customer. The customer runs apache
> >>> on Nehalem machines to access lots of files. So the issue is an example of a file
> >>> server.
> >>
> >>Hmmm.
> >>I'm surprised by this report. I didn't know about this problem. Oh..
> [YM] Did you run a file server workload on such a NUMA machine with
> zone_reclaim_mode=1? If all nodes have the same amount of memory, the behavior is
> obvious.

I missed your point.
I agree the file server case is obvious, but I don't think anybody opposes this.

> >>Actually, I don't think apache is only a file server.
> >>Apache is one of the killer applications on linux. It runs in a very wide range of
> >>organizations.
> [YM] I know that. Apache could support document, ecommerce, and lots of other
> usage models. What I mean is that one customer hit it with their
> workload.

hmhm, ok.

> >>You think large machines don't run apache? I don't think so.
> >>
> >>
> >>
> >>> BTW, I found that many fio test cases show a big drop after I upgraded the BIOS of one
> >>> Nehalem machine. By checking vmstat data, I found that almost half of the memory is
> >>> always free. It's also related to zone_reclaim_mode because the new BIOS changes the node
> >>> distance to a large value. I use numactl --interleave=all to work around the
> >>> problem temporarily.
> >>>
> >>> I have no HPC environment.
> >>
> >>Yeah, that's ok. Christoph and I have one. My worry is that a workload unknown to me may
> >>regress.
> >>So, may I assume you ran your benchmark with both zone reclaim 0 and 1 and you
> >>haven't seen a regression in non-zone reclaim mode?
> [YM] What is non-zone reclaim mode? When zone_reclaim_mode=0?
> I didn't do that intentionally. Currently I have only confirmed that FIO shows a big drop
> when zone_reclaim_mode=1. I might test it with other benchmarks on 2 Nehalem machines.

May I ask what FIO is?
File IO?

> >>If so, it encourages me very much.
> >>
> >>If disabling zone reclaim mode doesn't cause a regression, I'll push to
> >>remove the default zone reclaim mode completely again.
> [YM] I run lots of benchmarks, but that doesn't mean I run all benchmarks; in particular,
> none for HPC.

Of course. Nobody can run every benchmark in the world :)

> >>> >>if workloads favoring zone_reclaim=0 are much more common than workloads favoring
> >>> >>zone_reclaim=1,
> >>> >> we can drop our fear and we would prioritize your opinion, of course.
> >>> So it seems only file servers have the issue currently.
>

^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-19 7:09 ` KOSAKI Motohiro
@ 2009-05-19 7:15 ` Zhang, Yanmin
0 siblings, 0 replies; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19 7:15 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Wu, Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>Sent: 2009-05-19 15:10
>>To: Zhang, Yanmin
>>Cc: kosaki.motohiro@jp.fujitsu.com; Wu, Fengguang; LKML; linux-mm; Andrew
>>Morton; Rik van Riel; Christoph Lameter
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>Hi
>>
>>> >>> >>Now, it is broken. What should we do?
>>> >>> >>Yanmin, We know 99% linux people use intel cpu and you are one of
>>> >>> >>most hard repeated testing
>>> >>> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because the
>>> >>> issues only exist on machines with lots of cpu while other community developers
>>> >>> have no such environments.
>>> >>>
>>> >>>
>>> >>> guy in lkml and you have much test.
>>> >>> >>May I ask which machines and benchmarks you tested?
>>> >>> [YM] Usually I start lots of benchmark tests against the latest
>>> >>
>>> >>Yeah, that's ok. Christoph and I have one. My worry is that a workload unknown to me may
>>> >>regress.
>>> >>So, may I assume you ran your benchmark with both zone reclaim 0 and 1 and you
>>> >>haven't seen a regression in non-zone reclaim mode?
>>> [YM] What is non-zone reclaim mode? When zone_reclaim_mode=0?
>>> I didn't do that intentionally. Currently I have only confirmed that FIO shows a big drop
>>> when zone_reclaim_mode=1. I might test it with other benchmarks on 2 Nehalem
>>> machines.
>>
>>May I ask what FIO is?
>>File IO?
[YM] fio is a tool to test I/O. Jens Axboe is the author.

^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
` (2 preceding siblings ...)
2009-05-18 3:49 ` Wu Fengguang
@ 2009-05-18 9:09 ` Wu Fengguang
3 siblings, 0 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18 9:09 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>  		int distance = node_distance(local_node, node);
>
>  		/*
> -		 * If another node is sufficiently far away then it is better
> -		 * to reclaim pages in a zone before going off node.
> -		 */
> -		if (distance > RECLAIM_DISTANCE)
> -			zone_reclaim_mode = 1;
> -

Also remove the RECLAIM_DISTANCE definitions in include/linux/topology.h
and arch/ia64/include/asm/topology.h?

^ permalink raw reply [flat|nested] 45+ messages in thread
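For readers following the hunk quoted above, here is a minimal userspace sketch of the heuristic it removes: zone reclaim is switched on as soon as any node is farther from the local node than a threshold. This is not the kernel code. The helper name sketch_default_zone_reclaim_mode and the macro SKETCH_RECLAIM_DISTANCE are hypothetical, and the threshold value 20 is an assumption about the generic RECLAIM_DISTANCE of that era (architectures could define their own).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for the generic RECLAIM_DISTANCE; not taken from the thread. */
#define SKETCH_RECLAIM_DISTANCE 20

/*
 * Hypothetical helper mirroring the removed check
 *     if (distance > RECLAIM_DISTANCE) zone_reclaim_mode = 1;
 * distances[] holds the node distance from the local node to each node.
 * Returns 1 if any node lies beyond the threshold, otherwise 0.
 */
static int sketch_default_zone_reclaim_mode(const int *distances, size_t n,
                                            int threshold)
{
    size_t i;

    /* An empty table is allowed; a NULL table with entries is a caller bug. */
    assert(distances != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (distances[i] > threshold)
            return 1;   /* a sufficiently far node turns zone reclaim on */
    }
    return 0;           /* all nodes are near: zone reclaim stays off */
}
```

With illustrative distances such as {10, 20} the helper returns 0, while {10, 21} returns 1. That is one way a firmware update which only raises the reported node distance, as described for the Nehalem BIOS upgrade earlier in the thread, could flip the default without any kernel change.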
end of thread, other threads:[~2009-05-24 13:43 UTC | newest]

Thread overview: 45+ messages
2009-05-13 3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
2009-05-13 3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
2009-05-13 13:31 ` Rik van Riel
2009-05-14 19:52 ` Christoph Lameter
2009-05-18 3:15 ` Wu Fengguang
2009-05-18 3:35 ` KOSAKI Motohiro
2009-05-18 3:53 ` Wu Fengguang
2009-05-19 1:11 ` KOSAKI Motohiro
2009-05-13 3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
2009-05-13 13:35 ` Rik van Riel
2009-05-14 19:57 ` Christoph Lameter
2009-05-18 3:33 ` Wu Fengguang
2009-05-13 3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
2009-05-13 11:26 ` Johannes Weiner
2009-05-13 14:43 ` Rik van Riel
2009-05-14 19:59 ` Christoph Lameter
2009-05-18 3:35 ` Wu Fengguang
2009-05-13 3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
2009-05-13 14:47 ` Rik van Riel
2009-05-14 8:20 ` KOSAKI Motohiro
2009-05-14 11:48 ` Robin Holt
2009-05-14 12:02 ` KOSAKI Motohiro
2009-05-13 15:22 ` Robin Holt
2009-05-14 20:05 ` Christoph Lameter
2009-05-14 20:23 ` Rik van Riel
2009-05-14 20:31 ` Christoph Lameter
2009-05-15 1:02 ` KOSAKI Motohiro
2009-05-15 10:51 ` Robin Holt
2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-20 14:00 ` Robin Holt
2009-05-21 2:44 ` KOSAKI Motohiro
2009-05-21 13:31 ` Christoph Lameter
2009-05-21 13:57 ` Robin Holt
2009-05-24 13:44 ` KOSAKI Motohiro
2009-05-15 18:01 ` Christoph Lameter
2009-05-18 3:49 ` Wu Fengguang
2009-05-19 1:16 ` Zhang, Yanmin
2009-05-19 2:53 ` KOSAKI Motohiro
2009-05-19 2:57 ` KOSAKI Motohiro
2009-05-19 3:38 ` Zhang, Yanmin
2009-05-19 4:30 ` KOSAKI Motohiro
2009-05-19 5:06 ` Zhang, Yanmin
2009-05-19 7:09 ` KOSAKI Motohiro
2009-05-19 7:15 ` Zhang, Yanmin
2009-05-18 9:09 ` Wu Fengguang