Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
@ 2010-09-13  3:39 Robert Mueller
  2010-09-16 10:01 ` KOSAKI Motohiro
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Mueller @ 2010-09-13  3:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: KOSAKI Motohiro, Bron Gondwana

So over the last couple of weeks, I've noticed that our shiny new IMAP
servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
been performing as well as expected, and there were some big oddities.
Namely two things stuck out:

1. There was free memory. There's 20T of data on these machines. The
   kernel should have used lots of memory for caching, but for some
   reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G
2. The machine has an SSD for very hot data. In total, there's about 16G
   of data on the SSD. Almost all of that 16G of data should end up
   being cached, so there should be little reading from the SSDs at all.
   Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a
   sign that caching wasn't working.

After a bunch of googling, I found this thread.

http://lkml.org/lkml/2009/5/12/586

It appears that patch never went anywhere, and zone_reclaim_mode is
still defaulting to 1 on our pretty standard file/email/web server type
machine with a NUMA kernel.

By changing it to 0, we saw an immediate massive change in caching
behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads
from the SSD dropped to 100/s instead of 2000/s.

Having very little knowledge of what this actually does, I'd just
like to point out that from a users point of view, it's really
annoying for your machine to be crippled by a default kernel setting
that's pretty obscure.

I don't think our usage scenario of serving lots of files is that
uncommon, every file server/email server/web server will be doing pretty
much that and expecting a large part of their memory to be used as a
cache, which clearly isn't what actually happens.

Rob
Rob Mueller
robm@fastmail.fm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-13  3:39 Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers Robert Mueller
@ 2010-09-16 10:01 ` KOSAKI Motohiro
  2010-09-16 17:06   ` Christoph Lameter
                     ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16 10:01 UTC (permalink / raw)
  To: robm
  Cc: kosaki.motohiro, linux-kernel, Bron Gondwana, linux-mm,
	Christoph Lameter, Mel Gorman

Cc to linux-mm and hpc guys. and intetionally full quote.


> So over the last couple of weeks, I've noticed that our shiny new IMAP
> servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
> been performing as well as expected, and there were some big oddities.
> Namely two things stuck out:
> 
> 1. There was free memory. There's 20T of data on these machines. The
>    kernel should have used lots of memory for caching, but for some
>    reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G
> 2. The machine has an SSD for very hot data. In total, there's about 16G
>    of data on the SSD. Almost all of that 16G of data should end up
>    being cached, so there should be little reading from the SSDs at all.
>    Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a
>    sign that caching wasn't working.
> 
> After a bunch of googling, I found this thread.
> 
> http://lkml.org/lkml/2009/5/12/586
> 
> It appears that patch never went anywhere, and zone_reclaim_mode is
> still defaulting to 1 on our pretty standard file/email/web server type
> machine with a NUMA kernel.
> 
> By changing it to 0, we saw an immediate massive change in caching
> behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads
> from the SSD dropped to 100/s instead of 2000/s.
> 
> Having very little knowledge of what this actually does, I'd just
> like to point out that from a users point of view, it's really
> annoying for your machine to be crippled by a default kernel setting
> that's pretty obscure.
> 
> I don't think our usage scenario of serving lots of files is that
> uncommon, every file server/email server/web server will be doing pretty
> much that and expecting a large part of their memory to be used as a
> cache, which clearly isn't what actually happens.
> 
> Rob
> Rob Mueller
> robm@fastmail.fm
> 

Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
current zone_reclaim_mode doesn't fit file/web server usecase ;-)

So, I've created new proof concept patch. This doesn't disable zone_reclaim
at all. Instead, distinguish for file cache and for anon allocation and
only file cache doesn't use zone-reclaim.

That said, high-end hpc user often turn on cpuset.memory_spread_page and
they avoid this issue. But, why don't we consider avoid it by default?


Rob, I wonder if following patch help you. Could you please try it?


Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default

---
Need to removed debbuging piece.

 Documentation/sysctl/vm.txt |    7 +++----
 fs/inode.c                  |    2 +-
 include/linux/gfp.h         |    9 +++++++--
 include/linux/mmzone.h      |    2 ++
 include/linux/swap.h        |    6 ++++++
 mm/filemap.c                |    1 +
 mm/page_alloc.c             |    8 +++++++-
 mm/vmscan.c                 |    7 ++-----
 mm/vmstat.c                 |    2 ++
 9 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..4be569e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -671,16 +671,15 @@ This is value ORed together of
 1	= Zone reclaim on
 2	= Zone reclaim writes dirty pages out
 4	= Zone reclaim swaps pages
+8	= Zone reclaim for file cache on
 
 zone_reclaim_mode is set during bootup to 1 if it is determined that pages
 from remote zones will cause a measurable performance reduction. The
 page allocator will then reclaim easily reusable pages (those page
 cache pages that are currently not used) before allocating off node pages.
 
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
-data locality.
+By default, for file cache allocation doesn't use zone reclaim. But
+It can be turned on manually.
 
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes. Zone
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..02a51b1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -166,7 +166,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
 	mapping->flags = 0;
-	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
+	mapping_set_gfp_mask(mapping, GFP_FILE_CACHE);
 	mapping->assoc_mapping = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 975609c..f263b1f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,10 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
+
+#define GFP_FILE_CACHE	(GFP_HIGHUSER | __GFP_RECLAIMABLE | __GFP_MOVABLE)
+
+
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 
 #ifdef CONFIG_NUMA
@@ -120,11 +124,12 @@ struct vm_area_struct;
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
-	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
-
 	if (unlikely(page_group_by_mobility_disabled))
 		return MIGRATE_UNMOVABLE;
 
+	if ((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK)
+		gfp_flags &= ~__GFP_RECLAIMABLE;
+
 	/* Group based on mobility */
 	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
 		((gfp_flags & __GFP_RECLAIMABLE) != 0);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..2eead52 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,8 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	NR_ZONE_CACHE_AVOID,
+	NR_ZONE_RECLAIM,
 	NR_VM_ZONE_STAT_ITEMS };
 
 /*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fee51a..487bc3b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -65,6 +65,12 @@ static inline int current_is_kswapd(void)
 #define MAX_SWAPFILES \
 	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
+#define RECLAIM_OFF 0
+#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
+#define RECLAIM_CACHE (1<<3)	/* Reclaim even though file cache purpose allocation */
+
 /*
  * Magic header for a swap area. The first part of the union is
  * what the swap magic looks like for the old (limited to 128MB)
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..97298c0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -468,6 +468,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	if (cpuset_do_page_mem_spread()) {
 		get_mems_allowed();
 		n = cpuset_mem_spread_node();
+		gfp &= ~__GFP_RECLAIMABLE;
 		page = alloc_pages_exact_node(n, gfp, 0);
 		put_mems_allowed();
 		return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8587c10..f81c28f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1646,9 +1646,15 @@ zonelist_scan:
 				    classzone_idx, alloc_flags))
 				goto try_this_zone;
 
-			if (zone_reclaim_mode == 0)
+			if (zone_reclaim_mode == RECLAIM_OFF)
 				goto this_zone_full;
 
+			if (!(zone_reclaim_mode & RECLAIM_CACHE) &&
+			    (gfp_mask & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK) {
+				inc_zone_state(zone, NR_ZONE_CACHE_AVOID);
+				goto try_next_zone;
+			}
+
 			ret = zone_reclaim(zone, gfp_mask, order);
 			switch (ret) {
 			case ZONE_RECLAIM_NOSCAN:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..6f63eea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2558,11 +2558,6 @@ module_init(kswapd_init)
  */
 int zone_reclaim_mode __read_mostly;
 
-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
-
 /*
  * Priority for ZONE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
@@ -2646,6 +2641,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	};
 	unsigned long nr_slab_pages0, nr_slab_pages1;
 
+	inc_zone_state(zone, NR_ZONE_RECLAIM);
+
 	cond_resched();
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_SWAP
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..8988688 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -740,6 +740,8 @@ static const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"zone_cache_avoid",
+	"zone_reclaim",
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 	"pgpgin",
-- 
1.6.5.2





^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-16 10:01 ` KOSAKI Motohiro
@ 2010-09-16 17:06   ` Christoph Lameter
  2010-09-17  0:50     ` Robert Mueller
  2010-09-20  9:34   ` Mel Gorman
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2010-09-16 17:06 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: robm, linux-kernel, Bron Gondwana, linux-mm, Mel Gorman

On Thu, 16 Sep 2010, KOSAKI Motohiro wrote:

> > So over the last couple of weeks, I've noticed that our shiny new IMAP
> > servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
> > been performing as well as expected, and there were some big oddities.
> > Namely two things stuck out:
> >
> > 1. There was free memory. There's 20T of data on these machines. The
> >    kernel should have used lots of memory for caching, but for some
> >    reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G

This means that that the memory allocations did only occur on a single
processor? And with zone reclaim it only used one node since the page
cache was reclaimed?

> > Having very little knowledge of what this actually does, I'd just
> > like to point out that from a users point of view, it's really
> > annoying for your machine to be crippled by a default kernel setting
> > that's pretty obscure.

Thats an issue of the NUMA BIOS information. Kernel defaults to zone
reclaim if the cost of accessing remote memory vs local memory crosses
a certain threshhold which usually impacts performance.

> Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
> current zone_reclaim_mode doesn't fit file/web server usecase ;-)

Or one could also say that the web servers are not designed to properly
distribute the load on a complex NUMA based memory architecture of todays
Intel machines.

> So, I've created new proof concept patch. This doesn't disable zone_reclaim
> at all. Instead, distinguish for file cache and for anon allocation and
> only file cache doesn't use zone-reclaim.

zone reclaim was intended to only be applicable to unmapped file cache in
order to be low impact.  Now you just want to apply it to anonymous pages?

> That said, high-end hpc user often turn on cpuset.memory_spread_page and
> they avoid this issue. But, why don't we consider avoid it by default?

Well as you say setting memory spreading on would avoid the issue.

So would enabling memory interleave in the BIOS to get the machine to not
consider the memory distances but average out the NUMA effects.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-16 17:06   ` Christoph Lameter
@ 2010-09-17  0:50     ` Robert Mueller
  2010-09-17  6:01       ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Mueller @ 2010-09-17  0:50 UTC (permalink / raw)
  To: Christoph Lameter, KOSAKI Motohiro
  Cc: linux-kernel, Bron Gondwana, linux-mm, Mel Gorman

> > > Having very little knowledge of what this actually does, I'd just
> > > like to point out that from a users point of view, it's really
> > > annoying for your machine to be crippled by a default kernel
> > > setting that's pretty obscure.
>
> Thats an issue of the NUMA BIOS information. Kernel defaults to zone
> reclaim if the cost of accessing remote memory vs local memory crosses
> a certain threshhold which usually impacts performance.

We use what I thought was a fairly standard server type motherboard and
CPU combination, and I was surprised that things were so badly broken
for a standard usage scenario with a vanilla kernel with a default
configuration.

I'd point out that the cost of a remote memory access is many, many
orders of magnitude less than having to go back to disk! The problem is
that with zone_reclaim_mode = 1 it seems lots of memory was being wasted
that could be used as disk cache.

> > Yes, sadly intel motherboard turn on zone_reclaim_mode by
> > default. and current zone_reclaim_mode doesn't fit file/web
> > server usecase ;-)
>
> Or one could also say that the web servers are not designed to
> properly distribute the load on a complex NUMA based memory
> architecture of todays Intel machines.

I don't think this is any fault of how the software works. It's a *very*
standard "pre-fork child processes, allocate incoming connections to a
child process, open and mmap one or more files to read data from them".
That's not exactly a weird programming model, and it's bad that the
kernel is handling that case very badly with everything default.

> So would enabling memory interleave in the BIOS to get the machine to
> not consider the memory distances but average out the NUMA effects.

We'll see if the BIOS has an option for that and try it out. I'd like to
document for others that encounter this problem ways around it.

Rob

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-17  0:50     ` Robert Mueller
@ 2010-09-17  6:01       ` Shaohua Li
  2010-09-17  7:32         ` Robert Mueller
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2010-09-17  6:01 UTC (permalink / raw)
  To: robm@fastmail.fm
  Cc: Christoph Lameter, KOSAKI Motohiro, linux-kernel@vger.kernel.org,
	Bron Gondwana, linux-mm, Mel Gorman

On Fri, 2010-09-17 at 08:50 +0800, Robert Mueller wrote:
> > > > Having very little knowledge of what this actually does, I'd just
> > > > like to point out that from a users point of view, it's really
> > > > annoying for your machine to be crippled by a default kernel
> > > > setting that's pretty obscure.
> >
> > Thats an issue of the NUMA BIOS information. Kernel defaults to zone
> > reclaim if the cost of accessing remote memory vs local memory crosses
> > a certain threshhold which usually impacts performance.
> 
> We use what I thought was a fairly standard server type motherboard and
> CPU combination, and I was surprised that things were so badly broken
> for a standard usage scenario with a vanilla kernel with a default
> configuration.
> 
> I'd point out that the cost of a remote memory access is many, many
> orders of magnitude less than having to go back to disk! The problem is
> that with zone_reclaim_mode = 1 it seems lots of memory was being wasted
> that could be used as disk cache.
> 
> > > Yes, sadly intel motherboard turn on zone_reclaim_mode by
> > > default. and current zone_reclaim_mode doesn't fit file/web
> > > server usecase ;-)
> >
> > Or one could also say that the web servers are not designed to
> > properly distribute the load on a complex NUMA based memory
> > architecture of todays Intel machines.
> 
> I don't think this is any fault of how the software works. It's a *very*
> standard "pre-fork child processes, allocate incoming connections to a
> child process, open and mmap one or more files to read data from them".
> That's not exactly a weird programming model, and it's bad that the
> kernel is handling that case very badly with everything default.
maybe you incoming connection always happen on one CPU and you do the
page allocation in that cpu, so some nodes use out of memory but others
have a lot free. Try bind the child process to different nodes might
help.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-17  6:01       ` Shaohua Li
@ 2010-09-17  7:32         ` Robert Mueller
  2010-09-17 13:56           ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Mueller @ 2010-09-17  7:32 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Christoph Lameter, KOSAKI Motohiro, linux-kernel@vger.kernel.org,
	Bron Gondwana, linux-mm, Mel Gorman

> > I don't think this is any fault of how the software works. It's a
> > *very* standard "pre-fork child processes, allocate incoming
> > connections to a child process, open and mmap one or more files to
> > read data from them". That's not exactly a weird programming model,
> > and it's bad that the kernel is handling that case very badly with
> > everything default.
>
> maybe you incoming connection always happen on one CPU and you do the
> page allocation in that cpu, so some nodes use out of memory but
> others have a lot free. Try bind the child process to different nodes
> might help.

There's are 5000+ child processes (it's a cyrus IMAP server). Neither
the parent of any of the children are bound to any particular CPU. It
uses a standard fcntl lock to make sure only one spare child at a time
calls accept(). I don't think that's the problem.

Rob

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-17  7:32         ` Robert Mueller
@ 2010-09-17 13:56           ` Christoph Lameter
  2010-09-17 14:09             ` Bron Gondwana
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2010-09-17 13:56 UTC (permalink / raw)
  To: Robert Mueller
  Cc: Shaohua Li, KOSAKI Motohiro, linux-kernel@vger.kernel.org,
	Bron Gondwana, linux-mm, Mel Gorman

On Fri, 17 Sep 2010, Robert Mueller wrote:

> > > I don't think this is any fault of how the software works. It's a
> > > *very* standard "pre-fork child processes, allocate incoming
> > > connections to a child process, open and mmap one or more files to
> > > read data from them". That's not exactly a weird programming model,
> > > and it's bad that the kernel is handling that case very badly with
> > > everything default.
> >
> > maybe you incoming connection always happen on one CPU and you do the
> > page allocation in that cpu, so some nodes use out of memory but
> > others have a lot free. Try bind the child process to different nodes
> > might help.
>
> There's are 5000+ child processes (it's a cyrus IMAP server). Neither
> the parent of any of the children are bound to any particular CPU. It
> uses a standard fcntl lock to make sure only one spare child at a time
> calls accept(). I don't think that's the problem.

>From the first look that seems to be the problem. You do not need to be
bound to a particular cpu, the scheduler will just leave a single process
on the same cpu by default. If you then allocate all memory only from this
process then you get the scenario that you described.

There should be multiple processes allocating memory from all processors
to take full advantage of fast local memory. If you cannot do that then
the only choice is to reduce performance by some sort of interleaving
either at the Bios or OS level. OS level interleaving only for this
particular application would be best because then the OS can at least
allocate its own data in memory local to the processors.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-17 13:56           ` Christoph Lameter
@ 2010-09-17 14:09             ` Bron Gondwana
  2010-09-17 14:22               ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Bron Gondwana @ 2010-09-17 14:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robert Mueller, Shaohua Li, KOSAKI Motohiro,
	linux-kernel@vger.kernel.org, Bron Gondwana, linux-mm, Mel Gorman

On Fri, Sep 17, 2010 at 08:56:06AM -0500, Christoph Lameter wrote:
> On Fri, 17 Sep 2010, Robert Mueller wrote:
> 
> > > > I don't think this is any fault of how the software works. It's a
> > > > *very* standard "pre-fork child processes, allocate incoming
> > > > connections to a child process, open and mmap one or more files to
> > > > read data from them". That's not exactly a weird programming model,
> > > > and it's bad that the kernel is handling that case very badly with
> > > > everything default.
> > >
> > > maybe you incoming connection always happen on one CPU and you do the
> > > page allocation in that cpu, so some nodes use out of memory but
> > > others have a lot free. Try bind the child process to different nodes
> > > might help.
> >
> > There's are 5000+ child processes (it's a cyrus IMAP server). Neither
> > the parent of any of the children are bound to any particular CPU. It
> > uses a standard fcntl lock to make sure only one spare child at a time
> > calls accept(). I don't think that's the problem.
> 
> From the first look that seems to be the problem. You do not need to be
> bound to a particular cpu, the scheduler will just leave a single process
> on the same cpu by default. If you then allocate all memory only from this
> process then you get the scenario that you described.

Huh?  Which bit of forking server makes you think one process is allocating
lots of memory?  They're opening and reading from files.  Unless you're
calling the kernel a "single process".
 
> There should be multiple processes allocating memory from all processors
> to take full advantage of fast local memory. If you cannot do that then
> the only choice is to reduce performance by some sort of interleaving
> either at the Bios or OS level. OS level interleaving only for this
> particular application would be best because then the OS can at least
> allocate its own data in memory local to the processors.

In actual fact we're running 20 different Cyrus instances on this
machine, each with its own config file and own master file.  The only
"parentage" they share is they were most likely started from a single
bash shell at one point, because we start them up after the server is
already running from a management script.

So we're talking 20 Cyrus master processes, each of which forks off
hundreds of imapd processes, each of which listens, opens mailboxes
as required, reads and writes files.

You can't seriously tell me that the scheduler is putting ALL THESE
PROCESSES on a single CPU.

Bron.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-17 14:09             ` Bron Gondwana
@ 2010-09-17 14:22               ` Christoph Lameter
  2010-09-17 23:01                 ` Bron Gondwana
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2010-09-17 14:22 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Robert Mueller, Shaohua Li, KOSAKI Motohiro,
	linux-kernel@vger.kernel.org, linux-mm, Mel Gorman

On Sat, 18 Sep 2010, Bron Gondwana wrote:

> > From the first look that seems to be the problem. You do not need to be
> > bound to a particular cpu, the scheduler will just leave a single process
> > on the same cpu by default. If you then allocate all memory only from this
> > process then you get the scenario that you described.
>
> Huh?  Which bit of forking server makes you think one process is allocating
> lots of memory?  They're opening and reading from files.  Unless you're
> calling the kernel a "single process".

I have no idea what your app does. The data that I glanced over looks as
if most allocations happen for a particular memory node and since the
memory is optimized to be local to that node other memory is not used
intensively. This can occur because of allocations through one process /
thread that is always running on the same cpu and therefore always
allocates from the memory node local to that cpu.

It can also happen f.e. if a driver always allocates memory local to the
I/O bus that it is using.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-17 14:22               ` Christoph Lameter
@ 2010-09-17 23:01                 ` Bron Gondwana
  0 siblings, 0 replies; 31+ messages in thread
From: Bron Gondwana @ 2010-09-17 23:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Bron Gondwana, Robert Mueller, Shaohua Li, KOSAKI Motohiro,
	linux-kernel@vger.kernel.org, linux-mm, Mel Gorman

On Fri, Sep 17, 2010 at 09:22:00AM -0500, Christoph Lameter wrote:
> On Sat, 18 Sep 2010, Bron Gondwana wrote:
> 
> > > From the first look that seems to be the problem. You do not need to be
> > > bound to a particular cpu, the scheduler will just leave a single process
> > > on the same cpu by default. If you then allocate all memory only from this
> > > process then you get the scenario that you described.
> >
> > Huh?  Which bit of forking server makes you think one process is allocating
> > lots of memory?  They're opening and reading from files.  Unless you're
> > calling the kernel a "single process".
> 
> I have no idea what your app does. 

Ok - Cyrus IMAPd has been around for ages.  It's an open source email
server built on a very traditional single-process model.

* a master process which reads config files and manages the other process
* multiple imapd processes, one per connection
* multiple pop3d processes, one per connection
* multiple lmtpd processes, one per connection
* periodical "cleanup" processes.

Each of these is started by the lightweight master forking and then
execing the appropriate daemon.

In our configuration we run 20 separate "master" processes, each
managing a single disk partition's worth of email.  The reason
for this is reduced locking contention for the central mailboxes
database, and also better replication concurrency, because each
instance runs a single replication process - so replication is
sequential.

> The data that I glanced over looks as
> if most allocations happen for a particular memory node

Sorry, which data?

> and since the
> memory is optimized to be local to that node other memory is not used
> intensively. This can occur because of allocations through one process /
> thread that is always running on the same cpu and therefore always
> allocates from the memory node local to that cpu.

As Rob said, there are thousands of independent processes, each opening
a single mailbox (3 separate metadata files plus possibly hundreds of
individual email files).  It's likely that diffenent processes will open
the same mailbox over time - for example an email client opening multiple
concurrent connections, and at the same time an lmtpd connecting and
delivering new emails to the mailbox.

> It can also happen f.e. if a driver always allocates memory local to the
> I/O bus that it is using.

None of what we're doing is super weird advanced stuff, it's a vanilla
forking daemon where a single process run and does stuff on behalf of
a user.  The only slightly interesting things:

1) each "service" has a single lock file, and all the idle processes of
   that type (i.e. imapd) block on that lock while they're waiting for
   a connection.  This is to avoid thundering herd on operating systems
   which aren't nice about it.  The winner does the accept and handles
   the connection.
2) once it's finished processing a request, the process will wait for
   another connection rather than closing.

Nothing sounds like what you're talking about (one giant process that's
all on one CPU), and I don't know why you keep talking about it.  It's
nothing like what we're running on these machines.

Bron.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-16 10:01 ` KOSAKI Motohiro
  2010-09-16 17:06   ` Christoph Lameter
@ 2010-09-20  9:34   ` Mel Gorman
  2010-09-20 23:41     ` Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers Rob Mueller
  2010-09-21  1:05   ` Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers KAMEZAWA Hiroyuki
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2010-09-20  9:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: robm, linux-kernel, Bron Gondwana, linux-mm, Christoph Lameter

On Thu, Sep 16, 2010 at 07:01:32PM +0900, KOSAKI Motohiro wrote:
> Cc to linux-mm and hpc guys. and intetionally full quote.
> 
> 
> > So over the last couple of weeks, I've noticed that our shiny new IMAP
> > servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
> > been performing as well as expected, and there were some big oddities.
> > Namely two things stuck out:
> > 
> > 1. There was free memory. There's 20T of data on these machines. The
> >    kernel should have used lots of memory for caching, but for some
> >    reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G
> > 2. The machine has an SSD for very hot data. In total, there's about 16G
> >    of data on the SSD. Almost all of that 16G of data should end up
> >    being cached, so there should be little reading from the SSDs at all.
> >    Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a
> >    sign that caching wasn't working.
> > 
> > After a bunch of googling, I found this thread.
> > 
> > http://lkml.org/lkml/2009/5/12/586
> > 
> > It appears that patch never went anywhere, and zone_reclaim_mode is
> > still defaulting to 1 on our pretty standard file/email/web server type
> > machine with a NUMA kernel.
> > 
> > By changing it to 0, we saw an immediate massive change in caching
> > behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads
> > from the SSD dropped to 100/s instead of 2000/s.
> > 
> > Having very little knowledge of what this actually does, I'd just
> > like to point out that from a users point of view, it's really
> > annoying for your machine to be crippled by a default kernel setting
> > that's pretty obscure.
> > 
> > I don't think our usage scenario of serving lots of files is that
> > uncommon, every file server/email server/web server will be doing pretty
> > much that and expecting a large part of their memory to be used as a
> > cache, which clearly isn't what actually happens.
> > 
> > Rob
> > Rob Mueller
> > robm@fastmail.fm
> > 
> 
> Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
> current zone_reclaim_mode doesn't fit file/web server usecase ;-)
> 
> So, I've created new proof concept patch. This doesn't disable zone_reclaim
> at all. Instead, distinguish for file cache and for anon allocation and
> only file cache doesn't use zone-reclaim.
> 
> That said, high-end hpc user often turn on cpuset.memory_spread_page and
> they avoid this issue. But, why don't we consider avoid it by default?
> 
> 
> Rob, I wonder if following patch help you. Could you please try it?
> 
> 
> Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default
> 
> ---
> Need to removed debbuging piece.
> 
>  Documentation/sysctl/vm.txt |    7 +++----
>  fs/inode.c                  |    2 +-
>  include/linux/gfp.h         |    9 +++++++--
>  include/linux/mmzone.h      |    2 ++
>  include/linux/swap.h        |    6 ++++++
>  mm/filemap.c                |    1 +
>  mm/page_alloc.c             |    8 +++++++-
>  mm/vmscan.c                 |    7 ++-----
>  mm/vmstat.c                 |    2 ++
>  9 files changed, 31 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b606c2c..4be569e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -671,16 +671,15 @@ This is value ORed together of
>  1	= Zone reclaim on
>  2	= Zone reclaim writes dirty pages out
>  4	= Zone reclaim swaps pages
> +8	= Zone reclaim for file cache on
>  
>  zone_reclaim_mode is set during bootup to 1 if it is determined that pages
>  from remote zones will cause a measurable performance reduction. The
>  page allocator will then reclaim easily reusable pages (those page
>  cache pages that are currently not used) before allocating off node pages.
>  
> -It may be beneficial to switch off zone reclaim if the system is
> -used for a file server and all of memory should be used for caching files
> -from disk. In that case the caching effect is more important than
> -data locality.
> +By default, for file cache allocation doesn't use zone reclaim. But
> +It can be turned on manually.
>  
>  Allowing zone reclaim to write out pages stops processes that are
>  writing large amounts of data from dirtying pages on other nodes. Zone
> diff --git a/fs/inode.c b/fs/inode.c
> index 8646433..02a51b1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -166,7 +166,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  	mapping->a_ops = &empty_aops;
>  	mapping->host = inode;
>  	mapping->flags = 0;
> -	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> +	mapping_set_gfp_mask(mapping, GFP_FILE_CACHE);
>  	mapping->assoc_mapping = NULL;
>  	mapping->backing_dev_info = &default_backing_dev_info;
>  	mapping->writeback_index = 0;
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 975609c..f263b1f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -84,6 +84,10 @@ struct vm_area_struct;
>  #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
>  				 __GFP_HARDWALL | __GFP_HIGHMEM | \
>  				 __GFP_MOVABLE)
> +
> +#define GFP_FILE_CACHE	(GFP_HIGHUSER | __GFP_RECLAIMABLE | __GFP_MOVABLE)
> +

This mask of both __GFP_RECLAIMABLE and __GFP_MOVABLE makes no sense at
all in terms of fragmentation avoidance. In fact, I'm surprised it didn't
trigger the warning in allocflags_to_migratetype() during your testing.

> +
>  #define GFP_IOFS	(__GFP_IO | __GFP_FS)
>  
>  #ifdef CONFIG_NUMA
> @@ -120,11 +124,12 @@ struct vm_area_struct;
>  /* Convert GFP flags to their corresponding migrate type */
>  static inline int allocflags_to_migratetype(gfp_t gfp_flags)
>  {
> -	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> -

Ah, you deleted the check.

>  	if (unlikely(page_group_by_mobility_disabled))
>  		return MIGRATE_UNMOVABLE;
>  
> +	if ((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK)
> +		gfp_flags &= ~__GFP_RECLAIMABLE;
> +

So you delete the flag, maybe it's obvious why later.

>  	/* Group based on mobility */
>  	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
>  		((gfp_flags & __GFP_RECLAIMABLE) != 0);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6e6e626..2eead52 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -112,6 +112,8 @@ enum zone_stat_item {
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif
> +	NR_ZONE_CACHE_AVOID,
> +	NR_ZONE_RECLAIM,
>  	NR_VM_ZONE_STAT_ITEMS };
>  
>  /*
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fee51a..487bc3b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -65,6 +65,12 @@ static inline int current_is_kswapd(void)
>  #define MAX_SWAPFILES \
>  	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>  
> +#define RECLAIM_OFF 0
> +#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
> +#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> +#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
> +#define RECLAIM_CACHE (1<<3)	/* Reclaim even though file cache purpose allocation */
> +
>  /*
>   * Magic header for a swap area. The first part of the union is
>   * what the swap magic looks like for the old (limited to 128MB)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3d4df44..97298c0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -468,6 +468,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
>  	if (cpuset_do_page_mem_spread()) {
>  		get_mems_allowed();
>  		n = cpuset_mem_spread_node();
> +		gfp &= ~__GFP_RECLAIMABLE;
>  		page = alloc_pages_exact_node(n, gfp, 0);
>  		put_mems_allowed();
>  		return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8587c10..f81c28f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1646,9 +1646,15 @@ zonelist_scan:
>  				    classzone_idx, alloc_flags))
>  				goto try_this_zone;
>  
> -			if (zone_reclaim_mode == 0)
> +			if (zone_reclaim_mode == RECLAIM_OFF)
>  				goto this_zone_full;
>  
> +			if (!(zone_reclaim_mode & RECLAIM_CACHE) &&
> +			    (gfp_mask & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK) {
> +				inc_zone_state(zone, NR_ZONE_CACHE_AVOID);
> +				goto try_next_zone;
> +			}
> +

That doesn't look very nice. There has to be a better way of identifying what
sort of allocations to avoid reclaim_mode for than passing in a meaningless
combination of migrate flags. Are we out of GFP flags? Whether it is one that
specifies it's an allocation for file-backed page cache or something that
indicates reclaim_mode is unnecessary, I don't really mind but it shouldn't
be magically encoded in the migrate flags.

I don't think we will ever get the default value for this tunable right.
I would also worry that avoiding the reclaim_mode for file-backed cache
will hurt HPC applications that are dumping their data to disk and
depending on the existing default for zone_reclaim_mode to not pollute
other nodes.

The ideal would be if distribution packages for mail, web servers and
others that are heavily IO orientated would prompt for a change to the
default value of zone_reclaim_mode in sysctl.

>  			ret = zone_reclaim(zone, gfp_mask, order);
>  			switch (ret) {
>  			case ZONE_RECLAIM_NOSCAN:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c391c32..6f63eea 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2558,11 +2558,6 @@ module_init(kswapd_init)
>   */
>  int zone_reclaim_mode __read_mostly;
>  
> -#define RECLAIM_OFF 0
> -#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> -#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
> -
>  /*
>   * Priority for ZONE_RECLAIM. This determines the fraction of pages
>   * of a node considered for each zone_reclaim. 4 scans 1/16th of
> @@ -2646,6 +2641,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	};
>  	unsigned long nr_slab_pages0, nr_slab_pages1;
>  
> +	inc_zone_state(zone, NR_ZONE_RECLAIM);
> +
>  	cond_resched();
>  	/*
>  	 * We need to be able to allocate from the reserves for RECLAIM_SWAP
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f389168..8988688 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -740,6 +740,8 @@ static const char * const vmstat_text[] = {
>  	"numa_local",
>  	"numa_other",
>  #endif
> +	"zone_cache_avoid",
> +	"zone_reclaim",
>  
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  	"pgpgin",
> -- 
> 1.6.5.2
> 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-20  9:34   ` Mel Gorman
@ 2010-09-20 23:41     ` Rob Mueller
  2010-09-21  9:04       ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Rob Mueller @ 2010-09-20 23:41 UTC (permalink / raw)
  To: Mel Gorman, KOSAKI Motohiro
  Cc: linux-kernel, Bron Gondwana, linux-mm, Christoph Lameter

> I don't think we will ever get the default value for this tunable right.
> I would also worry that avoiding the reclaim_mode for file-backed
> cache will hurt HPC applications that are dumping their data to disk
> and depending on the existing default for zone_reclaim_mode to not
> pollute other nodes.
>
> The ideal would be if distribution packages for mail, web servers
> and others that are heavily IO orientated would prompt for a change
> to the default value of zone_reclaim_mode in sysctl.

I would argue that there's a lot more mail/web/file servers out there than 
HPC machines. And HPC machines tend to have a team of people to 
monitor/tweak them. I think it would be much more sane to default this to 0 
which works best for most people, and get the HPC people to change it.

However there's still another question, why is this problem happening at all 
for us? I know almost nothing about NUMA, but from other posts, it sounds 
like the problem is the memory allocations are all happening on one node? 
But I don't understand why that would be happening. The machine runs the 
cyrus IMAP server, which is a classic unix forking server with 1000's of 
processes. Each process will mmap lots of different files to access them. 
Why would that all be happening on one node, not spread around?

One thing is that the machine is vastly more IO loaded than CPU loaded, in 
fact it uses very little CPU at all (a few % usually). Does the kernel 
prefer to run processes on one particular node if it's available? So if a 
machine has very little CPU load, every process will generally end up 
running on the same node?

Rob

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-20 23:41     ` Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers Rob Mueller
@ 2010-09-21  9:04       ` Mel Gorman
  2010-09-21 14:14         ` Christoph Lameter
  2010-09-27  2:01         ` KOSAKI Motohiro
  0 siblings, 2 replies; 31+ messages in thread
From: Mel Gorman @ 2010-09-21  9:04 UTC (permalink / raw)
  To: Rob Mueller
  Cc: KOSAKI Motohiro, linux-kernel, Bron Gondwana, linux-mm,
	Christoph Lameter

On Tue, Sep 21, 2010 at 09:41:21AM +1000, Rob Mueller wrote:
>> I don't think we will ever get the default value for this tunable right.
>> I would also worry that avoiding the reclaim_mode for file-backed
>> cache will hurt HPC applications that are dumping their data to disk
>> and depending on the existing default for zone_reclaim_mode to not
>> pollute other nodes.
>>
>> The ideal would be if distribution packages for mail, web servers
>> and others that are heavily IO orientated would prompt for a change
>> to the default value of zone_reclaim_mode in sysctl.
>
> I would argue that there's a lot more mail/web/file servers out there 
> than HPC machines. And HPC machines tend to have a team of people to  
> monitor/tweak them. I think it would be much more sane to default this to 
> 0 which works best for most people, and get the HPC people to change it.
>

No doubt this is true. The only real difference is that there are more NUMA
machines running mail/web/file servers now than there might have been in the
past. The default made sense once upon a time. Personally I wouldn't mind
the default changing but my preference would be that distribution packages
installing on NUMA machines would prompt if the default should be changed if it
is likely to be of benefit for that package (e.g. the mail, file and web ones).

> However there's still another question, why is this problem happening at 
> all for us? I know almost nothing about NUMA, but from other posts, it 
> sounds like the problem is the memory allocations are all happening on 
> one node?

Yes.

> But I don't understand why that would be happening.

Because in a situation where you have many NUMA-aware applications
running bound to CPUs, it performs better if they always allocate from
local nodes instead of accessing remote nodes. It's great for one type
of workload but not so much for mail/web/file.

> The machine 
> runs the cyrus IMAP server, which is a classic unix forking server with 
> 1000's of processes. Each process will mmap lots of different files to 
> access them. Why would that all be happening on one node, not spread 
> around?
>

Honestly, I don't know and I don't have such a machine to investigate
with. My guess is that there are a number of files that are hot and
accessed by multiple processes on different nodes and they are evicting
each other but it's only a guess.

> One thing is that the machine is vastly more IO loaded than CPU loaded, 
> in fact it uses very little CPU at all (a few % usually). Does the kernel 
> prefer to run processes on one particular node if it's available?

It prefers to run on the same node it ran previously. If they all
happened to start up on a small subset of nodes, they could be
continually getting running there.

> So if a 
> machine has very little CPU load, every process will generally end up  
> running on the same node?
>

It's possible they are running on a small subset. mpstat should be able
to give a basic idea of what the spread across CPUs is.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-21  9:04       ` Mel Gorman
@ 2010-09-21 14:14         ` Christoph Lameter
  2010-09-22  3:44           ` Rob Mueller
  2010-09-27  2:01         ` KOSAKI Motohiro
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2010-09-21 14:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rob Mueller, KOSAKI Motohiro, linux-kernel, Bron Gondwana,
	linux-mm

On Tue, 21 Sep 2010, Mel Gorman wrote:

> > However there's still another question, why is this problem happening at
> > all for us? I know almost nothing about NUMA, but from other posts, it
> > sounds like the problem is the memory allocations are all happening on
> > one node?
>
> Yes.

This could be a screwy hardware issue as pointed out before. Certain
controllers restrict the memory that I/O can be done to also (32 bit
controller only able to do I/O to lower 2G?, controller on a PCI bus that
is local only to a particular node) which would make balancing
the file cache difficult.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-21 14:14         ` Christoph Lameter
@ 2010-09-22  3:44           ` Rob Mueller
  0 siblings, 0 replies; 31+ messages in thread
From: Rob Mueller @ 2010-09-22  3:44 UTC (permalink / raw)
  To: Christoph Lameter, Mel Gorman
  Cc: KOSAKI Motohiro, linux-kernel, Bron Gondwana, linux-mm


> This could be a screwy hardware issue as pointed out before. Certain
> controllers restrict the memory that I/O can be done to also (32 bit
> controller only able to do I/O to lower 2G?, controller on a PCI bus that
> is local only to a particular node) which would make balancing
> the file cache difficult.

Ah interesting. Is there an easy way to tell if this is an issue? It's an 
ARECA RAID controller, this is the lspci -vvv data from it...

03:00.0 RAID bus controller: Areca Technology Corp. Device 1680
        Subsystem: Areca Technology Corp. Device 1680
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 26
        Region 0: Memory at b1900000 (32-bit, non-prefetchable) [size=8K]
        Expansion ROM at b1c00000 [disabled] [size=64K]
        Capabilities: [98] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a0] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/1 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [d0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s 
unlimited, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 256 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- 
TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM unknown, 
Latency L0 <128ns, L1 unlimited
                        ClockPM- Suprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- 
CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: arcmsr



Rob


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-21  9:04       ` Mel Gorman
  2010-09-21 14:14         ` Christoph Lameter
@ 2010-09-27  2:01         ` KOSAKI Motohiro
  2010-09-27 13:53           ` Christoph Lameter
  1 sibling, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2010-09-27  2:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Rob Mueller, linux-kernel, Bron Gondwana,
	linux-mm, Christoph Lameter

Hi

> No doubt this is true. The only real difference is that there are more NUMA
> machines running mail/web/file servers now than there might have been in the
> past. The default made sense once upon a time. Personally I wouldn't mind
> the default changing but my preference would be that distribution packages
> installing on NUMA machines would prompt if the default should be changed if it
> is likely to be of benefit for that package (e.g. the mail, file and web ones).

At first impression, I thought this is cute idea. But, after while thinking, I've found some
weak point. The problem is, too many package need to disable zone_reclaim_mode.
zone_reclaim doesn't works fine if an application need large working set rather than
local node size. It mean major desktop applications (e.g. OpenOffice.org, Firefox, GIMP)
need to disable zone_reclaim. It mean even though basic package installation require 
zone_reclaim disabling. Then, this mechanism doesn't works practically. Even though
the user hope to use the machine for hpc, disable zone_reclaim will be turn on anyway.

Probably, opposite switch (default is zone_reclaim=0, and installation MPI library change
to zone_reclaim=1) might works. but I can guess why you don't propose this one.

Hmm....

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-27  2:01         ` KOSAKI Motohiro
@ 2010-09-27 13:53           ` Christoph Lameter
  2010-09-27 23:17             ` Robert Mueller
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Christoph Lameter @ 2010-09-27 13:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Rob Mueller, linux-kernel, Bron Gondwana, linux-mm

On Mon, 27 Sep 2010, KOSAKI Motohiro wrote:

> > No doubt this is true. The only real difference is that there are more NUMA
> > machines running mail/web/file servers now than there might have been in the
> > past. The default made sense once upon a time. Personally I wouldn't mind
> > the default changing but my preference would be that distribution packages
> > installing on NUMA machines would prompt if the default should be changed if it
> > is likely to be of benefit for that package (e.g. the mail, file and web ones).
>
> At first impression, I thought this is cute idea. But, after while thinking, I've found some
> weak point. The problem is, too many package need to disable zone_reclaim_mode.
> zone_reclaim doesn't works fine if an application need large working set rather than
> local node size. It mean major desktop applications (e.g. OpenOffice.org, Firefox, GIMP)
> need to disable zone_reclaim. It mean even though basic package installation require
> zone_reclaim disabling. Then, this mechanism doesn't works practically. Even though
> the user hope to use the machine for hpc, disable zone_reclaim will be turn on anyway.
>
> Probably, opposite switch (default is zone_reclaim=0, and installation MPI library change
> to zone_reclaim=1) might works. but I can guess why you don't propose this one.

The fundamental problem that needs to be addressed is the balancing of a
memory load in a system with memory ranges that have different performance
characteristics when running conventional software that does not
properly balance allocations and that has not been written with these
new memory balancing issues in mind.

You can switch off zone reclaim of course which means that the
applications will not be getting memory thats optimal for them to access.
Given the current minimal NUMA differences in most single server systems
this is likely not going to matter. In fact the kernel has such a
mechanism to switch off zone reclaim for such systems (see the definition
of RECLAIM_DISTANCE). Which seems to have somehow been defeated by the
ACPI information of those machines which indicate a high latency
difference between the memory areas. The arch code could be adjusted to
set a higher RECLAIM_DISTANCE so that this motherboard also will default
to zone reclaim mode off.

However, the larger the NUMA effects become the more the performance loss
due to these effect. Its expected that the number of processors and
therefore also the NUMA effects in coming generations of machines will
increase. Various API exist to do finer grained memory access control so
that the performance loss can be isolated to processes or memory ranges.

F.e. running the application with numactl (using interleave) or a cpuset
with round robin on could address this issue without changing zone
reclaim and would allow other processes to allocate faster local memory.

The problem with zone reclaim mainly is created for large apps whose
working set is larger than the local node. The special settings are only
needing for those applications.

What can be done here is:

1. Fix the ACPI information to indicate lower memory access differences
   (was that info actually acurate?) so that zone reclaim defaults to off.

2. Change the RECLAIM_DISTANCE setting for the arch so that the ACPI
   information does not trigger zone reclaim to be enabled.

3. Run the application with numactl settings for interleaving of memory
   accesses (or corresponding cpuset settings).

4. Fix the application to be conscious of the effect of memory allocations
   on a NUMA systems. Use the numa memory allocations API to allocate
   anonymous memory locally for optimal access and set interleave for the
   file backed pages.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-27 13:53           ` Christoph Lameter
@ 2010-09-27 23:17             ` Robert Mueller
  2010-09-28 12:35               ` Christoph Lameter
  2010-09-30  7:05             ` Andi Kleen
  2010-10-04 12:45             ` KOSAKI Motohiro
  2 siblings, 1 reply; 31+ messages in thread
From: Robert Mueller @ 2010-09-27 23:17 UTC (permalink / raw)
  To: Christoph Lameter, KOSAKI Motohiro
  Cc: Mel Gorman, linux-kernel, Bron Gondwana, linux-mm

> You can switch off zone reclaim of course which means that the
> applications will not be getting memory thats optimal for them to access.

That's true, but also remember that going to disk is going to be way
more expensive than memory on another node. What we found was that data
that should have been cached because it was being accessed a lot, wasn't
being cached, so it had to keep going back to disk to get it. That's
even worse.

> 1. Fix the ACPI information to indicate lower memory access
>    differences (was that info actually acurate?) so that zone reclaim
>    defaults to off.
> 
> 2. Change the RECLAIM_DISTANCE setting for the arch so that the ACPI
>    information does not trigger zone reclaim to be enabled.

How would the ACPI information actually be changed?

I ran numactl -H to get the hardware information, and that seems to
include distances. As mentioned previously, this is a very standard
Intel server motherboard.

http://www.intel.com/Products/Server/Motherboards/S5520UR/S5520UR-specifications.htm

Intel 5520 chipset with Intel I/O Controller Hub ICH10R

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 24517 MB
node 0 free: 1523 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 24576 MB
node 1 free: 39 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Since I'm not sure what the "distance" values mean, I have no idea if
those values large or not?

> 3. Run the application with numactl settings for interleaving of
>    memory accesses (or corresponding cpuset settings).
>
> 4. Fix the application to be conscious of the effect of memory
>    allocations on a NUMA systems. Use the numa memory allocations API
>    to allocate anonymous memory locally for optimal access and set
>    interleave for the file backed pages.

The problem we saw was purely with file caching. The application wasn't
actually allocating much memory itself, but it was reading lots of files
from disk (via mmap'ed memory mostly), and as most people would, we
expected that data would be cached in memory to reduce future reads from
disk. That was not happening.

Rob

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-27 23:17             ` Robert Mueller
@ 2010-09-28 12:35               ` Christoph Lameter
  2010-09-28 12:42                 ` Bron Gondwana
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2010-09-28 12:35 UTC (permalink / raw)
  To: Robert Mueller
  Cc: KOSAKI Motohiro, Mel Gorman, linux-kernel, Bron Gondwana,
	linux-mm

On Tue, 28 Sep 2010, Robert Mueller wrote:

> How would the ACPI information actually be changed?

Fix the BIOS SLIT distance tables.

> I ran numactl -H to get the hardware information, and that seems to
> include distances. As mentioned previously, this is a very standard
> Intel server motherboard.
>
> http://www.intel.com/Products/Server/Motherboards/S5520UR/S5520UR-specifications.htm
>
> Intel 5520 chipset with Intel I/O Controller Hub ICH10R
>
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 24517 MB
> node 0 free: 1523 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 24576 MB
> node 1 free: 39 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10

21 is larger than REMOTE_DISTANCE on x86 and triggers zone_reclaim

19 would keep it off.


> Since I'm not sure what the "distance" values mean, I have no idea if
> those values large or not?

Distance values represent the additional latency necessary to access
remote memory vs local memory (10)

> > 4. Fix the application to be conscious of the effect of memory
> >    allocations on a NUMA systems. Use the numa memory allocations API
> >    to allocate anonymous memory locally for optimal access and set
> >    interleave for the file backed pages.
>
> The problem we saw was purely with file caching. The application wasn't
> actually allocating much memory itself, but it was reading lots of files
> from disk (via mmap'ed memory mostly), and as most people would, we
> expected that data would be cached in memory to reduce future reads from
> disk. That was not happening.

Obviously and you have stated that numerous times. Problem that the use of
a remote memory will reduced performance of reads so the OS (with
zone_reclaim=1) defaults to the use of local memory and favors reclaim of
local memory over the allocation from the remote node. This is fine if
you have multiple applications running on both nodes because then each
application will get memory local to it and therefore run faster. That
does not work with a single app that only allocates from one node.

Control over memory allocations over the various nodes under NUMA
for a process can occur via the numactl ctl or the libnuma C apis.

F.e.e

numactl --interleave ... command

will address that issue for a specific command that needs to go

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-28 12:35               ` Christoph Lameter
@ 2010-09-28 12:42                 ` Bron Gondwana
  2010-09-28 12:49                   ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Bron Gondwana @ 2010-09-28 12:42 UTC (permalink / raw)
  To: Christoph Lameter, Robert Mueller
  Cc: KOSAKI Motohiro, Mel Gorman, Linux Kernel Mailing List, linux-mm

On Tue, 28 Sep 2010 07:35 -0500, "Christoph Lameter" <cl@linux.com> wrote:
> > The problem we saw was purely with file caching. The application wasn't
> > actually allocating much memory itself, but it was reading lots of files
> > from disk (via mmap'ed memory mostly), and as most people would, we
> > expected that data would be cached in memory to reduce future reads from
> > disk. That was not happening.
> 
> Obviously and you have stated that numerous times. Problem that the use
> of
> a remote memory will reduced performance of reads so the OS (with
> zone_reclaim=1) defaults to the use of local memory and favors reclaim of
> local memory over the allocation from the remote node. This is fine if
> you have multiple applications running on both nodes because then each
> application will get memory local to it and therefore run faster. That
> does not work with a single app that only allocates from one node.

Is this what's happening, or is IO actually coming from disk in preference
to the remote node?  I can certainly see the logic behind preferring to
reclaim the local node if that's all that's happening - though the OS should
be allocating the different tasks more evenly across the nodes in that case.

> Control over memory allocations over the various nodes under NUMA
> for a process can occur via the numactl ctl or the libnuma C apis.
> 
> F.e.e
> 
> numactl --interleave ... command
> 
> will address that issue for a specific command that needs to go

Gosh what a pain.  While it won't kill us too much to add to our
startup, it does feel a lot like the tail is wagging the dog from here
still.  A task that doesn't ask for anything special should get sane
defaults, and the cost of data from the other node should be a lot
less than the cost of the same data from spinning rust.

Bron.
-- 
  Bron Gondwana
  brong@fastmail.fm


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-28 12:42                 ` Bron Gondwana
@ 2010-09-28 12:49                   ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2010-09-28 12:49 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Robert Mueller, KOSAKI Motohiro, Mel Gorman,
	Linux Kernel Mailing List, linux-mm

On Tue, 28 Sep 2010, Bron Gondwana wrote:

> Is this what's happening, or is IO actually coming from disk in preference
> to the remote node?  I can certainly see the logic behind preferring to
> reclaim the local node if that's all that's happening - though the OS should
> be allocating the different tasks more evenly across the nodes in that case.

Not sure about the disk. I did not see anything that would indicate and
issue with only being able to do 32 bit and I am no expert on the device
driver operations.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-27 13:53           ` Christoph Lameter
  2010-09-27 23:17             ` Robert Mueller
@ 2010-09-30  7:05             ` Andi Kleen
  2010-10-04 12:45             ` KOSAKI Motohiro
  2 siblings, 0 replies; 31+ messages in thread
From: Andi Kleen @ 2010-09-30  7:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Mel Gorman, Rob Mueller, linux-kernel,
	Bron Gondwana, linux-mm

Christoph Lameter <cl@linux.com> writes:
>
> 1. Fix the ACPI information to indicate lower memory access differences
>    (was that info actually acurate?) so that zone reclaim defaults to off.

The reason the ACPI information is set this way is that the people who
tune the BIOS have some workload they care about which prefers zone
reclaim off and they know they can force this "faster setting" by faking
the distances.

Basically they're working around a Linux performance quirk.

Really I think some variant of Motohiro-san's patch is the right
solution: most problems with zone reclaim are related to IO 
intensive workloads and it never made sense to have the unmapped
disk cache local on a system with reasonably small NUMA factor.

The only problem is on extremly big NUMA systems where remote nodes
are so slow that it's too slow even for read() and write().
I have been playing with the idea of adding a new "nearby interleave"
NUMA mode for this, but didn't have time to implement it so far.

For application I don't think we can ever solve it completely, this
probably always needs some kind of tuning. Currently the NUMA policy
APIs are not too good for this because they are too static, e.g. in some
cases "nearby" without fixed node affinity would also help here.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-09-27 13:53           ` Christoph Lameter
  2010-09-27 23:17             ` Robert Mueller
  2010-09-30  7:05             ` Andi Kleen
@ 2010-10-04 12:45             ` KOSAKI Motohiro
  2010-10-04 13:07               ` Christoph Lameter
  2010-10-04 19:43               ` David Rientjes
  2 siblings, 2 replies; 31+ messages in thread
From: KOSAKI Motohiro @ 2010-10-04 12:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Rob Mueller, linux-kernel,
	Bron Gondwana, linux-mm

Hi

> On Mon, 27 Sep 2010, KOSAKI Motohiro wrote:
> 
> > > No doubt this is true. The only real difference is that there are more NUMA
> > > machines running mail/web/file servers now than there might have been in the
> > > past. The default made sense once upon a time. Personally I wouldn't mind
> > > the default changing but my preference would be that distribution packages
> > > installing on NUMA machines would prompt if the default should be changed if it
> > > is likely to be of benefit for that package (e.g. the mail, file and web ones).
> >
> > At first impression, I thought this is cute idea. But, after while thinking, I've found some
> > weak point. The problem is, too many package need to disable zone_reclaim_mode.
> > zone_reclaim doesn't works fine if an application need large working set rather than
> > local node size. It mean major desktop applications (e.g. OpenOffice.org, Firefox, GIMP)
> > need to disable zone_reclaim. It mean even though basic package installation require
> > zone_reclaim disabling. Then, this mechanism doesn't works practically. Even though
> > the user hope to use the machine for hpc, disable zone_reclaim will be turn on anyway.
> >
> > Probably, opposite switch (default is zone_reclaim=0, and installation MPI library change
> > to zone_reclaim=1) might works. but I can guess why you don't propose this one.
> 
> The fundamental problem that needs to be addressed is the balancing of a
> memory load in a system with memory ranges that have different performance
> characteristics when running conventional software that does not
> properly balance allocations and that has not been written with these
> new memory balancing issues in mind.

Yeah. page cache often have very long life than processes. then, CPU place
which current process running is not so good heuristics. and kernel don't 
have good statistics to find best node for cache. That's problem.
How do we know future processes work on which cpus?

Also, CPU scheduler have an issue. IO intensive workload often makes
unbalanced process layout. (cpus haven't been so busy yet. why do we
need to make costly cpu migration?). end up, memory consumption also 
become unbalanced. this is also difficult issue. hmm..

> 
> You can switch off zone reclaim of course which means that the
> applications will not be getting memory thats optimal for them to access.
> Given the current minimal NUMA differences in most single server systems
> this is likely not going to matter. In fact the kernel has such a
> mechanism to switch off zone reclaim for such systems (see the definition
> of RECLAIM_DISTANCE). Which seems to have somehow been defeated by the
> ACPI information of those machines which indicate a high latency
> difference between the memory areas. The arch code could be adjusted to
> set a higher RECLAIM_DISTANCE so that this motherboard also will default
> to zone reclaim mode off.

Yup.

> 
> However, the larger the NUMA effects become the more the performance loss
> due to these effect. Its expected that the number of processors and
> therefore also the NUMA effects in coming generations of machines will
> increase. Various API exist to do finer grained memory access control so
> that the performance loss can be isolated to processes or memory ranges.
> 
> F.e. running the application with numactl (using interleave) or a cpuset
> with round robin on could address this issue without changing zone
> reclaim and would allow other processes to allocate faster local memory.
> 
> The problem with zone reclaim mainly is created for large apps whose
> working set is larger than the local node. The special settings are only
> needing for those applications.

In theory, yes. but please talk with userland developers. They always say
"Our software work fine on *BSD, Solaris, Mac, etc etc. that's definitely 
linux problem". /me have no way to persuade them ;-)

> 
> What can be done here is:
> 
> 1. Fix the ACPI information to indicate lower memory access differences
>    (was that info actually acurate?) so that zone reclaim defaults to off.

I think it's accurate. and I don't think this is easy works because
there are many mothorboard vendor in the world and we don't have a way of
communicate them. That's difficulty of the commodity.

> 
> 2. Change the RECLAIM_DISTANCE setting for the arch so that the ACPI
>    information does not trigger zone reclaim to be enabled.

This is one of option. but we don't need to create x86 arch specific
RECLAIM_DISTANCE. Because practical high-end numa machine are either
ia64(SGI, Fujitsu) or Power(IBM) and both platform already have arch
specific definition. then changing RECLAIM_DISTANCE doesn't make any
side effect on such platform. and if possible, x86 shouldn't have
arch specific definition because almost minor arch don't have a lot of
tester and its quality often depend on testing on x86.

attached a patch below.

> 3. Run the application with numactl settings for interleaving of memory
>    accesses (or corresponding cpuset settings).

If the problem was on only few atypical software, this makes sense.
but I don't think this is practical way on current situation.

> 
> 4. Fix the application to be conscious of the effect of memory allocations
>    on a NUMA systems. Use the numa memory allocations API to allocate
>    anonymous memory locally for optimal access and set interleave for the
>    file backed pages.

For performance, this is best way definitely. And MySQL or other DB software
should concern this, I believe.
But, again, the problem is, too many software don't match zone_reclaim_mode.

>From d54928bfb4b2b865bedcff17e9b45dfbb714a5e6 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Thu, 14 Oct 2010 13:48:21 +0900
Subject: [PATCH] mm: increase RECLAIM_DISTANCE to 30

Recently, Robert Mueller reported zone_reclaim_mode doesn't work
properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
He is using Cyrus IMAPd and it's built on a very traditional
single-process model.

  * a master process which reads config files and manages the other
    process
  * multiple imapd processes, one per connection
  * multiple pop3d processes, one per connection
  * multiple lmtpd processes, one per connection
  * periodical "cleanup" processes.

Then, there are thousands of independent processes. The problem is,
recent Intel motherboard turn on zone_reclaim_mode by default and
traditional prefork model software don't work fine on it.
Unfortunatelly, Such model is still typical one even though 21th
century. We can't ignore them.

This patch raise zone_reclaim_mode threshold to 30. 30 don't have
specific meaning. but 20 mean one-hop QPI/Hypertransport and such
relatively cheap 2-4 socket machine are often used for tradiotional
server as above. The intention is, their machine don't use
zone_reclaim_mode.

Note: ia64 and Power have arch specific RECLAIM_DISTANCE definition.
then this patch doesn't change such high-end NUMA machine behavior.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux.com>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: Robert Mueller <robm@fastmail.fm>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/topology.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 64e084f..bfbec49 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
  * (in whatever arch specific measurement units returned by node_distance())
  * then switch on zone reclaim on boot.
  */
-#define RECLAIM_DISTANCE 20
+#define RECLAIM_DISTANCE 30
 #endif
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
-- 
1.6.5.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-10-04 12:45             ` KOSAKI Motohiro
@ 2010-10-04 13:07               ` Christoph Lameter
  2010-10-05  5:32                 ` KOSAKI Motohiro
  2010-10-04 19:43               ` David Rientjes
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2010-10-04 13:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Rob Mueller, linux-kernel, Bron Gondwana, linux-mm

On Mon, 4 Oct 2010, KOSAKI Motohiro wrote:

> > The problem with zone reclaim mainly is created for large apps whose
> > working set is larger than the local node. The special settings are only
> > needing for those applications.
>
> In theory, yes. but please talk with userland developers. They always say
> "Our software work fine on *BSD, Solaris, Mac, etc etc. that's definitely
> linux problem". /me have no way to persuade them ;-)

Do those support NUMA? I would think not. You would have to switch on
interleave at the BIOS level (getting a hardware hack in place to get
rid of the NUMA effects) to make these OSes run right.

> This is one of option. but we don't need to create x86 arch specific
> RECLAIM_DISTANCE. Because practical high-end numa machine are either
> ia64(SGI, Fujitsu) or Power(IBM) and both platform already have arch
> specific definition. then changing RECLAIM_DISTANCE doesn't make any
> side effect on such platform. and if possible, x86 shouldn't have
> arch specific definition because almost minor arch don't have a lot of
> tester and its quality often depend on testing on x86.
>
> attached a patch below.

Looks good.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-10-04 13:07               ` Christoph Lameter
@ 2010-10-05  5:32                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 31+ messages in thread
From: KOSAKI Motohiro @ 2010-10-05  5:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Rob Mueller, linux-kernel,
	Bron Gondwana, linux-mm

> On Mon, 4 Oct 2010, KOSAKI Motohiro wrote:
> 
> > > The problem with zone reclaim mainly is created for large apps whose
> > > working set is larger than the local node. The special settings are only
> > > needing for those applications.
> >
> > In theory, yes. but please talk with userland developers. They always say
> > "Our software work fine on *BSD, Solaris, Mac, etc etc. that's definitely
> > linux problem". /me have no way to persuade them ;-)
> 
> Do those support NUMA? I would think not. You would have to switch on
> interleave at the BIOS level (getting a hardware hack in place to get
> rid of the NUMA effects) to make these OSes run right.

Sure. It wouldn't. Many opensource userland developers don't like
using out of posix API. In the other hand, many proprietery developers
don't hesitate it. I don't know reason.
Also, I'm not sure evey Corei7 Motherboard have BIOS level numa interleaving.
Are you sure? generically, commodity component vendor don't like to equipe 
additonal firmware feature. It's not zero cost. I think this solusion only fit
server vendor (e.g. IBM, HP, Fujitsu). but dunnno. Myself and fujitsu haven't hit 
this issue. I don't know _every_ motherboard equipement in the world.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers
  2010-10-04 12:45             ` KOSAKI Motohiro
  2010-10-04 13:07               ` Christoph Lameter
@ 2010-10-04 19:43               ` David Rientjes
  1 sibling, 0 replies; 31+ messages in thread
From: David Rientjes @ 2010-10-04 19:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Mel Gorman, Rob Mueller, linux-kernel,
	Bron Gondwana, linux-mm

On Mon, 4 Oct 2010, KOSAKI Motohiro wrote:

> Recently, Robert Mueller reported zone_reclaim_mode doesn't work
> properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
> He is using Cyrus IMAPd and it's built on a very traditional
> single-process model.
> 
>   * a master process which reads config files and manages the other
>     process
>   * multiple imapd processes, one per connection
>   * multiple pop3d processes, one per connection
>   * multiple lmtpd processes, one per connection
>   * periodical "cleanup" processes.
> 
> Then, there are thousands of independent processes. The problem is,
> recent Intel motherboard turn on zone_reclaim_mode by default and
> traditional prefork model software don't work fine on it.
> Unfortunatelly, Such model is still typical one even though 21th
> century. We can't ignore them.
> 
> This patch raise zone_reclaim_mode threshold to 30. 30 don't have
> specific meaning. but 20 mean one-hop QPI/Hypertransport and such
> relatively cheap 2-4 socket machine are often used for tradiotional
> server as above. The intention is, their machine don't use
> zone_reclaim_mode.
> 
> Note: ia64 and Power have arch specific RECLAIM_DISTANCE definition.
> then this patch doesn't change such high-end NUMA machine behavior.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Bron Gondwana <brong@fastmail.fm>
> Cc: Robert Mueller <robm@fastmail.fm>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Acked-by: David Rientjes <rientjes@google.com>

We already do this, but I guess it never got pushed to mainline.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-16 10:01 ` KOSAKI Motohiro
  2010-09-16 17:06   ` Christoph Lameter
  2010-09-20  9:34   ` Mel Gorman
@ 2010-09-21  1:05   ` KAMEZAWA Hiroyuki
  2010-09-27  2:04     ` KOSAKI Motohiro
  2010-09-23 11:44   ` Balbir Singh
  2010-09-30  8:38   ` Bron Gondwana
  4 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-21  1:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: robm, linux-kernel, Bron Gondwana, linux-mm, Christoph Lameter,
	Mel Gorman

On Thu, 16 Sep 2010 19:01:32 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
> current zone_reclaim_mode doesn't fit file/web server usecase ;-)
> 
> So, I've created new proof concept patch. This doesn't disable zone_reclaim
> at all. Instead, distinguish for file cache and for anon allocation and
> only file cache doesn't use zone-reclaim.
> 
> That said, high-end hpc user often turn on cpuset.memory_spread_page and
> they avoid this issue. But, why don't we consider avoid it by default?
> 
> 
> Rob, I wonder if following patch help you. Could you please try it?
> 
> 
> Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default
> 

Hm, can't we use migration of file caches rather than pageout in
zone_reclaim_mode ? Doent' it fix anything ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-21  1:05   ` Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers KAMEZAWA Hiroyuki
@ 2010-09-27  2:04     ` KOSAKI Motohiro
  2010-09-27  2:06       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2010-09-27  2:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, robm, linux-kernel, Bron Gondwana, linux-mm,
	Christoph Lameter, Mel Gorman

> On Thu, 16 Sep 2010 19:01:32 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
> > current zone_reclaim_mode doesn't fit file/web server usecase ;-)
> > 
> > So, I've created new proof concept patch. This doesn't disable zone_reclaim
> > at all. Instead, distinguish for file cache and for anon allocation and
> > only file cache doesn't use zone-reclaim.
> > 
> > That said, high-end hpc user often turn on cpuset.memory_spread_page and
> > they avoid this issue. But, why don't we consider avoid it by default?
> > 
> > 
> > Rob, I wonder if following patch help you. Could you please try it?
> > 
> > 
> > Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default
> > 
> 
> Hm, can't we use migration of file caches rather than pageout in
> zone_reclaim_mode ? Doent' it fix anything ?

Doesn't.

Two problem. 1) Migration makes copy. then it's slower than zone_reclaim=0
2) Migration is only effective if target node has much free pages. but it
is not generic assumption.

For this case, zone_reclaim_mode=0 is best. my patch works as second best.
your one works as third.

If you have more concern, please let us know it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-27  2:04     ` KOSAKI Motohiro
@ 2010-09-27  2:06       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-27  2:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: robm, linux-kernel, Bron Gondwana, linux-mm, Christoph Lameter,
	Mel Gorman

On Mon, 27 Sep 2010 11:04:54 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Thu, 16 Sep 2010 19:01:32 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
> > > current zone_reclaim_mode doesn't fit file/web server usecase ;-)
> > > 
> > > So, I've created new proof concept patch. This doesn't disable zone_reclaim
> > > at all. Instead, distinguish for file cache and for anon allocation and
> > > only file cache doesn't use zone-reclaim.
> > > 
> > > That said, high-end hpc user often turn on cpuset.memory_spread_page and
> > > they avoid this issue. But, why don't we consider avoid it by default?
> > > 
> > > 
> > > Rob, I wonder if following patch help you. Could you please try it?
> > > 
> > > 
> > > Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default
> > > 
> > 
> > Hm, can't we use migration of file caches rather than pageout in
> > zone_reclaim_mode ? Doent' it fix anything ?
> 
> Doesn't.
> 
> Two problem. 1) Migration makes copy. then it's slower than zone_reclaim=0
> 2) Migration is only effective if target node has much free pages. but it
> is not generic assumption.
> 
> For this case, zone_reclaim_mode=0 is best. my patch works as second best.
> your one works as third.
> 

Hmm. I'm not sure whether it's "slower" or not. And Migraion doesn't
assume target node because it can use zonelist fallback.

I'm just has concerns that kicked-out pages will be paged-in soon.

But ok, maybe complicated.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-16 10:01 ` KOSAKI Motohiro
                     ` (2 preceding siblings ...)
  2010-09-21  1:05   ` Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers KAMEZAWA Hiroyuki
@ 2010-09-23 11:44   ` Balbir Singh
  2010-09-30  8:38   ` Bron Gondwana
  4 siblings, 0 replies; 31+ messages in thread
From: Balbir Singh @ 2010-09-23 11:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: robm, linux-kernel, Bron Gondwana, linux-mm, Christoph Lameter,
	Mel Gorman

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2010-09-16 19:01:32]:

> +			if (!(zone_reclaim_mode & RECLAIM_CACHE) &&
> +			    (gfp_mask & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK) {
> +				inc_zone_state(zone, NR_ZONE_CACHE_AVOID);
> +				goto try_next_zone;
> +			}
> +

Interesting approach, so for page cache related applications we expect
RECLAIM_CACHE to be set and hence zone_reclaim to occur. I have
another variation, a new gfp flag called __GFP_FREE_CACHE. You can
find the patches at

http://lwn.net/Articles/391293/
http://article.gmane.org/gmane.linux.kernel.mm/49155

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers
  2010-09-16 10:01 ` KOSAKI Motohiro
                     ` (3 preceding siblings ...)
  2010-09-23 11:44   ` Balbir Singh
@ 2010-09-30  8:38   ` Bron Gondwana
  4 siblings, 0 replies; 31+ messages in thread
From: Bron Gondwana @ 2010-09-30  8:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: robm, linux-kernel, Bron Gondwana, linux-mm, Christoph Lameter,
	Mel Gorman

On Thu, Sep 16, 2010 at 07:01:32PM +0900, KOSAKI Motohiro wrote:
> Cc to linux-mm and hpc guys. and intetionally full quote.
> 
> 
> > So over the last couple of weeks, I've noticed that our shiny new IMAP
> > servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
> > been performing as well as expected, and there were some big oddities.
> > Namely two things stuck out:
> > 
> > 1. There was free memory. There's 20T of data on these machines. The
> >    kernel should have used lots of memory for caching, but for some
> >    reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G
> > 2. The machine has an SSD for very hot data. In total, there's about 16G
> >    of data on the SSD. Almost all of that 16G of data should end up
> >    being cached, so there should be little reading from the SSDs at all.
> >    Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a
> >    sign that caching wasn't working.
> > 
> > After a bunch of googling, I found this thread.
> > 
> > http://lkml.org/lkml/2009/5/12/586
> > 
> > It appears that patch never went anywhere, and zone_reclaim_mode is
> > still defaulting to 1 on our pretty standard file/email/web server type
> > machine with a NUMA kernel.
> > 
> > By changing it to 0, we saw an immediate massive change in caching
> > behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads
> > from the SSD dropped to 100/s instead of 2000/s.

Apropos to all this, look what's showed up:

http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

More fun with NUMA.  Though in the Mysql case I can see that there's no easy
answer because there really is one big process chewing most of the RAM.

The question in our case is: why isn't the kernel balancing the multiple
separate Cyrus instances across all the nodes?  And why, as one of the
comments there says, isn't swap to NUMA being considered cheaper than
swap to disk!  That's the real problem here - that Linux is considering
accessing remote RAM to be more expensive than accessing disk!

Bron.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2010-10-05  5:32 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-09-13  3:39 Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers Robert Mueller
2010-09-16 10:01 ` KOSAKI Motohiro
2010-09-16 17:06   ` Christoph Lameter
2010-09-17  0:50     ` Robert Mueller
2010-09-17  6:01       ` Shaohua Li
2010-09-17  7:32         ` Robert Mueller
2010-09-17 13:56           ` Christoph Lameter
2010-09-17 14:09             ` Bron Gondwana
2010-09-17 14:22               ` Christoph Lameter
2010-09-17 23:01                 ` Bron Gondwana
2010-09-20  9:34   ` Mel Gorman
2010-09-20 23:41     ` Default zone_reclaim_mode = 1 on NUMA kernel is bad forfile/email/web servers Rob Mueller
2010-09-21  9:04       ` Mel Gorman
2010-09-21 14:14         ` Christoph Lameter
2010-09-22  3:44           ` Rob Mueller
2010-09-27  2:01         ` KOSAKI Motohiro
2010-09-27 13:53           ` Christoph Lameter
2010-09-27 23:17             ` Robert Mueller
2010-09-28 12:35               ` Christoph Lameter
2010-09-28 12:42                 ` Bron Gondwana
2010-09-28 12:49                   ` Christoph Lameter
2010-09-30  7:05             ` Andi Kleen
2010-10-04 12:45             ` KOSAKI Motohiro
2010-10-04 13:07               ` Christoph Lameter
2010-10-05  5:32                 ` KOSAKI Motohiro
2010-10-04 19:43               ` David Rientjes
2010-09-21  1:05   ` Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers KAMEZAWA Hiroyuki
2010-09-27  2:04     ` KOSAKI Motohiro
2010-09-27  2:06       ` KAMEZAWA Hiroyuki
2010-09-23 11:44   ` Balbir Singh
2010-09-30  8:38   ` Bron Gondwana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox