Re: [PATCH] allocate page caches pages in round robin fasion

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] allocate page caches pages in round robin fasion
       [not found] ` <fa.g1i2d5e.1kgqq80@ifi.uio.no>
@ 2004-08-13 16:33   ` Ray Bryant
  0 siblings, 0 replies; 23+ messages in thread
From: Ray Bryant @ 2004-08-13 16:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jesse Barnes, Andrew Morton, Linux Kernel Mailing List, steiner

Dave Hansen wrote:
> On Thu, 2004-08-12 at 16:38, Jesse Barnes wrote:
> 
>>On a NUMA machine, page cache pages should be spread out across the system 
>>since they're generally global in nature and can eat up whole nodes worth of 
>>memory otherwise.  This can end up hurting performance since jobs will have 
>>to make off node references for much or all of their non-file data.
> 
> 
> Wouldn't this be painful for any workload that accesses a unique set of
> files on each node?  If an application knows that it is touching truly
> shared data which every node could possibly access, then they can use
> the NUMA API to cause round-robin allocations to occur.  
> 

I suppose it is possible for some workloads to be able to tell the difference 
between a locally and globally allocated page cache page.  It all depends on 
the rate of access of data pages versus page cache pages.

For workloads that read in some data, then process that data for a very long 
time (e. g. typical HPC workloads), it is more important to make sure those 
data pages are allocated locally, and the page cache pages are touched much 
less frequently, so making them globally round-robin'd is a marginal 
performance hit.  The problem we are trying to avoid here is to make sure the 
node doesn't fill up with page cache pages, resulting in non-local allocations 
for those data pages, which is not a good thing [tm].

On the other hand, if your workload spends most of its time writing buffered 
file I/O to a set of pages that will comfotably fit on node, then it is 
important to have the page cache pages allocated locally.  So I can see the 
need for some program control of placement.

However, using the NUMA API to cause round-robin allocations to occur would 
use the process level policy, right?  So the same decision will be made on how 
to allocate data pages and page cache pages?  Might it not be possible that an 
application would like its page cache pages allocated globally round-robin, 
but it still wants its data pages allocated via MPOL_DEFAULT?

Perhaps what is needed is the ability to associate a mem_policy with the page 
cache allocation (or, perhaps, more generally, a default "kernel storage 
allocation policy" for storage that the kernel allocates on behalf of a 
process).  System admins could set the default according to overall workload 
considerations, and, perhaps, we would allow processes with sufficient 
priviledge to set their own policy.

> Maybe a per-node watermark on page cache usage would be more useful. 
> Once a node starts to get full, and it's past the watermark, we can go
> and shoot down some of the node's page cache.  If the data access is
> truly global, then it has a good chance of being brought in on a
> different node.

I think this could be inefficient if the file access is truly global and the 
file is large.  (Think of a file that is significantly larger than the local 
memory of any node.)  Pages would be pulled into each node in turn as they are 
accessed, then discarded as they go over the watermark, to be pulled in on 
another node, etc.  It would be better in this case just to round robin the 
allocation on first access and be done with it.

> 
> -- Dave
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <fa.hmrqqf6.ckie1e@ifi.uio.no>]

[parent not found: <fa.cg3cafa.ngi9og@ifi.uio.no>]

* Re: [PATCH] allocate page caches pages in round robin fasion
       [not found] ` <fa.cg3cafa.ngi9og@ifi.uio.no>
@ 2004-08-13 17:31   ` Ray Bryant
  0 siblings, 0 replies; 23+ messages in thread
From: Ray Bryant @ 2004-08-13 17:31 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Jesse Barnes, akpm, linux-kernel, steiner

Hi Martin,

Martin J. Bligh wrote:
<snip>

> 
> Does that actually happen though? Looking at the current code makes me think
> it'll keep some pages free on all nodes at all times, and if kswapd does
> it's job, we'll never fall back across nodes. Now ... I think that's broken,
> but I think that's what currently happens - that was what we discussed at
> KS ... I might be misreading it though, I should test it.
> 
> Even if that's not true, allocating all your most recent stuff off-node is
> still crap (so either way, I'd agree the current situation is broken), but
> I don't think the solution is to push ALL your accesses (with n-1/n probability)
> off-node ... we need to be more careful than that ...
>

I think you're missing out on the typical workload situation where we run into 
this problem.  Just to make things a bit more specific, lets assume we are on 
a 128 node (256 P) system, with 4 GB per node.  Let's assume that we have a 
100 GB data file that we access periodically during the run, accesses to that 
data file are done in random access fashion from each node.

The program starts out by reading in the data file, then forks off 256 copies 
of itself, and allocates 1 GB per CPU of local storage via MPOL_DEFAULT.  All 
of those pages had better be in local or the computation will be unbalanced 
and run as slowly as the slowest node.

As I read the __alloc_pages() code, those 100 GB of data pages will be 
allocated on the node that did the file read; when that node fills up, we will 
spill the allocation to adjacent nodes (this is the first loop of 
__alloc_pages(), kswapd doesn't get invoked until that first loop fails).

(kswapd() doesn't get invoked until all of the zones in the zonelist are full.
All of memory is in that zonelist, unless we have cpusets enabled. So the 
priority is to spill off node first and then swap() second.)

Now the application starts allocating its 2 GB of local, and the nodes where 
the page cache was allocated all get non-local pages allocated.  (Once again, 
this happens in the first loop of __alloc_pages().)  The ratio of accesses to 
local data pages versus access to remote page cache pages is unfavorable for 
local page cache allocation, since the page cache pages are accessed at a tiny 
fraction of the rate of the data pages.

Now I suppose you could argue that the application should fork first and then 
read in 1/256th of the data on each cpu.  The problems with this, in general, 
are twofold:

(1)  It could have been a simple "cp" in a startup script that did the read..
      We can't fix all of those things as well.
(2)  The application may be an ISV's program that is not NUMA aware.  We can
      fix most of that by wrappering the program with control scripts, but
      requiring the ISV to build a specific NUMA aware version of the binary
      for Altix is oftentimes not feasible.  (And because the allocation
      policy is MPOL_DEFAULT, the application doesn't have to have NUMA API
      calls imbedded in the program.)

> 
>>>If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
>>>will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is
>>>needed is to punt old pages back off onto another node, rather than
>>>swapping them out. That way all your pages are going to be local.
>>

Surely you can't be suggesting that I migrate a page cache page to a local 
node just to read it?  If file accesses are random and global, won't you end 
up just bouncing page cache pages hither and yon?  Surely it is better just to 
copy the data remotely to the current node and leave it where it is?

(YMMV -- all of these tradeoffs are clearly workload dependent.)

>>That gets complicated pretty quickly I think.  We don't want to constantly 
>>shuffle pages between nodes with kswapd, and there's also the problem of 
>>deciding when to do it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <2sxuC-429-3@gated-at.bofh.it>]

* Re: [PATCH] allocate page caches pages in round robin fasion
       [not found] <2sxuC-429-3@gated-at.bofh.it>
@ 2004-08-13  1:14 ` Andi Kleen
  2004-08-13  1:26   ` William Lee Irwin III
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andi Kleen @ 2004-08-13  1:14 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-kernel

Jesse Barnes <jbarnes@engr.sgi.com> writes:

> On a NUMA machine, page cache pages should be spread out across the system 
> since they're generally global in nature and can eat up whole nodes worth of 
> memory otherwise.  This can end up hurting performance since jobs will have 
> to make off node references for much or all of their non-file data.
>
> The patch works by adding an alloc_page_round_robin routine that simply 
> allocates on successive nodes each time its called, based on the value of a 
> per-cpu variable modulo the number of nodes.  The variable is per-cpu to 
> avoid cacheline contention when many cpus try to do page cache allocations at 

I don't like this approach using a dynamic counter. I think it would
be better to add a new function that takes the vma and uses the offset
into the inode for static interleaving (anonymous memory would still
use the vma offset). This way you would have a good guarantee that the
interleaving stays interleaved even when the system swaps pages in and
out and you're less likely to get anomalies in the page distribution.

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13  1:14 ` Andi Kleen
@ 2004-08-13  1:26   ` William Lee Irwin III
  2004-08-13  1:29   ` Jesse Barnes
  2004-08-13 16:04   ` Jesse Barnes
  2 siblings, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2004-08-13  1:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jesse Barnes, linux-kernel

Jesse Barnes <jbarnes@engr.sgi.com> writes:
>> The patch works by adding an alloc_page_round_robin routine that
>> simply allocates on successive nodes each time its called, based on
>> the value of a per-cpu variable modulo the number of nodes.  The
>> variable is per-cpu to avoid cacheline contention when many cpus try
>> to do page cache allocations at 

On Fri, Aug 13, 2004 at 03:14:46AM +0200, Andi Kleen wrote:
> I don't like this approach using a dynamic counter. I think it would
> be better to add a new function that takes the vma and uses the offset
> into the inode for static interleaving (anonymous memory would still
> use the vma offset). This way you would have a good guarantee that the
> interleaving stays interleaved even when the system swaps pages in and
> out and you're less likely to get anomalies in the page distribution.

If we're going to go that far, why not use a better coloring algorithm?
IIRC linear_page_index(vma, vaddr) % MAX_NR_NODES has issues with
various semiregular access patterns where others do not (most are
relatively simple hash functions). This reminds me that getting the vma
and vaddr accessible to the allocator helps with normal page coloring.


-- wli

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13  1:14 ` Andi Kleen
  2004-08-13  1:26   ` William Lee Irwin III
@ 2004-08-13  1:29   ` Jesse Barnes
  2004-08-13 16:04   ` Jesse Barnes
  2 siblings, 0 replies; 23+ messages in thread
From: Jesse Barnes @ 2004-08-13  1:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thursday, August 12, 2004 6:14 pm, Andi Kleen wrote:
> Jesse Barnes <jbarnes@engr.sgi.com> writes:
> > On a NUMA machine, page cache pages should be spread out across the
> > system since they're generally global in nature and can eat up whole
> > nodes worth of memory otherwise.  This can end up hurting performance
> > since jobs will have to make off node references for much or all of their
> > non-file data.
> >
> > The patch works by adding an alloc_page_round_robin routine that simply
> > allocates on successive nodes each time its called, based on the value of
> > a per-cpu variable modulo the number of nodes.  The variable is per-cpu
> > to avoid cacheline contention when many cpus try to do page cache
> > allocations at
>
> I don't like this approach using a dynamic counter. I think it would
> be better to add a new function that takes the vma and uses the offset
> into the inode for static interleaving (anonymous memory would still
> use the vma offset). This way you would have a good guarantee that the
> interleaving stays interleaved even when the system swaps pages in and
> out and you're less likely to get anomalies in the page distribution.

Well, that's one reason I didn't add an alloc_pages routine, but just a single 
page distributor.  However, a multipage round robin routine would be useful 
in other cases, like for the slab allocator.

Jesse

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13  1:14 ` Andi Kleen
  2004-08-13  1:26   ` William Lee Irwin III
  2004-08-13  1:29   ` Jesse Barnes
@ 2004-08-13 16:04   ` Jesse Barnes
  2004-08-13 17:31     ` Brent Casavant
  2 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-08-13 16:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thursday, August 12, 2004 6:14 pm, Andi Kleen wrote:
> I don't like this approach using a dynamic counter. I think it would
> be better to add a new function that takes the vma and uses the offset
> into the inode for static interleaving (anonymous memory would still
> use the vma offset). This way you would have a good guarantee that the
> interleaving stays interleaved even when the system swaps pages in and
> out and you're less likely to get anomalies in the page distribution.

That sounds like a good approach, care to show me exactly what you mean with a 
patch? :)

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 16:04   ` Jesse Barnes
@ 2004-08-13 17:31     ` Brent Casavant
  2004-08-13 20:16       ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Brent Casavant @ 2004-08-13 17:31 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Andi Kleen, linux-kernel

On Fri, 13 Aug 2004, Jesse Barnes wrote:

> On Thursday, August 12, 2004 6:14 pm, Andi Kleen wrote:
> > I don't like this approach using a dynamic counter. I think it would
> > be better to add a new function that takes the vma and uses the offset
> > into the inode for static interleaving (anonymous memory would still
> > use the vma offset). This way you would have a good guarantee that the
> > interleaving stays interleaved even when the system swaps pages in and
> > out and you're less likely to get anomalies in the page distribution.
>
> That sounds like a good approach, care to show me exactly what you mean
> with a
> patch? :)

Make sure to have some sort of offset for the static interleaving that
is random or semi-random for each inode, as we discussed on linux-mm last
week.  Otherwise you'll end up with pages for small files clustered on
a few nodes, as the offset into the inode will never exceed some small
value.

And even though I know Andi doesn't care for a dynamic counter, this
last week I implemented a new MPOL_ROUNDROBIN using just such a counter.
The only significant difference from MPOL_INTERLEAVE is the use of a
per-task dynamic counter rather than an index based on offsets into a
vma and/or inode.  It seems to work quite well in my testing.  My next
step was going to be using this policy for tmpfs file allocations by
default (overridable by mbind()).  Does anyone think this idea has
merit?  If so I'll clean up the MPOL_ROUNDROBiN patch and post it,
then follow it by a tmpfs patch sometime soon (I'm just beginning
work on it).

Thanks,
Brent

-- 
Brent Casavant             bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer  http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.     44.8562N 93.1355W 860F  eyed and bushy-haired.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 17:31     ` Brent Casavant
@ 2004-08-13 20:16       ` Andi Kleen
  0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2004-08-13 20:16 UTC (permalink / raw)
  To: Brent Casavant; +Cc: Jesse Barnes, linux-kernel

On Fri, Aug 13, 2004 at 12:31:41PM -0500, Brent Casavant wrote:
> On Fri, 13 Aug 2004, Jesse Barnes wrote:
> 
> > On Thursday, August 12, 2004 6:14 pm, Andi Kleen wrote:
> > > I don't like this approach using a dynamic counter. I think it would
> > > be better to add a new function that takes the vma and uses the offset
> > > into the inode for static interleaving (anonymous memory would still
> > > use the vma offset). This way you would have a good guarantee that the
> > > interleaving stays interleaved even when the system swaps pages in and
> > > out and you're less likely to get anomalies in the page distribution.
> >
> > That sounds like a good approach, care to show me exactly what you mean
> > with a
> > patch? :)

I put it on my todo list. 

> 
> Make sure to have some sort of offset for the static interleaving that
> is random or semi-random for each inode, as we discussed on linux-mm last

inode number (possible with some bits of dev_t, although that is probably
not needed) 

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] allocate page caches pages in round robin fasion
@ 2004-08-12 23:46 Jesse Barnes
  2004-08-13  0:13 ` William Lee Irwin III
  2004-08-13 14:50 ` Martin J. Bligh
  0 siblings, 2 replies; 23+ messages in thread
From: Jesse Barnes @ 2004-08-12 23:46 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: steiner

[-- Attachment #1: Type: text/plain, Size: 1328 bytes --]

[ugg, attach the patch this time]

On a NUMA machine, page cache pages should be spread out across the system 
since they're generally global in nature and can eat up whole nodes worth of 
memory otherwise.  This can end up hurting performance since jobs will have 
to make off node references for much or all of their non-file data.

The patch works by adding an alloc_page_round_robin routine that simply 
allocates on successive nodes each time its called, based on the value of a 
per-cpu variable modulo the number of nodes.  The variable is per-cpu to 
avoid cacheline contention when many cpus try to do page cache allocations at 
once.

After dd if=/dev/zero of=/tmp/bigfile bs=1G count=2 on a stock kernel:
Node 7 MemUsed:         49248 kB
Node 6 MemUsed:         42176 kB
Node 5 MemUsed:        316880 kB
Node 4 MemUsed:         36160 kB
Node 3 MemUsed:         45152 kB
Node 2 MemUsed:         50000 kB
Node 1 MemUsed:         68704 kB
Node 0 MemUsed:       2426256 kB

and after the patch:
Node 7 MemUsed:        328608 kB
Node 6 MemUsed:        319424 kB
Node 5 MemUsed:        318608 kB
Node 4 MemUsed:        321600 kB
Node 3 MemUsed:        319648 kB
Node 2 MemUsed:        327504 kB
Node 1 MemUsed:        389504 kB
Node 0 MemUsed:        744752 kB

Signed-off-by: Jesse Barnes <jbarnes@sgi.com>

Thanks,
Jesse

[-- Attachment #2: page-cache-round-robin-3.patch --]
[-- Type: text/x-diff, Size: 2288 bytes --]

===== include/linux/gfp.h 1.18 vs edited =====
--- 1.18/include/linux/gfp.h	2004-05-22 14:56:25 -07:00
+++ edited/include/linux/gfp.h	2004-08-12 16:27:01 -07:00
@@ -86,6 +86,8 @@
 		NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK));
 }
 
+extern struct page *alloc_page_round_robin(unsigned int gfp_mask);
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
 
===== include/linux/pagemap.h 1.43 vs edited =====
--- 1.43/include/linux/pagemap.h	2004-06-24 01:55:57 -07:00
+++ edited/include/linux/pagemap.h	2004-08-12 14:37:36 -07:00
@@ -52,12 +52,12 @@
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return alloc_pages(mapping_gfp_mask(x), 0);
+	return alloc_page_round_robin(mapping_gfp_mask(x));
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
 {
-	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+	return alloc_page_round_robin(mapping_gfp_mask(x)|__GFP_COLD);
 }
 
 typedef int filler_t(void *, struct page *);
===== mm/page_alloc.c 1.224 vs edited =====
--- 1.224/mm/page_alloc.c	2004-08-07 23:43:41 -07:00
+++ edited/mm/page_alloc.c	2004-08-12 16:27:43 -07:00
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/percpu.h>
 
 #include <asm/tlbflush.h>
 
@@ -41,6 +42,7 @@
 long nr_swap_pages;
 int numnodes = 1;
 int sysctl_lower_zone_protection = 0;
+static DEFINE_PER_CPU(int, next_rr_node);
 
 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -577,6 +579,23 @@
 	}
 	return page;
 }
+
+/**
+ * alloc_page_round_robin - distribute pages across nodes
+ * @gfp_mask: GFP_* flags
+ *
+ * alloc_page_round_robin() will simply allocate from a different node
+ * than was allocated from in the last call using the next_rr_node variable.
+ * We use __get_cpu_var since we don't care about disabling preemption (we're
+ * using a mod function so nid will always be less than numnodes).  A per-cpu
+ * variable will make round robin allocations scale a bit better.
+ */
+struct page *alloc_page_round_robin(unsigned int gfp_mask)
+{
+	return alloc_pages_node(__get_cpu_var(next_rr_node)++ % numnodes,
+				gfp_mask, 0);
+}
+
 
 /*
  * This is the 'heart' of the zoned buddy allocator.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-12 23:46 Jesse Barnes
@ 2004-08-13  0:13 ` William Lee Irwin III
  2004-08-13  0:25   ` Jesse Barnes
  2004-08-13 14:50 ` Martin J. Bligh
  1 sibling, 1 reply; 23+ messages in thread
From: William Lee Irwin III @ 2004-08-13  0:13 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: akpm, linux-kernel, steiner

On Thu, Aug 12, 2004 at 04:46:50PM -0700, Jesse Barnes wrote:
> +struct page *alloc_page_round_robin(unsigned int gfp_mask)
> +{
> +	return alloc_pages_node(__get_cpu_var(next_rr_node)++ % numnodes,
> +				gfp_mask, 0);
> +}
> +

Interesting. This may attempt to allocate from offlined nodes, assuming
one adds on sufficient hotplug bits atop mainline and/or -mm. The
following almost does it hotplug-safe except that it needs to enter the
allocator with preemption disabled and drop the preempt_count
internally to it.

static struct page *alloc_page_round_robin(unsigned gfp_mask)
{
	int nid, next_nid, *rr_node = &__get_cpu_var(next_rr_node);

	nid = *rr_node;
	next_nid = next_node(nid, node_online_map);
	if (next_nid >= MAX_NR_NODES)
		*rr_node = first_node(node_online_map);
	else
		*rr_node = next_nid;
	return alloc_pages_node(nid, gfp_mask, 0);
}

I suspect we are better off punting this in the direction of hotplug
people than trying to address it ourselves. I think we should go with
this now, as the node hotplug bits are yet to hit the tree.

-- wli

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13  0:13 ` William Lee Irwin III
@ 2004-08-13  0:25   ` Jesse Barnes
  2004-08-13  0:32     ` William Lee Irwin III
  0 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-08-13  0:25 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: akpm, linux-kernel, steiner

On Thursday, August 12, 2004 5:13 pm, William Lee Irwin III wrote:
> Interesting. This may attempt to allocate from offlined nodes, assuming
> one adds on sufficient hotplug bits atop mainline and/or -mm. The
> following almost does it hotplug-safe except that it needs to enter the
> allocator with preemption disabled and drop the preempt_count
> internally to it.

Can we make alloc_pages_node take offline nodes instead?  Maybe it could just 
allocate from the next nearest node or something?

> I suspect we are better off punting this in the direction of hotplug
> people than trying to address it ourselves. I think we should go with
> this now, as the node hotplug bits are yet to hit the tree.

Yeah, agreed.

Jesse

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13  0:25   ` Jesse Barnes
@ 2004-08-13  0:32     ` William Lee Irwin III
  0 siblings, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2004-08-13  0:32 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: akpm, linux-kernel, steiner

On Thursday, August 12, 2004 5:13 pm, William Lee Irwin III wrote:
>> Interesting. This may attempt to allocate from offlined nodes, assuming
>> one adds on sufficient hotplug bits atop mainline and/or -mm. The
>> following almost does it hotplug-safe except that it needs to enter the
>> allocator with preemption disabled and drop the preempt_count
>> internally to it.

On Thu, Aug 12, 2004 at 05:25:15PM -0700, Jesse Barnes wrote:
> Can we make alloc_pages_node take offline nodes instead?  Maybe it could just 
> allocate from the next nearest node or something?

I don't think we should do anything but point it out for those writing
the hotplug patches to look for. There are enough interactions in
general we're not even looking for that making the whole tree
hotplug-safe is hopeless unless the hotplug ppl get involved with e.g.
more complete patches, being able to actually test things, etc.


On Thursday, August 12, 2004 5:13 pm, William Lee Irwin III wrote:
>> I suspect we are better off punting this in the direction of hotplug
>> people than trying to address it ourselves. I think we should go with
>> this now, as the node hotplug bits are yet to hit the tree.

On Thu, Aug 12, 2004 at 05:25:15PM -0700, Jesse Barnes wrote:
> Yeah, agreed.

Yes... their patch, their implementation burden.


-- wli

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-12 23:46 Jesse Barnes
  2004-08-13  0:13 ` William Lee Irwin III
@ 2004-08-13 14:50 ` Martin J. Bligh
  2004-08-13 15:59   ` Jesse Barnes
  1 sibling, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2004-08-13 14:50 UTC (permalink / raw)
  To: Jesse Barnes, akpm, linux-kernel; +Cc: steiner

> On a NUMA machine, page cache pages should be spread out across the system 
> since they're generally global in nature and can eat up whole nodes worth of 
> memory otherwise.  This can end up hurting performance since jobs will have 
> to make off node references for much or all of their non-file data.
> 
> The patch works by adding an alloc_page_round_robin routine that simply 
> allocates on successive nodes each time its called, based on the value of a 
> per-cpu variable modulo the number of nodes.  The variable is per-cpu to 
> avoid cacheline contention when many cpus try to do page cache allocations at 
> once.

I really don't think this is a good idea - you're assuming there's really
no locality of reference, which I don't think is at all true in most cases.

If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is
needed is to punt old pages back off onto another node, rather than swapping
them out. That way all your pages are going to be local.

> After dd if=/dev/zero of=/tmp/bigfile bs=1G count=2 on a stock kernel:
> Node 7 MemUsed:         49248 kB
> Node 6 MemUsed:         42176 kB
> Node 5 MemUsed:        316880 kB
> Node 4 MemUsed:         36160 kB
> Node 3 MemUsed:         45152 kB
> Node 2 MemUsed:         50000 kB
> Node 1 MemUsed:         68704 kB
> Node 0 MemUsed:       2426256 kB
> 
> and after the patch:
> Node 7 MemUsed:        328608 kB
> Node 6 MemUsed:        319424 kB
> Node 5 MemUsed:        318608 kB
> Node 4 MemUsed:        321600 kB
> Node 3 MemUsed:        319648 kB
> Node 2 MemUsed:        327504 kB
> Node 1 MemUsed:        389504 kB
> Node 0 MemUsed:        744752 kB

OK, so it obviously does something ... but is the dd actually faster? 
I'd think it's slower ...

M.

PS. I think it prob makes sense to make the *receiving* node's kswapd
do the transfer if possible for work distribution reasons ... however
I'm pretty sure there are some locking assumptions that mean this won't
be easy / possible.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 14:50 ` Martin J. Bligh
@ 2004-08-13 15:59   ` Jesse Barnes
  2004-08-13 16:20     ` Martin J. Bligh
  0 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-08-13 15:59 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: akpm, linux-kernel, steiner

On Friday, August 13, 2004 7:50 am, Martin J. Bligh wrote:
> I really don't think this is a good idea - you're assuming there's really
> no locality of reference, which I don't think is at all true in most cases.

No, not at all, just that locality of reference matters more for stack and 
anonymous pages than it does for page cache pages.  I.e. we don't want a node 
to be filled up with page cache pages causing all other memory references 
from the process to be off node.

> If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
> will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is
> needed is to punt old pages back off onto another node, rather than
> swapping them out. That way all your pages are going to be local.

That gets complicated pretty quickly I think.  We don't want to constantly 
shuffle pages between nodes with kswapd, and there's also the problem of 
deciding when to do it.

> OK, so it obviously does something ... but is the dd actually faster?
> I'd think it's slower ...

Sure, it's probably a tiny bit slower, but assume that dd actually had some 
compute work to do after it read in a file (like an encoder or fp app), w/o 
the patch, most of it's time critical references would be off node.  The 
important thing is to get the file data in memory, since that'll be way 
faster than reading from disk, but it doesn't really matter *where* it is in 
memory.  Especially since we want an app's code and data to be node local.

Jesse

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 15:59   ` Jesse Barnes
@ 2004-08-13 16:20     ` Martin J. Bligh
  2004-08-13 16:34       ` Jesse Barnes
  0 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2004-08-13 16:20 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: akpm, linux-kernel, steiner

>> I really don't think this is a good idea - you're assuming there's really
>> no locality of reference, which I don't think is at all true in most cases.
> 
> No, not at all, just that locality of reference matters more for stack and 
> anonymous pages than it does for page cache pages.  I.e. we don't want a node 
> to be filled up with page cache pages causing all other memory references 
> from the process to be off node.

Does that actually happen though? Looking at the current code makes me think
it'll keep some pages free on all nodes at all times, and if kswapd does
it's job, we'll never fall back across nodes. Now ... I think that's broken,
but I think that's what currently happens - that was what we discussed at
KS ... I might be misreading it though, I should test it.

Even if that's not true, allocating all your most recent stuff off-node is
still crap (so either way, I'd agree the current situation is broken), but
I don't think the solution is to push ALL your accesses (with n-1/n probability)
off-node ... we need to be more careful than that ...

>> If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
>> will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is
>> needed is to punt old pages back off onto another node, rather than
>> swapping them out. That way all your pages are going to be local.
> 
> That gets complicated pretty quickly I think.  We don't want to constantly 
> shuffle pages between nodes with kswapd, and there's also the problem of 
> deciding when to do it.

When the other zone is above the high watermark (or even some newer watermark,
ie it has lots of free pages). If we have any sort of locality of reference,
we want the most recently used pages on-node, and less recently ones off-node,
and then the least-recently used on disk.

>> OK, so it obviously does something ... but is the dd actually faster?
>> I'd think it's slower ...
> 
> Sure, it's probably a tiny bit slower, but assume that dd actually had some 
> compute work to do after it read in a file (like an encoder or fp app), w/o 
> the patch, most of it's time critical references would be off node.  The 
> important thing is to get the file data in memory, since that'll be way 
> faster than reading from disk, but it doesn't really matter *where* it is in 
> memory.  Especially since we want an app's code and data to be node local.

Not sure I'd agree with that - it's the same problem as swappiness on a global
basis for non-NUMA machines. We want the pages we're using MOST to be local,
the others to be not-local, and that doesn't equate (necessarily) to whether
it's pagecache or not. Shared pages could indeed be dealt with differently,
and spread more global ... but I don't agree that pagecache pages equate 1-1
with being globally shared - in fact, I think most often the opposite is true.

So when we're making such decisions, we need to consider the usage of a page
in more careful of a way than whether it's pagecache or not. page->mapcount
would be a better start, I think, than simply pagecache or not. Spreading
pagecache around will just make the most common case slower, I think.

I understand the desire to make node mem pressures equalize somewhat - having
dynamically altered fallback schemes for NUMA nodes to the node with the
most free mem, etc would help ... but we need to be really careful what
we keep local, and what we push remote. Most apps seem to have fairly
good locality

M.

PS. The obvious exceptions to these rules are shmem and shared libs ... 
shmem should probably go round-robin amongst its users nodes by default,
and shared libs replicate ... I'll look at fixing up shmem at least.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 16:20     ` Martin J. Bligh
@ 2004-08-13 16:34       ` Jesse Barnes
  2004-08-13 16:47         ` Martin J. Bligh
  0 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-08-13 16:34 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: akpm, linux-kernel, steiner

On Friday, August 13, 2004 9:20 am, Martin J. Bligh wrote:
> >> I really don't think this is a good idea - you're assuming there's
> >> really no locality of reference, which I don't think is at all true in
> >> most cases.
> >
> > No, not at all, just that locality of reference matters more for stack
> > and anonymous pages than it does for page cache pages.  I.e. we don't
> > want a node to be filled up with page cache pages causing all other
> > memory references from the process to be off node.
>
> Does that actually happen though? Looking at the current code makes me
> think it'll keep some pages free on all nodes at all times, and if kswapd
> does it's job, we'll never fall back across nodes. Now ... I think that's
> broken, but I think that's what currently happens - that was what we
> discussed at KS ... I might be misreading it though, I should test it.

Not nearly enough pages for any sizeable app though.  Maybe the behavior could 
be configurable?

> Even if that's not true, allocating all your most recent stuff off-node is
> still crap (so either way, I'd agree the current situation is broken), but
> I don't think the solution is to push ALL your accesses (with n-1/n
> probability) off-node ... we need to be more careful than that ...

Only page cache references...

> Not sure I'd agree with that - it's the same problem as swappiness on a
> global basis for non-NUMA machines. We want the pages we're using MOST to
> be local, the others to be not-local, and that doesn't equate (necessarily)
> to whether it's pagecache or not. Shared pages could indeed be dealt with
> differently, and spread more global ... but I don't agree that pagecache
> pages equate 1-1 with being globally shared - in fact, I think most often
> the opposite is true.

Yeah, that's a good point.  That argues for configurability too.  We should 
behave differently depending on whether the page is shared or not.

> PS. The obvious exceptions to these rules are shmem and shared libs ...
> shmem should probably go round-robin amongst its users nodes by default,
> and shared libs replicate ... I'll look at fixing up shmem at least.

Cool, that would be good.

Jesse

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 16:34       ` Jesse Barnes
@ 2004-08-13 16:47         ` Martin J. Bligh
  2004-08-13 17:31           ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2004-08-13 16:47 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: akpm, linux-kernel, steiner

--Jesse Barnes <jbarnes@engr.sgi.com> wrote (on Friday, August 13, 2004 09:34:20 -0700):

> On Friday, August 13, 2004 9:20 am, Martin J. Bligh wrote:
>> >> I really don't think this is a good idea - you're assuming there's
>> >> really no locality of reference, which I don't think is at all true in
>> >> most cases.
>> > 
>> > No, not at all, just that locality of reference matters more for stack
>> > and anonymous pages than it does for page cache pages.  I.e. we don't
>> > want a node to be filled up with page cache pages causing all other
>> > memory references from the process to be off node.
>> 
>> Does that actually happen though? Looking at the current code makes me
>> think it'll keep some pages free on all nodes at all times, and if kswapd
>> does it's job, we'll never fall back across nodes. Now ... I think that's
>> broken, but I think that's what currently happens - that was what we
>> discussed at KS ... I might be misreading it though, I should test it.
> 
> Not nearly enough pages for any sizeable app though.  Maybe the behavior could 
> be configurable?

Well, either we're:

1. Falling back and putting all our most recent accesses off-node.

or.

2. Not falling back and only able to use one node's memory for any one 
(single threaded) app.

Either situation is crap, though I'm not sure which turd we picked right
now ... I'd have to look at the code again ;-) I thought it was 2, but
I might be wrong.
 
>> Even if that's not true, allocating all your most recent stuff off-node is
>> still crap (so either way, I'd agree the current situation is broken), but
>> I don't think the solution is to push ALL your accesses (with n-1/n
>> probability) off-node ... we need to be more careful than that ...
> 
> Only page cache references...

Yeah, depends how important those are to the app though ;-) I absolutely
agree with you that the current situation is broken ... we need to do 
*something*.

>> Not sure I'd agree with that - it's the same problem as swappiness on a
>> global basis for non-NUMA machines. We want the pages we're using MOST to
>> be local, the others to be not-local, and that doesn't equate (necessarily)
>> to whether it's pagecache or not. Shared pages could indeed be dealt with
>> differently, and spread more global ... but I don't agree that pagecache
>> pages equate 1-1 with being globally shared - in fact, I think most often
>> the opposite is true.
> 
> Yeah, that's a good point.  That argues for configurability too.  We should 
> behave differently depending on whether the page is shared or not.

Right. An app that mmap'ed a big file then thrashed on it would be a good
example, though simple read-write heavily across a small fileset would do 
much same thing.

M.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 16:47         ` Martin J. Bligh
@ 2004-08-13 17:31           ` Nick Piggin
  2004-08-13 21:16             ` Martin J. Bligh
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2004-08-13 17:31 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Jesse Barnes, akpm, linux-kernel, steiner

Martin J. Bligh wrote:
> --Jesse Barnes <jbarnes@engr.sgi.com> wrote (on Friday, August 13, 2004 09:34:20 -0700):
> 
> 
>>On Friday, August 13, 2004 9:20 am, Martin J. Bligh wrote:
>>
>>>>>I really don't think this is a good idea - you're assuming there's
>>>>>really no locality of reference, which I don't think is at all true in
>>>>>most cases.
>>>>
>>>>No, not at all, just that locality of reference matters more for stack
>>>>and anonymous pages than it does for page cache pages.  I.e. we don't
>>>>want a node to be filled up with page cache pages causing all other
>>>>memory references from the process to be off node.
>>>
>>>Does that actually happen though? Looking at the current code makes me
>>>think it'll keep some pages free on all nodes at all times, and if kswapd
>>>does it's job, we'll never fall back across nodes. Now ... I think that's
>>>broken, but I think that's what currently happens - that was what we
>>>discussed at KS ... I might be misreading it though, I should test it.
>>
>>Not nearly enough pages for any sizeable app though.  Maybe the behavior could 
>>be configurable?
> 
> 
> Well, either we're:
> 
> 1. Falling back and putting all our most recent accesses off-node.
> 
> or.
> 
> 2. Not falling back and only able to use one node's memory for any one 
> (single threaded) app.
> 
> Either situation is crap, though I'm not sure which turd we picked right
> now ... I'd have to look at the code again ;-) I thought it was 2, but
> I might be wrong.
>  

I'm looking at this now. We are doing 1 currently.

There are a couple of issues. The first is that you need to minimise
regressions for when working set size is bigger than the local node.
The second is a fundamental tradeoff where purely FIFO touch patterns
will never be allowed to expand past local node memory for any sane
and simple (ie 2.6 worthy) scanning algorithms that I can think of.

I have a patch going now that just reclaims use-once file cache before
going off node. Seems to help a bit for basic things that just push
pagecache through the system. It definitely reduces remote allocations
by several orders of magnitude for those cases.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 17:31           ` Nick Piggin
@ 2004-08-13 21:16             ` Martin J. Bligh
  2004-08-13 22:59               ` Martin J. Bligh
  2004-08-14  1:21               ` Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Martin J. Bligh @ 2004-08-13 21:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jesse Barnes, akpm, linux-kernel, steiner

>> Well, either we're:
>> 
>> 1. Falling back and putting all our most recent accesses off-node.
>> 
>> or.
>> 
>> 2. Not falling back and only able to use one node's memory for any one 
>> (single threaded) app.
>> 
>> Either situation is crap, though I'm not sure which turd we picked right
>> now ... I'd have to look at the code again ;-) I thought it was 2, but
>> I might be wrong.
>>  
> 
> I'm looking at this now. We are doing 1 currently.

In theory, yes. In practice, I have a feeling kswapd will keep us above
the level of free memory where we'd fall back to another zone to allocate,
won't it?
 
> There are a couple of issues. The first is that you need to minimise
> regressions for when working set size is bigger than the local node.

Good point ... that is, indeed, a total bitch to fix.

> I have a patch going now that just reclaims use-once file cache before
> going off node. Seems to help a bit for basic things that just push
> pagecache through the system. It definitely reduces remote allocations
> by several orders of magnitude for those cases.

Makes sense, but doesn't the same thing make sense on a global basis?
I don't feel NUMA is anything magical here ...

M.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 21:16             ` Martin J. Bligh
@ 2004-08-13 22:59               ` Martin J. Bligh
  2004-08-14  1:21               ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Martin J. Bligh @ 2004-08-13 22:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jesse Barnes, akpm, linux-kernel, steiner

>>> Well, either we're:
>>> 
>>> 1. Falling back and putting all our most recent accesses off-node.
>>> 
>>> or.
>>> 
>>> 2. Not falling back and only able to use one node's memory for any one 
>>> (single threaded) app.
>>> 
>>> Either situation is crap, though I'm not sure which turd we picked right
>>> now ... I'd have to look at the code again ;-) I thought it was 2, but
>>> I might be wrong.
>>>  
>> 
>> I'm looking at this now. We are doing 1 currently.
> 
> In theory, yes. In practice, I have a feeling kswapd will keep us above
> the level of free memory where we'd fall back to another zone to allocate,
> won't it?

Nope - tested it. Buggered if I can see how that works, but it does ;-)

M.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-13 21:16             ` Martin J. Bligh
  2004-08-13 22:59               ` Martin J. Bligh
@ 2004-08-14  1:21               ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2004-08-14  1:21 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Jesse Barnes, akpm, linux-kernel, steiner

Martin J. Bligh wrote:
>>>Well, either we're:
>>>
>>>1. Falling back and putting all our most recent accesses off-node.
>>>
>>>or.
>>>
>>>2. Not falling back and only able to use one node's memory for any one 
>>>(single threaded) app.
>>>
>>>Either situation is crap, though I'm not sure which turd we picked right
>>>now ... I'd have to look at the code again ;-) I thought it was 2, but
>>>I might be wrong.
>>> 
>>
>>I'm looking at this now. We are doing 1 currently.
> 
> 
> In theory, yes. In practice, I have a feeling kswapd will keep us above
> the level of free memory where we'd fall back to another zone to allocate,
> won't it?
>  

Nope. Take a look at the first loop-through-the-zones in alloc_pages
(preferably in akpm's tree that is cleaned up a bit).

We go through *all* zones first and allocate them down to pages_low
before kicking kswapd.

I have tried kicking kswapd before going off node, but it frees memory
really aggressively - so you're nearly left with a local alloc policy.

> 
>>There are a couple of issues. The first is that you need to minimise
>>regressions for when working set size is bigger than the local node.
> 
> 
> Good point ... that is, indeed, a total bitch to fix.
> 

At the end of the day we'll possibly just have to have a sysctl. I
don't think all regressions could be eliminated completely. We'll
see.

> 
>>I have a patch going now that just reclaims use-once file cache before
>>going off node. Seems to help a bit for basic things that just push
>>pagecache through the system. It definitely reduces remote allocations
>>by several orders of magnitude for those cases.
> 
> 
> Makes sense, but doesn't the same thing make sense on a global basis?
> I don't feel NUMA is anything magical here ...
> 

Didn't parse that. If you mean the transition from highmem->normal->dma
zones, I don't think that should be treated in this way. Imagine small
highmem zones for example. We have the lower zone protection in place
for that case - and that is something that in turn isn't good for NUMA,
because the SGI guys (I think) already ran into that and fixed it to be
per-node only.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] allocate page caches pages in round robin fasion
@ 2004-08-12 23:38 Jesse Barnes
  2004-08-13  1:36 ` Dave Hansen
  0 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-08-12 23:38 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: steiner

On a NUMA machine, page cache pages should be spread out across the system 
since they're generally global in nature and can eat up whole nodes worth of 
memory otherwise.  This can end up hurting performance since jobs will have 
to make off node references for much or all of their non-file data.

The patch works by adding an alloc_page_round_robin routine that simply 
allocates on successive nodes each time its called, based on the value of a 
per-cpu variable modulo the number of nodes.  The variable is per-cpu to 
avoid cacheline contention when many cpus try to do page cache allocations at 
once.

After dd if=/dev/zero of=/tmp/bigfile bs=1G count=2 on a stock kernel:
Node 7 MemUsed:         49248 kB
Node 6 MemUsed:         42176 kB
Node 5 MemUsed:        316880 kB
Node 4 MemUsed:         36160 kB
Node 3 MemUsed:         45152 kB
Node 2 MemUsed:         50000 kB
Node 1 MemUsed:         68704 kB
Node 0 MemUsed:       2426256 kB

and after the patch:
Node 7 MemUsed:        328608 kB
Node 6 MemUsed:        319424 kB
Node 5 MemUsed:        318608 kB
Node 4 MemUsed:        321600 kB
Node 3 MemUsed:        319648 kB
Node 2 MemUsed:        327504 kB
Node 1 MemUsed:        389504 kB
Node 0 MemUsed:        744752 kB

Signed-off-by: Jesse Barnes <jbarnes@sgi.com>

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] allocate page caches pages in round robin fasion
  2004-08-12 23:38 Jesse Barnes
@ 2004-08-13  1:36 ` Dave Hansen
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Hansen @ 2004-08-13  1:36 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Andrew Morton, Linux Kernel Mailing List, steiner

On Thu, 2004-08-12 at 16:38, Jesse Barnes wrote:
> On a NUMA machine, page cache pages should be spread out across the system 
> since they're generally global in nature and can eat up whole nodes worth of 
> memory otherwise.  This can end up hurting performance since jobs will have 
> to make off node references for much or all of their non-file data.

Wouldn't this be painful for any workload that accesses a unique set of
files on each node?  If an application knows that it is touching truly
shared data which every node could possibly access, then they can use
the NUMA API to cause round-robin allocations to occur.  

Maybe a per-node watermark on page cache usage would be more useful. 
Once a node starts to get full, and it's past the watermark, we can go
and shoot down some of the node's page cache.  If the data access is
truly global, then it has a good chance of being brought in on a
different node.  

-- Dave

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2004-08-14  1:21 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <fa.hmbmqn2.d4ef9c@ifi.uio.no>
     [not found] ` <fa.g1i2d5e.1kgqq80@ifi.uio.no>
2004-08-13 16:33   ` [PATCH] allocate page caches pages in round robin fasion Ray Bryant
     [not found] <fa.hmrqqf6.ckie1e@ifi.uio.no>
     [not found] ` <fa.cg3cafa.ngi9og@ifi.uio.no>
2004-08-13 17:31   ` Ray Bryant
     [not found] <2sxuC-429-3@gated-at.bofh.it>
2004-08-13  1:14 ` Andi Kleen
2004-08-13  1:26   ` William Lee Irwin III
2004-08-13  1:29   ` Jesse Barnes
2004-08-13 16:04   ` Jesse Barnes
2004-08-13 17:31     ` Brent Casavant
2004-08-13 20:16       ` Andi Kleen
2004-08-12 23:46 Jesse Barnes
2004-08-13  0:13 ` William Lee Irwin III
2004-08-13  0:25   ` Jesse Barnes
2004-08-13  0:32     ` William Lee Irwin III
2004-08-13 14:50 ` Martin J. Bligh
2004-08-13 15:59   ` Jesse Barnes
2004-08-13 16:20     ` Martin J. Bligh
2004-08-13 16:34       ` Jesse Barnes
2004-08-13 16:47         ` Martin J. Bligh
2004-08-13 17:31           ` Nick Piggin
2004-08-13 21:16             ` Martin J. Bligh
2004-08-13 22:59               ` Martin J. Bligh
2004-08-14  1:21               ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2004-08-12 23:38 Jesse Barnes
2004-08-13  1:36 ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).