public inbox for linux-ia64@vger.kernel.org
* pgd_free, pmd_free, and pte_free trapping memory.
@ 2004-03-16 11:24 Robin Holt
  2004-03-16 14:48 ` Christoph Hellwig
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Robin Holt @ 2004-03-16 11:24 UTC (permalink / raw)
  To: linux-ia64


On a 512 CPU system with 256 numa nodes, we have an application which
forks 500 worker threads.  During each fork, 32 pages are allocated on
the node where the main thread is doing the forks.  The child threads
then use sched_setaffinity to migrate to a different cpu.  After the
application exits, we are losing approx 15,000 pages on the main node.
If we echo "0 0" >/proc/sys/vm/pagetable_cache, the memory gets returned.
This was run on a 2.4 kernel, but the code in question is identical in 2.6.

Note: because of memory size, pagetable_cache sizes are 25 for min and
15559 for max.
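
In outline, the workload looks something like the sketch below (simplified,
and assuming the sched_setaffinity(2) interface; the real application may
use a different mechanism to migrate its workers):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int i;

		for (i = 0; i < 500; i++) {
			if (fork() == 0) {
				/* child: page tables were allocated on the forking node */
				cpu_set_t mask;

				CPU_ZERO(&mask);
				CPU_SET(i % 512, &mask);	/* move to some other cpu */
				sched_setaffinity(0, sizeof(mask), &mask);
				/* ... do work ... */
				_exit(0);	/* page tables are freed on the remote cpu */
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}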

Looking through the code, we have identified the source of the problem.
The fork is occurring on one cpu, where the pgd, pmd, and pte allocations
get pages of memory local to that cpu.  The worker thread is then
migrated to a different cpu, where it exits.  The pages are then placed
on the quicklist of a cpu which is very distant from the node where the
memory is located.

I looked at the i386 code which appears to have been very similar to the
ia64 at one point in time, but no longer.  They appear to have completely
eliminated the quicklists.  Is this the right direction for ia64?

Since the pgd, pmd, and pte pages have already been zeroed out again by
the time they are ready to be freed, I understand the benefit of keeping
the entry around to save the time of zeroing the page again.  Why not
have a single quicklist where all three are placed?  How would node
locality best play into placing items on the lists?  Should we have one
quicklist on each cpu, to which that cpu returns node-local pages, and
then a per-node quicklist where we place pages that are not node local,
using cmpxchg?
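
To make that concrete, here is a rough sketch of what the free side could
look like; node_ql and pgtable_free_to_quicklist are made-up names, and this
ignores how and when the owning node's cpus would drain their list back into
their own quicklists or to the page allocator:

	/* Hypothetical sketch: free a page-table page to this cpu's
	 * quicklist if it is node local, otherwise push it onto a
	 * per-node list with cmpxchg. */
	struct node_quicklist {
		unsigned long head;	/* virt addr of first free page, 0 if empty */
	};
	extern struct node_quicklist node_ql[];	/* one entry per node */

	static inline void
	pgtable_free_to_quicklist (void *addr)
	{
		int nid = page_zone(virt_to_page(addr))->zone_pgdat->node_id;

		if (nid == numa_node_id()) {
			/* node local: push onto this cpu's quicklist, as today */
			*(unsigned long *) addr = (unsigned long) pte_quicklist;
			pte_quicklist = (unsigned long *) addr;
			++pgtable_cache_size;
		} else {
			/* remote: push onto the owning node's list with cmpxchg */
			unsigned long old;

			do {
				old = node_ql[nid].head;
				*(unsigned long *) addr = old;
			} while (cmpxchg(&node_ql[nid].head, old,
					 (unsigned long) addr) != old);
		}
	}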

One other related but different question: the pagetable_cache max of
15,559 seems a little large.  Given that this machine has a large amount
of memory, I understand why it doesn't seem too outrageously large.
What role should node memory play in setting the pagetable_cache max size?

As a simple diff to open the discussion, I have included the following patch.
In our test above, the patch prevents the pages from being trapped.
The method is simple: if the page being freed is not physically on this
node, it is returned to the page allocator; otherwise it is added to the
quicklist.

Thanks for your attention,
Robin Holt

--- /usr/tmp/TmpDir.9611-0/linux/include/asm-ia64/pgalloc.h_1.15  Tue Mar 16 05:13:05 2004
+++ linux/include/asm-ia64/pgalloc.h      Tue Mar 16 05:12:55 2004
@@ -18,6 +18,7 @@
 #include <linux/compiler.h>
 #include <linux/mm.h>
 #include <linux/threads.h>
+#include <linux/mmzone.h>
 
 #include <asm/mmu_context.h>
 #include <asm/processor.h>
@@ -65,6 +66,12 @@
 static inline void
 pgd_free (pgd_t *pgd)
 {
+#ifdef CONFIG_NUMA
+       if (page_zone(virt_to_page(pgd))->zone_pgdat->node_id != numa_node_id()) {
+               free_page((unsigned long) pgd);
+               return;
+       }
+#endif
        *(unsigned long *)pgd = (unsigned long) pgd_quicklist;
        pgd_quicklist = (unsigned long *) pgd;
        ++pgtable_cache_size;
@@ -103,6 +110,12 @@
 static inline void
 pmd_free (pmd_t *pmd)
 {
+#ifdef CONFIG_NUMA
+       if (page_zone(virt_to_page(pmd))->zone_pgdat->node_id != numa_node_id()) {
+               free_page((unsigned long) pmd);
+               return;
+       }
+#endif
        *(unsigned long *)pmd = (unsigned long) pmd_quicklist;
        pmd_quicklist = (unsigned long *) pmd;
        ++pgtable_cache_size;
@@ -141,6 +154,12 @@
 static inline void
 pte_free (pte_t *pte)
 {
+#ifdef CONFIG_NUMA
+       if (page_zone(virt_to_page(pte))->zone_pgdat->node_id != numa_node_id()) {
+               free_page((unsigned long) pte);
+               return;
+       }
+#endif
        *(unsigned long *)pte = (unsigned long) pte_quicklist;
        pte_quicklist = (unsigned long *) pte;
        ++pgtable_cache_size;



* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
@ 2004-03-16 14:48 ` Christoph Hellwig
  2004-03-16 15:24 ` Robin Holt
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2004-03-16 14:48 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 16, 2004 at 05:24:24AM -0600, Robin Holt wrote:
> Looking through the code, we have identified the source of the problem.
> The fork is occurring on one cpu, where the pgd, pmd, and pte allocations
> get pages of memory local to that cpu.  The worker thread is then
> migrated to a different cpu, where it exits.  The pages are then placed
> on the quicklist of a cpu which is very distant from the node where the
> memory is located.
> 
> I looked at the i386 code which appears to have been very similar to the
> ia64 at one point in time, but no longer.  They appear to have completely
> eliminated the quicklists.  Is this the right direction for ia64?
> 
> Since the pgd, pmd, and pte pages have already been zeroed out again by
> the time they are ready to be freed, I understand the benefit of keeping
> the entry around to save the time of zeroing the page again.  Why not
> have a single quicklist where all three are placed?  How would node
> locality best play into placing items on the lists?  Should we have one
> quicklist on each cpu, to which that cpu returns node-local pages, and
> then a per-node quicklist where we place pages that are not node local,
> using cmpxchg?

This quicklist thing is a workaround for not having per-cpu pages in
Linux <= 2.4.  Your patch is a workaround for a workaround and gets a little
ugly.  I'd say just rip the quicklists out like x86 and benchmark it.

That's less code and thus less complexity, which is always good.  Now if
the pre-zeroing actually makes a difference we might have to keep a small
pre-zeroed list around, but I doubt this is really a good idea (or even
necessary).
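
For reference, with the quicklists gone the ia64 free paths would reduce to
roughly the following (a sketch only, not the actual x86 change), with the
corresponding *_alloc_one() paths always going to the page allocator and
clearing the page:

	static inline void
	pte_free (pte_t *pte)
	{
		free_page((unsigned long) pte);
	}

pgd_free() and pmd_free() would shrink the same way.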



* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
  2004-03-16 14:48 ` Christoph Hellwig
@ 2004-03-16 15:24 ` Robin Holt
  2004-03-16 15:34 ` Christoph Hellwig
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Robin Holt @ 2004-03-16 15:24 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 16, 2004 at 02:48:20PM +0000, Christoph Hellwig wrote:
> On Tue, Mar 16, 2004 at 05:24:24AM -0600, Robin Holt wrote:
> > Looking through the code, we have identified the source of the problem.
> > The fork is occurring on one cpu, where the pgd, pmd, and pte allocations
> > get pages of memory local to that cpu.  The worker thread is then
> > migrated to a different cpu, where it exits.  The pages are then placed
> > on the quicklist of a cpu which is very distant from the node where the
> > memory is located.
> > 
> > I looked at the i386 code which appears to have been very similar to the
> > ia64 at one point in time, but no longer.  They appear to have completely
> > eliminated the quicklists.  Is this the right direction for ia64?
> > 
> > Since the pgd, pmd, and pte pages have already been zeroed out again by
> > the time they are ready to be freed, I understand the benefit of keeping
> > the entry around to save the time of zeroing the page again.  Why not
> > have a single quicklist where all three are placed?  How would node
> > locality best play into placing items on the lists?  Should we have one
> > quicklist on each cpu, to which that cpu returns node-local pages, and
> > then a per-node quicklist where we place pages that are not node local,
> > using cmpxchg?
> 
> This quicklist thing is a workaround for not having per-cpu pages in
> Linux <= 2.4.  Your patch is a workaround for a workaround and gets a little
> ugly.  I'd say just rip the quicklists out like x86 and benchmark it.

I have a kernel with these ripped out.  I have run one simple Aim7 run
on a 32P system.  The performance fell in the noise range of a normal
Aim7 run.  Is this a good test to run?  Should I focus on any specific
benchmark, or run a suite?

> 
> That's less code and thus less complexity, which is always good.  Now if
> the pre-zeroing actually makes a difference we might have to keep a small
> pre-zeroed list around, but I doubt this is really a good idea (or even
> necessary).

The page zeroing costs 4uSec per page (I believe that is the number).
With a typical fork taking approx 40 pages, that should be felt during
an Aim7 run.  It looks like caches are masking some of that out.


* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
  2004-03-16 14:48 ` Christoph Hellwig
  2004-03-16 15:24 ` Robin Holt
@ 2004-03-16 15:34 ` Christoph Hellwig
  2004-03-17 12:27 ` Robin Holt
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2004-03-16 15:34 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 16, 2004 at 09:24:55AM -0600, Robin Holt wrote:
> I have a kernel with these ripped out.  I have run one simple Aim7 run
> on a 32P system.  The performance fell in the noise range of a normal
> Aim7 run.  Is this a good test to run?  Should I focus on any specific
> benchmark, or run a suite?

I'm not actually sure.  You could ask Ingo Molnar who implemented both
the per-cpu pages and ripped out the x86 quicklists if I remember correctly.

Ingo, any idea on how to benchmark that kind of thing best?

> > That's less code and thus less complexity, which is always good.  Now if
> > the pre-zeroing actually makes a difference we might have to keep a small
> > pre-zeroed list around, but I doubt this is really a good idea (or even
> > necessary).
> 
> The page zeroing costs 4uSec per page (I believe that is the number).
> With a typical fork taking approx 40 pages, that should be felt during
> an Aim7 run.  It looks like caches are masking some of that out.

OTOH you have more pages available, the real per-cpu pages have better
cache locality than the quicklists, the kernel has a smaller icache footprint,
etc.


* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
                   ` (2 preceding siblings ...)
  2004-03-16 15:34 ` Christoph Hellwig
@ 2004-03-17 12:27 ` Robin Holt
  2004-03-17 16:20 ` Jack Steiner
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Robin Holt @ 2004-03-17 12:27 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 16, 2004 at 10:06:18PM -0800, David Mosberger wrote:
> >>>>> On Tue, 16 Mar 2004 09:24:55 -0600, Robin Holt <holt@sgi.com> said:
> 
>   Robin> I have a kernel with these ripped out.  I have run one simple
>   Robin> Aim7 run on a 32P system.  The performance fell in the noise
>   Robin> range of a normal Aim7 run.  Is this a good test to run?
>   Robin> Should I focus on any specific benchmark, or run a suite?
> 
> For this kind of test, I'd rather see a micro-benchmark to establish
> the worst-case effect.  Then we have an upper bound and can make a
> judgment as to whether or not that's a likely scenario.  And yes, this
> would mean that you'd probably want to measure it on a relatively small
> system (certainly not NUMA, perhaps just UP).

I am currently looking into Christoph's suggestion of using a per
cpu page allocator.  Unfortunately, yesterday was a little busy and
I didn't get much of a chance to look into it.

> 
>   >>  That's less code and thus less complexity, which is always good.
>   >> Now if the pre-zeroing actually makes a difference we might have
>   >> to keep a small pre-zeroed list around, but I doubt this is really
>   >> a good idea (or even necessary).
> 
>   Robin> The page zeroing costs 4uSec per page (I believe that is the
>   Robin> number).  With a typical fork taking approx 40 pages, that
>   Robin> should be felt during an Aim7 run.  It looks like caches are
>   Robin> masking some of that out.
> 
> Try UP.  Also, what if the page-size is 64KB?  In any case, 4usec is a
> lot.

I got the 4uSec from Jack Steiner.  I don't know if he tested it with
64KB pages.  I will check.


* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
                   ` (3 preceding siblings ...)
  2004-03-17 12:27 ` Robin Holt
@ 2004-03-17 16:20 ` Jack Steiner
  2004-03-17 16:53 ` Christoph Hellwig
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jack Steiner @ 2004-03-17 16:20 UTC (permalink / raw)
  To: linux-ia64

> > 
> >   Robin> The page zeroing costs 4uSec per page (I believe that is the
> >   Robin> number).  With a typical fork taking approx 40 pages, that
> >   Robin> should be felt during an Aim7 run.  It looks like caches are
> >   Robin> masking some of that out.
> > 
> > Try UP.  Also, what if the page-size is 64KB?  In any case, 4usec is a
> > lot.
> 
> I got the 4uSec from Jack Steiner.  I don't know if he tested it with
> 64KB pages.  I will check.

I see the following times for the kernel "clearpage" routine. I run this
code in user space but I use the kernel assembly code for clearing pages.
AFAICT, the timing should be identical to running it in the kernel.

The test harness ensures that the timing is done with warm TLBs & cold 
cache (data not in cpu caches):

	 3.1 usec 16K node local memory
	 6.1 usec 16K remote memory

	12.5 usec 64K node local memory
	24.5 usec 64K remote memory

I ran this on Itanium 2 1300MHz cpus. However, processor core speed does
not significantly affect timings since most time is spent waiting for 
off chip memory access.

Also, timing will obviously vary across different platforms.
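
A crude user-space approximation of that kind of measurement is sketched
below; it does not reproduce the warm-TLB control of the real harness and it
uses memset() instead of the kernel clear_page assembly, so the numbers will
only be ballpark:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/time.h>

	#define PAGE_SZ	(16 * 1024)
	#define NPAGES	8192		/* 128MB, far larger than the caches */

	int main(void)
	{
		char *buf = malloc((size_t) NPAGES * PAGE_SZ);
		struct timeval t0, t1;
		double usec;
		int i;

		memset(buf, 1, (size_t) NPAGES * PAGE_SZ);	/* fault everything in once */

		/* By the time each page is cleared below, ~128MB of other data
		 * has been written since it was last touched, so it is cache cold. */
		gettimeofday(&t0, NULL);
		for (i = 0; i < NPAGES; i++)
			memset(buf + (size_t) i * PAGE_SZ, 0, PAGE_SZ);
		gettimeofday(&t1, NULL);

		usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
		printf("%.2f usec per %dK page\n", usec / NPAGES, PAGE_SZ / 1024);
		return 0;
	}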


-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
                   ` (4 preceding siblings ...)
  2004-03-17 16:20 ` Jack Steiner
@ 2004-03-17 16:53 ` Christoph Hellwig
  2004-03-17 17:10 ` David Mosberger
  2004-03-17 20:33 ` Jack Steiner
  7 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2004-03-17 16:53 UTC (permalink / raw)
  To: linux-ia64

On Wed, Mar 17, 2004 at 10:20:16AM -0600, Jack Steiner wrote:
> The test harness ensures that the timing is done with warm TLBs & cold 
> cache (data not in cpu caches):
> 
> 	 3.1 usec 16K node local memory
> 	 6.1 usec 16K remote memory
> 
> 	12.5 usec 64K node local memory
> 	24.5 usec 64K remote memory
> 
> I ran this on Itanium 2 1300MHz cpus. However, processor core speed does
> not significantly affect timings since most time is spent waiting for 
> off chip memory access.

Well, pages on the per-cpu list are supposed to be still cache hot..



* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
                   ` (5 preceding siblings ...)
  2004-03-17 16:53 ` Christoph Hellwig
@ 2004-03-17 17:10 ` David Mosberger
  2004-03-17 20:33 ` Jack Steiner
  7 siblings, 0 replies; 9+ messages in thread
From: David Mosberger @ 2004-03-17 17:10 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 17 Mar 2004 10:20:16 -0600, Jack Steiner <steiner@sgi.com> said:

  Jack> 3.1 usec 16K node local memory
  Jack> 6.1 usec 16K remote memory

  Jack> 12.5 usec 64K node local memory
  Jack> 24.5 usec 64K remote memory

Roughly 4.9GByte/sec for local memory.  That sounds about right.

Robin: I certainly agree that per-CPU allocation is the Right Thing
for SMP.  My point is that the quick-list optimization is orthogonal
and my gut feeling is that it's worth doing.  LMbench on a UP with
and without quicklist should be able to confirm that (and no, I don't
care at all how or where the quicklist optimization is
implemented---just as long as you can take advantage of the fact that
freed directories are already zeroed).
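
For example, something along the lines of lat_proc from LMbench, or a minimal
fork timing loop like the sketch below, run on the same UP kernel with and
without the quicklists:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/time.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	/* Time fork()+exit() pairs; page-table allocation and freeing happen
	 * on every iteration, so a quicklist change should show up directly. */
	int main(int argc, char **argv)
	{
		int i, iters = (argc > 1) ? atoi(argv[1]) : 10000;
		struct timeval t0, t1;
		double usec;

		gettimeofday(&t0, NULL);
		for (i = 0; i < iters; i++) {
			pid_t pid = fork();

			if (pid == 0)
				_exit(0);
			waitpid(pid, NULL, 0);
		}
		gettimeofday(&t1, NULL);

		usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
		printf("%.1f usec per fork+exit\n", usec / iters);
		return 0;
	}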

	--david


* Re: pgd_free, pmd_free, and pte_free trapping memory.
  2004-03-16 11:24 pgd_free, pmd_free, and pte_free trapping memory Robin Holt
                   ` (6 preceding siblings ...)
  2004-03-17 17:10 ` David Mosberger
@ 2004-03-17 20:33 ` Jack Steiner
  7 siblings, 0 replies; 9+ messages in thread
From: Jack Steiner @ 2004-03-17 20:33 UTC (permalink / raw)
  To: linux-ia64

On Wed, Mar 17, 2004 at 04:53:01PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 17, 2004 at 10:20:16AM -0600, Jack Steiner wrote:
> > The test harness ensures that the timing is done with warm TLBs & cold 
> > cache (data not in cpu caches):
> > 
> > 	 3.1 usec 16K node local memory
> > 	 6.1 usec 16K remote memory
> > 
> > 	12.5 usec 64K node local memory
> > 	24.5 usec 64K remote memory
> > 
> > I ran this on Itanium 2 1300MHz cpus. However, processor core speed does
> > not significantly affect timings since most time is spent waiting for 
> > off chip memory access.
> 
> Well, pages on the per-cpu list are supposed to be still cache hot..


For some workloads that is true. 

The workload that Robin mentioned in the original mail allocated 32 PT
pages on each fork. Even if the pages were recently used, in many
cases a large portion of those pages are probably not cache-resident.



-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



