public inbox for linux-ia64@vger.kernel.org
* show_mem() for ia64 discontig takes a really long time on large systems.
@ 2006-03-28 18:43 Robin Holt
  2006-03-28 19:16 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: Robin Holt @ 2006-03-28 18:43 UTC (permalink / raw)
  To: linux-ia64

Recently, we ran a large system out of memory and the oom_kill() appeared
to have frozen up.  When we looked at the backtraces, we noticed the cpu
was making progress, but apparently not fast progress.  As a simple test,
I did an 'echo m >/proc/sysrq-trigger' and that had not completed in more
than a half-hour.

The system was a fully populated 512 node SGI machine.  The way that
memory is physically laid out results in a single pgdat which covers
the node with two holes in it.  This is new hardware with larger gaps
between the chunks of memory than earlier versions had.  As show_mem()
is traversing the entire system's memory to print out stats on remaining
memory, it takes faults while trying to look at holes in the array of
struct pages.

At this point, I am looking for any sort of direction on what would be
a reasonable fix.  Should show_mem() be made to skip to a page aligned
point in the array when the fault fails?  Should we add the information
about start and end of hole to the pgdat?  Should we have one pgdat
per chunk?  Are there other better ideas out there?  Any direction would
be greatly appreciated.

Thanks,
Robin Holt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: show_mem() for ia64 discontig takes a really long time on
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
@ 2006-03-28 19:16 ` Dave Hansen
  2006-03-28 19:23 ` show_mem() for ia64 discontig takes a really long time on large systems Bob Picco
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2006-03-28 19:16 UTC (permalink / raw)
  To: linux-ia64

On Tue, 2006-03-28 at 12:43 -0600, Robin Holt wrote:
> The system was a fully populated 512 node SGI machine.  The way that
> memory is physically laid out results in a single pgdat which covers
> the node with two holes in it.  This is new hardware with larger gaps
> between the chunks of memory than earlier versions had.  As show_mem()
> is traversing the entire system's memory to print out stats on remaining
> memory, it takes faults while trying to look at holes in the array of
> struct pages.

Could you explain a bit how this works on ia64?  I know about the
vmem_map.  Is the time spent on filling TLB entries when you hit a
'struct page' that isn't backed by real memory?

> At this point, I am looking for any sort of direction on what would be
> a reasonable fix.  Should show_mem() be made to skip to a page aligned
> point in the array when the fault fails?

Yeah, this would be my first instinct.  Perhaps a function like:

unsigned long hole_nr_pages(unsigned long pfn)
{
}

For sparsemem, it could just return PAGES_PER_SECTION.  For
architectures like ia64, it could either return the minimum hole size,
or be smarter and go look in some arch-specific information to find the
real hole size.  
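
As a rough illustration of the sparsemem case, here is a minimal userspace
sketch of such a helper.  The PAGES_PER_SECTION value is made up for the
example, and rounding to the next section boundary (rather than returning
PAGES_PER_SECTION flat) is an assumption, not something settled here:

```c
#include <assert.h>

/* Hypothetical section size for this example; the real value is per-arch. */
#define PAGES_PER_SECTION (1UL << 14)

/*
 * Return how many pfns can be skipped, starting at an invalid pfn.
 * Assumption: validity is tracked per section, so every pfn up to the
 * next section boundary is invalid as well.
 */
static unsigned long hole_nr_pages(unsigned long pfn)
{
	unsigned long skip = PAGES_PER_SECTION - (pfn % PAGES_PER_SECTION);

	assert(skip >= 1 && skip <= PAGES_PER_SECTION);
	return skip;
}
```

Called on a section-aligned pfn this skips a whole section; called in the
middle of a section it skips only the remainder, so the caller never jumps
over the start of the next section.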

Maybe something like this in your show_mem():

        for_each_pgdat(pgdat) {
		...
                for(i = 0; i < pgdat->node_spanned_pages; i++) {
                        struct page *page;
                        if (pfn_valid(pgdat->node_start_pfn + i))
                                page = pfn_to_page(pgdat->node_start_pfn + i);
                        else
-				continue;
+			{
+				/* -1 to offset i++ */
+				i += hole_nr_pages(pgdat->node_start_pfn + i) - 1;
+				continue;
+			}

> Should we add the information
> about start and end of hole to the pgdat?

No.  No.  Please, no. :)

Sparsemem is pretty good at this already.  Also, the whole idea of
DISCONTIGMEM was to have a pgdat that describes a contiguous area.
We've massacred that concept with NUMA stuff since then, but that _was_
the original idea.  

> Should we have one pgdat per chunk?

That's one concept that probably won't work today.  I went and tried to
untangle DISCONTIG node ids from NUMA node ids one day and failed
miserably.  They're too intertwined.

> Are there other better ideas out there?  Any direction would
> be greatly appreciated.

Get rid of the silly vmem_map[] :)

-- Dave



* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
  2006-03-28 19:16 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
@ 2006-03-28 19:23 ` Bob Picco
  2006-03-28 19:34 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Bob Picco @ 2006-03-28 19:23 UTC (permalink / raw)
  To: linux-ia64

Robin Holt wrote:	[Tue Mar 28 2006, 01:43:16PM EST]
> Recently, we ran a large system out of memory and the oom_kill() appeared
> to have frozen up.  When we looked at the backtraces, we noticed the cpu
> was making progress, but apparently not fast progress.  As a simple test,
> I did an 'echo m >/proc/sysrq-trigger' and that had not completed in more
> than a half-hour.
> 
> The system was a fully populated 512 node SGI machine.  The way that
> memory is physically laid out results in a single pgdat which covers
> the node with two holes in it.  This is new hardware with larger gaps
> between the chunks of memory than earlier versions had.  As show_mem()
> is traversing the entire system's memory to print out stats on remaining
> memory, it takes faults while trying to look at holes in the array of
> struct pages.
> 
> At this point, I am looking for any sort of direction on what would be
> a reasonable fix.  Should show_mem() be made to skip to a page aligned
> point in the array when the fault fails?  Should we add the information
> about start and end of hole to the pgdat?  Should we have one pgdat
> per chunk?  Are there other better ideas out there?  Any direction would
> be greatly appreciated.
This could work but you need to be cautious because struct page for ia64
isn't a power of 2. Also this would have to be done conditionally because 
SPARSEMEM doesn't require it but of course VIRTUAL_MEM_MAP does.
> 
> Thanks,
> Robin Holt
you're welcome,

bob


* Re: show_mem() for ia64 discontig takes a really long time on
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
  2006-03-28 19:16 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
  2006-03-28 19:23 ` show_mem() for ia64 discontig takes a really long time on large systems Bob Picco
@ 2006-03-28 19:34 ` Dave Hansen
  2006-03-28 20:09 ` show_mem() for ia64 discontig takes a really long time on large systems Chen, Kenneth W
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2006-03-28 19:34 UTC (permalink / raw)
  To: linux-ia64

On Tue, 2006-03-28 at 14:23 -0500, Bob Picco wrote:
> This could work but you need to be cautious because struct page for ia64
> isn't a power of 2. Also this would have to be done conditionally because 
> SPARSEMEM doesn't require it but of course VIRTUAL_MEM_MAP does. 

If it is done correctly it at least won't _hurt_ sparsemem.  It should
save a bunch of (probably cached) references into the mem_section[]
array, along with a good number of arithmetic operations from each
pfn_valid().

-- Dave



* RE: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (2 preceding siblings ...)
  2006-03-28 19:34 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
@ 2006-03-28 20:09 ` Chen, Kenneth W
  2006-03-28 20:56 ` Robin Holt
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chen, Kenneth W @ 2006-03-28 20:09 UTC (permalink / raw)
  To: linux-ia64

Robin Holt wrote on Tuesday, March 28, 2006 10:43 AM
> Recently, we ran a large system out of memory and the oom_kill() appeared
> to have frozen up.  When we looked at the backtraces, we noticed the cpu
> was making progress, but apparently not fast progress.  As a simple test,
> I did an 'echo m >/proc/sysrq-trigger' and that had not completed in more
> than a half-hour.
> 
> The system was a fully populated 512 node SGI machine.  The way that
> memory is physically laid out results in a single pgdat which covers
> the node with two holes in it.  This is new hardware with larger gaps
> between the chunks of memory than earlier versions had.  As show_mem()
> is traversing the entire system's memory to print out stats on remaining
> memory, it takes faults while trying to look at holes in the array of
> struct pages.
> 
> At this point, I am looking for any sort of direction on what would be
> a reasonable fix.  Should show_mem() be made to skip to a page aligned
> point in the array when the fault fails?  Should we add the information
> about start and end of hole to the pgdat?  Should we have one pgdat
> per chunk?  Are there other better ideas out there?  Any direction would
> be greatly appreciated.


Can you walk the vmem_map's page table and look for non-zero entries, sort
of like implementing a find_next_valid_pfn?  That way you can step at pud
or pmd granularity.

- Ken


* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (3 preceding siblings ...)
  2006-03-28 20:09 ` show_mem() for ia64 discontig takes a really long time on large systems Chen, Kenneth W
@ 2006-03-28 20:56 ` Robin Holt
  2006-03-28 21:00 ` Robin Holt
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Robin Holt @ 2006-03-28 20:56 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 28, 2006 at 11:16:19AM -0800, Dave Hansen wrote:
> Could you explain a bit how this works on ia64?  I know about the
> vmem_map.  Is the time spent on filling TLB entries when you hit a
> 'struct page' that isn't backed by real memory?

Time is wasted trying to fill the TLB entry for the vmem_map.  When it
fails, show_mem() advances to the next page, which repeats the sequence.
Jack had thrown out a couple of suggestions.  One was essentially what
you proposed below.  The other was to advance i to point to the next page
of pfns.  He frowned when saying the second, but I don't recall exactly
why he frowned.

> Maybe something like this in your show_mem():
> 
>         for_each_pgdat(pgdat) {
> 		...
>                 for(i = 0; i < pgdat->node_spanned_pages; i++) {
>                         struct page *page;
>                         if (pfn_valid(pgdat->node_start_pfn + i))
>                                 page = pfn_to_page(pgdat->node_start_pfn + i);
>                         else
> -				continue;
> +			{
> +				/* -1 to offset i++ */
> +				i += hole_nr_pages(pgdat->node_start_pfn + i) - 1;
> +				continue;
> +			}
> 


* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (4 preceding siblings ...)
  2006-03-28 20:56 ` Robin Holt
@ 2006-03-28 21:00 ` Robin Holt
  2006-03-29  0:18 ` show_mem() for ia64 discontig takes a really long time on large KAMEZAWA Hiroyuki
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Robin Holt @ 2006-03-28 21:00 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 28, 2006 at 12:09:00PM -0800, Chen, Kenneth W wrote:
> Can you walk the vmem_map's page table and look for non-zero entries, sort
> of like implementing a find_next_valid_pfn?  That way you can step at pud
> or pmd granularity.

Short of breaking up nodes into multiple pgdat structures, I think this
will be the most efficient.  I will try this when I get the time.

Thanks,
Robin


* Re: show_mem() for ia64 discontig takes a really long time on large
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (5 preceding siblings ...)
  2006-03-28 21:00 ` Robin Holt
@ 2006-03-29  0:18 ` KAMEZAWA Hiroyuki
  2006-03-30 17:29 ` show_mem() for ia64 discontig takes a really long time on large systems Jack Steiner
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-03-29  0:18 UTC (permalink / raw)
  To: linux-ia64

On Tue, 28 Mar 2006 15:00:16 -0600
Robin Holt <holt@sgi.com> wrote:

> On Tue, Mar 28, 2006 at 12:09:00PM -0800, Chen, Kenneth W wrote:
> > Can you walk the vmem_map's page table and look for non-zero entries, sort
> > of like implementing a find_next_valid_pfn?  That way you can step at pud
> > or pmd granularity.
> 
> Short of breaking up nodes into multiple pgdat structures, I think this
> will be the most efficient.  I will try this when I get the time.
> 
Hmm... How about using efi_memmap_walk(filter_rsvd_memory, hogehoge) in
show_mem()?  This will work until you do memory hotplug.

-Kame



* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (6 preceding siblings ...)
  2006-03-29  0:18 ` show_mem() for ia64 discontig takes a really long time on large KAMEZAWA Hiroyuki
@ 2006-03-30 17:29 ` Jack Steiner
  2006-03-30 17:48 ` Chen, Kenneth W
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Jack Steiner @ 2006-03-30 17:29 UTC (permalink / raw)
  To: linux-ia64

On Tue, Mar 28, 2006 at 02:56:40PM -0600, Robin Holt wrote:
> On Tue, Mar 28, 2006 at 11:16:19AM -0800, Dave Hansen wrote:
> > Could you explain a bit how this works on ia64?  I know about the
> > vmem_map.  Is the time spent on filling TLB entries when you hit a
> > 'struct page' that isn't backed by real memory?
> 
> Time is wasted trying to fill the TLB entry for the vmem_map.  When it
> fails, show_mem() advances to the next page, which repeats the sequence.
> Jack had thrown out a couple of suggestions.  One was essentially what
> you proposed below.  The other was to advance i to point to the next page
> of pfns.  He frowned when saying the second, but I don't recall exactly
> why he frowned.

Advancing to the next page will be considerably faster but I wonder if
it is fast enough.

There are huge gaps in the virtual vmem_map. On shub2, for example, it
is possible to have 180GB of unpopulated memory in the holes
between memory banks on a node (mode=0).

Assuming 56 bytes per struct page, that gives:
	
	- 180GB = 11M pages 
	- 38000 pages of struct page entries
	- 38000 TLB faults to scan the holes in a node

That is a lot of TLB misses to scan a node. Multiply by 512 to
get the number of faults to scan a full 512-node system.

My gut feeling is that is not good enough. 
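
The arithmetic above can be checked with a few lines of C.  This sketch
assumes a 16KB page size (not stated in the message, but consistent with
the 11M figure) and uses the 56-byte struct page size quoted above.  The
names are invented for the example:

```c
#include <assert.h>

#define EX_PAGE_SIZE   (16UL * 1024)	/* assumed 16KB pages */
#define EX_STRUCT_PAGE 56UL		/* bytes per struct page, as quoted */

/* Number of page frames covered by a hole of hole_bytes. */
static unsigned long hole_pfns(unsigned long long hole_bytes)
{
	return (unsigned long)(hole_bytes / EX_PAGE_SIZE);
}

/* Number of vmem_map pages holding the struct page entries for nr_pfns. */
static unsigned long vmem_map_pages(unsigned long nr_pfns)
{
	unsigned long long bytes = (unsigned long long)nr_pfns * EX_STRUCT_PAGE;

	return (unsigned long)((bytes + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE);
}

/* Worked check for a 180GB hole. */
static void check_180gb_hole(void)
{
	unsigned long pfns = hole_pfns(180ULL << 30);

	assert(pfns == 11796480UL);		/* roughly 11.8M page frames */
	assert(vmem_map_pages(pfns) == 40320UL);	/* one TLB miss each */
}
```

With exact inputs the result is about 40000 vmem_map pages per node, the
same order of magnitude as the 38000 above, which starts from the rounded
11M figure.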



	


> 
> > Maybe something like this in your show_mem():
> > 
> >         for_each_pgdat(pgdat) {
> > 		...
> >                 for(i = 0; i < pgdat->node_spanned_pages; i++) {
> >                         struct page *page;
> >                         if (pfn_valid(pgdat->node_start_pfn + i))
> >                                 page = pfn_to_page(pgdat->node_start_pfn + i);
> >                         else
> > -				continue;
> > +			{
> > +				/* -1 to offset i++ */
> > +				i += hole_nr_pages(pgdat->node_start_pfn + i) - 1;
> > +				continue;
> > +			}
> > 

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* RE: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (7 preceding siblings ...)
  2006-03-30 17:29 ` show_mem() for ia64 discontig takes a really long time on large systems Jack Steiner
@ 2006-03-30 17:48 ` Chen, Kenneth W
  2006-03-30 18:18 ` Luck, Tony
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chen, Kenneth W @ 2006-03-30 17:48 UTC (permalink / raw)
  To: linux-ia64

Jack Steiner wrote on Thursday, March 30, 2006 9:29 AM
> > Time is wasted trying to fill the TLB entry for the vmem_map.  When it
> > fails, show_mem() advances to the next page, which repeats the sequence.
> > Jack had thrown out a couple of suggestions.  One was essentially what
> > you proposed below.  The other was to advance i to point to the next page
> > of pfns.  He frowned when saying the second, but I don't recall exactly
> > why he frowned.
> 
> Advancing to the next page will be considerably faster but I wonder if
> it is fast enough.
> 
> There are huge gaps in the virtual vmem_map. On shub2, for example, it
> is possible to have 180GB of unpopulated memory in the holes
> between memory banks on a node (mode=0).
> 
> Assuming 56 bytes per struct page, that gives:
> 	
> 	- 180GB = 11M pages 
> 	- 38000 pages of struct page entries
> 	- 38000 TLB faults to scan the holes in a node
> 
> That is a lot of TLB misses to scan a node. Multiply by 512 to
> get the number of faults to scan a full 512-node system.
> 
> My gut feeling is that is not good enough. 

What about the earlier proposal of advancing at pmd and pud granule by
walking the page table?  There it can walk at 32MB/64GB step.

- Ken


* RE: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (8 preceding siblings ...)
  2006-03-30 17:48 ` Chen, Kenneth W
@ 2006-03-30 18:18 ` Luck, Tony
  2006-03-30 18:26 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2006-03-30 18:18 UTC (permalink / raw)
  To: linux-ia64

> What about the earlier proposal of advancing at pmd and pud granule by
> walking the page table?  There it can walk at 32MB/64GB step.

That puts constraints on where the platform may align physical
memory.  32M may not be unreasonable (Can you even buy DIMMS that
small anymore?) But 64G is way too large.

Since we only deal with blocks of memory in IA64_GRANULE_SIZE
pieces (16M or 64M) that would seem to be the natural skip
amount.

-Tony


* RE: show_mem() for ia64 discontig takes a really long time on
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (9 preceding siblings ...)
  2006-03-30 18:18 ` Luck, Tony
@ 2006-03-30 18:26 ` Dave Hansen
  2006-03-30 18:28 ` show_mem() for ia64 discontig takes a really long time on large systems Luck, Tony
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2006-03-30 18:26 UTC (permalink / raw)
  To: linux-ia64

On Thu, 2006-03-30 at 10:18 -0800, Luck, Tony wrote:
> > What about the earlier proposal of advancing at pmd and pud granule by
> > walking the page table?  There it can walk at 32MB/64GB step.
> 
> That puts constraints on where the platform may align physical
> memory.  32M may not be unreasonable (Can you even buy DIMMS that
> small anymore?) But 64G is way too large.
> 
> Since we only deal with blocks of memory in IA64_GRANULE_SIZE
> pieces (16M or 64M) that would seem to be the natural skip
> amount.

Instead of tacking on yet another hack to vmem_map[], is this perhaps a
time to think about moving to sparsemem on systems where you see these
issues?  Are there remaining issues that ia64 has with it? 

-- Dave



* RE: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (10 preceding siblings ...)
  2006-03-30 18:26 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
@ 2006-03-30 18:28 ` Luck, Tony
  2006-03-30 18:34 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2006-03-30 18:28 UTC (permalink / raw)
  To: linux-ia64

> Since we only deal with blocks of memory in IA64_GRANULE_SIZE
> pieces (16M or 64M) that would seem to be the natural skip
> amount.

Oh, but that is probably still too slow. Jack said he has 180GB
holes ... 180GB/16M = 11520 useless probes and TLB misses.  We'd
be better off brute-force searching the efi_memmap for the next
address when we get a miss.
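
A brute-force search of that kind might look roughly like the sketch
below.  It is a userspace model only: struct mem_range and
next_valid_pfn() are hypothetical stand-ins invented for the example, not
the real EFI memory descriptor or any existing kernel interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one usable-memory entry. */
struct mem_range {
	unsigned long start_pfn;	/* first valid pfn */
	unsigned long end_pfn;		/* one past the last valid pfn */
};

/*
 * Brute-force scan: return the smallest valid pfn >= pfn, or limit if
 * no range contains or follows pfn below limit.  Ranges need not be
 * sorted, but each must be non-empty.
 */
static unsigned long next_valid_pfn(const struct mem_range *map, size_t n,
				    unsigned long pfn, unsigned long limit)
{
	unsigned long best = limit;
	size_t k;

	for (k = 0; k < n; k++) {
		assert(map[k].start_pfn < map[k].end_pfn);
		if (map[k].end_pfn <= pfn)
			continue;		/* range lies entirely below pfn */
		if (map[k].start_pfn <= pfn)
			return pfn;		/* pfn itself is inside a range */
		if (map[k].start_pfn < best)
			best = map[k].start_pfn;
	}
	return best;
}
```

The cost per miss is one pass over the map, independent of how large the
hole is, which is the point of the suggestion.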

-Tony


* RE: show_mem() for ia64 discontig takes a really long time on
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (11 preceding siblings ...)
  2006-03-30 18:28 ` show_mem() for ia64 discontig takes a really long time on large systems Luck, Tony
@ 2006-03-30 18:34 ` Dave Hansen
  2006-03-30 19:24 ` show_mem() for ia64 discontig takes a really long time on large systems Chen, Kenneth W
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2006-03-30 18:34 UTC (permalink / raw)
  To: linux-ia64

One other random idea...

We could call memory_present() and populate the sparsemem tables, but
only _use_ sparsemem in show_mem().  Not for normal pfn_to_page() or
pfn_valid().

-- Dave



* RE: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (12 preceding siblings ...)
  2006-03-30 18:34 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
@ 2006-03-30 19:24 ` Chen, Kenneth W
  2006-04-12  7:18 ` Robin Holt
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chen, Kenneth W @ 2006-03-30 19:24 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote on Thursday, March 30, 2006 10:19 AM
> > What about the earlier proposal of advancing at pmd and pud granule by
> > walking the page table?  There it can walk at 32MB/64GB step.
> 
> That puts constraints on where the platform may align physical
> memory.  32M may not be unreasonable (Can you even buy DIMMS that
> small anymore?) But 64G is way too large.
> 
> Since we only deal with blocks of memory in IA64_GRANULE_SIZE
> pieces (16M or 64M) that would seem to be the natural skip
> amount.

I don't mean to walk unconditionally at 32MB or 64GB granules.  What I
proposed earlier is to look at the page table of vmem_map and find the
next non-zero pte.  If you know a pmd entry is not present, then you
can skip the entire pte page (2048 pages, hence 32MB of the virtual
address space used by vmem_map; the actual memory represented by 32MB
of vmem_map space is a lot bigger than that!).  The same applies to pud,
i.e., a pud entry not present means the entire 64GB of virtual address
space it covers in vmem_map is empty.

Of course, when you find a non-zero entry in the pud, you then drill down
to the next non-zero pmd, and so on.  The loop then pops back
up upon hitting a non-zero pte.
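
The 32MB and 64GB figures can be reproduced with a small sketch.  It
assumes 16KB pages and 8-byte page table entries (which gives the 2048
entries per table page mentioned above), plus the 56-byte struct page
quoted earlier in the thread.  All names are invented for the example:

```c
#include <assert.h>

#define PT_PAGE_SIZE      (1ULL << 14)		/* assumed 16KB pages */
#define PT_PTRS_PER_TABLE (PT_PAGE_SIZE / 8)	/* 8-byte entries: 2048 */
#define PT_STRUCT_PAGE    56ULL			/* bytes per struct page */

/* Virtual span covered by one pmd entry, i.e. one full pte page. */
static unsigned long long pmd_entry_span(void)
{
	return PT_PTRS_PER_TABLE * PT_PAGE_SIZE;
}

/* Virtual span covered by one pud entry, i.e. one full pmd page. */
static unsigned long long pud_entry_span(void)
{
	return PT_PTRS_PER_TABLE * pmd_entry_span();
}

/* Physical memory described by vmem_bytes worth of struct page entries. */
static unsigned long long phys_described(unsigned long long vmem_bytes)
{
	return vmem_bytes / PT_STRUCT_PAGE * PT_PAGE_SIZE;
}

static void check_spans(void)
{
	assert(pmd_entry_span() == 32ULL << 20);	/* 32MB */
	assert(pud_entry_span() == 64ULL << 30);	/* 64GB */
	/* 32MB of vmem_map describes a bit over 9GB of physical memory. */
	assert(phys_described(32ULL << 20) > 9ULL << 30);
}
```

So under these assumptions a single absent pmd entry lets the scan skip
struct pages for roughly 9GB of physical address space in one step.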

- Ken


* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (13 preceding siblings ...)
  2006-03-30 19:24 ` show_mem() for ia64 discontig takes a really long time on large systems Chen, Kenneth W
@ 2006-04-12  7:18 ` Robin Holt
  2006-04-12 17:22 ` Chen, Kenneth W
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Robin Holt @ 2006-04-12  7:18 UTC (permalink / raw)
  To: linux-ia64

On Thu, Mar 30, 2006 at 09:48:18AM -0800, Chen, Kenneth W wrote:
> Jack Steiner wrote on Thursday, March 30, 2006 9:29 AM
> > > Time is wasted trying to fill the TLB entry for the vmem_map.  When it
> > > fails, show_mem() advances to the next page, which repeats the sequence.
> > > Jack had thrown out a couple of suggestions.  One was essentially what
> > > you proposed below.  The other was to advance i to point to the next page
> > > of pfns.  He frowned when saying the second, but I don't recall exactly
> > > why he frowned.
> > 
> > Advancing to the next page will be considerably faster but I wonder if
> > it is fast enough.
> > 
...
> > My gut feeling is that is not good enough. 
> 
> What about the earlier proposal of advancing at pmd and pud granule by
> walking the page table?  There it can walk at 32MB/64GB step.

Does the attached seem like the right direction?  I have tested it on
the simulator and it seems _much_ faster, but that is the simulator.
I have time reserved on the machine where the problem was first observed
later today to test it on actual hardware.

Thanks,
Robin


Index: linux-2.6/arch/ia64/mm/discontig.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/discontig.c	2006-04-11 16:06:54.243967238 -0500
+++ linux-2.6/arch/ia64/mm/discontig.c	2006-04-12 02:16:46.111406150 -0500
@@ -567,8 +567,68 @@ void show_mem(void)
 			struct page *page;
 			if (pfn_valid(pgdat->node_start_pfn + i))
 				page = pfn_to_page(pgdat->node_start_pfn + i);
-			else
+			else {
+				/* At the beginning of a hole. Search for the end. */
+				unsigned long end_address, next_page_offset;
+				unsigned long stop_address;
+
+				end_address = (unsigned long) &vmem_map[pgdat->node_start_pfn + i];
+				end_address = PAGE_ALIGN(end_address);
+
+				stop_address = (unsigned long) &vmem_map[
+					pgdat->node_start_pfn + pgdat->node_spanned_pages];
+
+				/* walk vmem_map page tables until valid pfn found */
+				do {
+					pgd_t *pgd;
+					pud_t *pud;
+					pmd_t *pmd;
+					pte_t *pte;
+
+					pgd = pgd_offset_k(end_address);
+					if (pgd_none(*pgd)) {
+						end_address += PTRS_PER_PUD *
+							       PTRS_PER_PMD *
+							       PTRS_PER_PTE *
+							       PAGE_SIZE;
+						continue;
+					}
+
+					pud = pud_offset(pgd, end_address);
+					if (pud_none(*pud)) {
+						end_address += PTRS_PER_PMD *
+							       PTRS_PER_PTE *
+							       PAGE_SIZE;
+						continue;
+					}
+
+					pmd = pmd_offset(pud, end_address);
+					if (pmd_none(*pmd)) {
+						end_address += PTRS_PER_PTE *
+							       PAGE_SIZE;
+						continue;
+					}
+
+					pte = pte_offset_kernel(pmd, end_address);
+retry_pte:
+					if (pte_none(*pte)) {
+						end_address += PAGE_SIZE;
+						pte++;
+						if ((end_address < stop_address) &&
+						    (end_address != ALIGN(end_address, 1UL << PMD_SHIFT)))
+							goto retry_pte;
+						continue;
+					}
+					/* Found next valid vmem_map page */
+					break;
+				} while (end_address < stop_address);
+
+				end_address = end_address - (unsigned long) vmem_map;
+				next_page_offset = ALIGN(end_address, sizeof(struct page));
+				i = next_page_offset / sizeof(struct page) - pgdat->node_start_pfn;
+
 				continue;
+			}
 			if (PageReserved(page))
 				reserved++;
 			else if (PageSwapCache(page))


* RE: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (14 preceding siblings ...)
  2006-04-12  7:18 ` Robin Holt
@ 2006-04-12 17:22 ` Chen, Kenneth W
  2006-04-12 17:36 ` Bob Picco
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chen, Kenneth W @ 2006-04-12 17:22 UTC (permalink / raw)
  To: linux-ia64

Robin Holt wrote on Wednesday, April 12, 2006 12:19 AM
> Does the attached seem like the right direction?  I have tested it on
> the simulator and it seems _much_ faster, but that is the simulator.
> I have time reserved on the machine where the problem was first observed
> later today to test it on actual hardware.

Very nice!


> +		next_page_offset = ALIGN(end_address, sizeof(struct page));

ALIGN only works for power-of-2 alignments; I think you would get very
unpleasant rounding with the above.
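
To make the rounding problem concrete, here is a small sketch.  EX_ALIGN
is a mask-based round-up written in the style of the kernel's ALIGN()
macro, and round_up_any() is a hypothetical division-based alternative
that works for any non-zero step, such as a 56-byte struct page:

```c
#include <assert.h>

/* Mask-based round-up; only correct when a is a power of 2. */
#define EX_ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* Round x up to a multiple of step; valid for any non-zero step. */
static unsigned long round_up_any(unsigned long x, unsigned long step)
{
	assert(step != 0);
	return (x + step - 1) / step * step;
}

static void check_rounding(void)
{
	/* Power of 2: both agree. */
	assert(EX_ALIGN(100UL, 64UL) == 128UL);
	assert(round_up_any(100UL, 64UL) == 128UL);

	/* 56 is not a power of 2: the mask version is not a multiple of 56. */
	assert(EX_ALIGN(100UL, 56UL) == 136UL);
	assert(round_up_any(100UL, 56UL) == 112UL);
}
```

The mask trick clears the bits set in (a - 1), which only corresponds to
"round to a multiple of a" when a has a single bit set.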

- Ken


* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (15 preceding siblings ...)
  2006-04-12 17:22 ` Chen, Kenneth W
@ 2006-04-12 17:36 ` Bob Picco
  2006-04-13  3:02 ` Robin Holt
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Bob Picco @ 2006-04-12 17:36 UTC (permalink / raw)
  To: linux-ia64

Robin Holt wrote:	[Wed Apr 12 2006, 03:18:55AM EDT]
> On Thu, Mar 30, 2006 at 09:48:18AM -0800, Chen, Kenneth W wrote:
> > Jack Steiner wrote on Thursday, March 30, 2006 9:29 AM
> > > > Time is wasted trying to fill the TLB entry for the vmem_map.  When it
> > > > fails, show_mem() advances to the next page, which repeats the sequence.
> > > > Jack had thrown out a couple of suggestions.  One was essentially what
> > > > you proposed below.  The other was to advance i to point to the next page
> > > > of pfns.  He frowned when saying the second, but I don't recall exactly
> > > > why he frowned.
> > > 
> > > Advancing to the next page will be considerably faster but I wonder if
> > > it is fast enough.
> > > 
> ...
> > > My gut feeling is that is not good enough. 
> > 
> > What about the earlier proposal of advancing at pmd and pud granule by
> > walking the page table?  There it can walk at 32MB/64GB step.
> 
> Does the attached seem like the right direction?  I have tested it on
> the simulator and it seems _much_ faster, but that is the simulator.
> I have time reserved on the machine where the problem was first observed
> later today to test it on actual hardware.
> 
> Thanks,
> Robin
> 
> 
> Index: linux-2.6/arch/ia64/mm/discontig.c
> ===================================================================
> --- linux-2.6.orig/arch/ia64/mm/discontig.c	2006-04-11 16:06:54.243967238 -0500
> +++ linux-2.6/arch/ia64/mm/discontig.c	2006-04-12 02:16:46.111406150 -0500
> @@ -567,8 +567,68 @@ void show_mem(void)
>  			struct page *page;
>  			if (pfn_valid(pgdat->node_start_pfn + i))
>  				page = pfn_to_page(pgdat->node_start_pfn + i);
> -			else
> +			else {
Looks great and will work for VIRTUAL_MEM_MAP.  It's too bad SPARSEMEM
can't be used, because then this probably wouldn't be an issue.  

I think this will break SPARSEMEM. Perhaps it's time to create a new
module with the VIRTUAL_MEM_MAP specific code in it, say vmemmap.c.
Lots of code in init.c, along with this new code, could reside in this new
module. Just a thought.  You'll need something that is optimized out at
compile time for SPARSEMEM, while for VIRTUAL_MEM_MAP it could be a
real function call.

bob
[snip]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (16 preceding siblings ...)
  2006-04-12 17:36 ` Bob Picco
@ 2006-04-13  3:02 ` Robin Holt
  2006-04-13 16:02 ` Robin Holt
  2006-04-13 16:15 ` Bob Picco
  19 siblings, 0 replies; 21+ messages in thread
From: Robin Holt @ 2006-04-13  3:02 UTC (permalink / raw)
  To: linux-ia64

On Wed, Apr 12, 2006 at 10:22:17AM -0700, Chen, Kenneth W wrote:
> Robin Holt wrote on Wednesday, April 12, 2006 12:19 AM
> > Does the attached seem like the right direction?  I have tested it on
> > the simulator and it seems _much_ faster, but that is the simulator.
> > I have time reserved on the machine where the problem was first observed
> > later today to test it on actual hardware.
> 
> Very nice!
> 
> 
> > +		next_page_offset = ALIGN(end_address, sizeof(struct page));
> 
> ALIGN only works on power of 2 alignment, I think you would get very
> unpleasant rounding with the above.

Fixed.  I will repost the patch in a few seconds.

Thanks,
Robin Holt
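
To illustrate the ALIGN point above: a minimal sketch, not taken from the
patch.  ALIGN_POW2 mirrors the usual mask-based shape of the kernel macro;
round_up_multiple() is a hypothetical helper name, and the 56-byte size in
the comment is only an assumed example of a non-power-of-2 struct size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask-based round-up, same shape as the kernel's ALIGN().
 * Only correct when a is a power of two.
 */
#define ALIGN_POW2(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/*
 * Round x up to the next multiple of size using division.
 * Works for any non-zero size, e.g. an assumed 56-byte structure.
 */
static uint64_t round_up_multiple(uint64_t x, uint64_t size)
{
	assert(size != 0);
	return ((x + size - 1) / size) * size;
}
```

With x = 100 and a size of 64 both forms give 128.  With a size of 56 the
division form gives 112, while the mask form gives 136, which is not a
multiple of 56.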

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (17 preceding siblings ...)
  2006-04-13  3:02 ` Robin Holt
@ 2006-04-13 16:02 ` Robin Holt
  2006-04-13 16:15 ` Bob Picco
  19 siblings, 0 replies; 21+ messages in thread
From: Robin Holt @ 2006-04-13 16:02 UTC (permalink / raw)
  To: linux-ia64

On Wed, Apr 12, 2006 at 01:36:40PM -0400, Bob Picco wrote:
> Looks great and will work for VIRTUAL_MEM_MAP.  It's too bad SPARSEMEM
> can't be used because this probably wouldn't be an issue.  
> 
> I think this will break SPARSEMEM. Perhaps it's time to create a new
> module with VIRTUAL_MEM_MAP specific code in it. Say like vmemmap.c.
> Lots of code in init.c and this new code could reside in this new
> module. Just a thought.  You'll need a compile time optimized out feature for 
> SPARSEMEM but for VIRTUAL_MEM_MAP a function call could be called.

Bob,

I forgot to address this concern.  I do nothing with SPARSEMEM and am
not really sure where to dive in.  What ia64 boxes use SPARSEMEM?
If I simply skip doing anything to i in the SPARSEMEM case, would
that be sufficient?  Essentially:

                        else {
#ifndef CONFIG_SPARSEMEM
                                i = find_next_valid_pfn_for_pgdat(pgdat, i) - 1;
#endif
                                continue;
                        }

Only done with the function definition above.

Thanks,
Robin
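
The effect of the "i = next valid pfn - 1; continue;" idiom in the loop
fragment above can be sketched in user space.  This is only a model under
stated assumptions: struct pfn_range, range_pfn_valid(), next_valid_pfn()
and count_valid_pages() are hypothetical names, valid memory is described
by a sorted array of [start, end) ranges rather than by page tables, and
the probe counter stands in for the cost of touching a hole.

```c
#include <assert.h>
#include <stddef.h>

/* One chunk of valid pfns, half-open: [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Stand-in for pfn_valid(): true if pfn lies inside any range. */
static int range_pfn_valid(const struct pfn_range *r, size_t n,
			   unsigned long pfn)
{
	size_t k;

	for (k = 0; k < n; k++)
		if (pfn >= r[k].start && pfn < r[k].end)
			return 1;
	return 0;
}

/*
 * First valid pfn >= pfn, or limit if there is none below limit.
 * Ranges must be sorted by start and must not overlap.
 */
static unsigned long next_valid_pfn(const struct pfn_range *r, size_t n,
				    unsigned long pfn, unsigned long limit)
{
	size_t k;

	for (k = 0; k < n; k++) {
		if (pfn < r[k].end) {
			unsigned long next = pfn > r[k].start ? pfn : r[k].start;

			return next < limit ? next : limit;
		}
	}
	return limit;
}

/* Count valid pages in [0, span), probing each hole only once. */
static unsigned long count_valid_pages(const struct pfn_range *r, size_t n,
				       unsigned long span,
				       unsigned long *probes)
{
	unsigned long i, pages = 0;

	assert(probes != NULL);
	*probes = 0;
	for (i = 0; i < span; i++) {
		(*probes)++;
		if (!range_pfn_valid(r, n, i)) {
			/* Jump past the hole; the loop's i++ lands on the next valid pfn. */
			i = next_valid_pfn(r, n, i, span) - 1;
			continue;
		}
		pages++;
	}
	return pages;
}
```

The "- 1" compensates for the loop increment.  Because i is invalid when
the jump is taken, the value returned is always greater than i, so the
subtraction cannot move the index backwards.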

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: show_mem() for ia64 discontig takes a really long time on large systems.
  2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
                   ` (18 preceding siblings ...)
  2006-04-13 16:02 ` Robin Holt
@ 2006-04-13 16:15 ` Bob Picco
  19 siblings, 0 replies; 21+ messages in thread
From: Bob Picco @ 2006-04-13 16:15 UTC (permalink / raw)
  To: linux-ia64

Robin Holt wrote:	[Thu Apr 13 2006, 12:02:52PM EDT]
> On Wed, Apr 12, 2006 at 01:36:40PM -0400, Bob Picco wrote:
> > Looks great and will work for VIRTUAL_MEM_MAP.  It's too bad SPARSEMEM
> > can't be used because this probably wouldn't be an issue.  
> > 
> > I think this will break SPARSEMEM. Perhaps it's time to create a new
> > module with VIRTUAL_MEM_MAP specific code in it. Say like vmemmap.c.
> > Lots of code in init.c and this new code could reside in this new
> > module. Just a thought.  You'll need a compile time optimized out feature for 
> > SPARSEMEM but for VIRTUAL_MEM_MAP a function call could be called.
> 
> Bob,
> 
> I forgot to address this concern.  I do nothing with SPARSEMEM and am
> not really sure where to dive in.  What ia64 boxes use SPARSEMEM?
> If I simply skip doing anything to i in the SPARSEMEM case, would
> that be sufficient?  Essentially:
> 
>                         else {
> #ifndef CONFIG_SPARSEMEM
>                                 i = find_next_valid_pfn_for_pgdat(pgdat, i) - 1;
> #endif
>                                 continue;
>                         }
> 
> Only done with the function definition above.
> 
> Thanks,
> Robin
Robin,

Sorry just saw this. We had a 16 hour imap server outage.  I posted to your 
patch about this issue.

bob

^ permalink raw reply	[flat|nested] 21+ messages in thread

Thread overview: 21+ messages
2006-03-28 18:43 show_mem() for ia64 discontig takes a really long time on large systems Robin Holt
2006-03-28 19:16 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
2006-03-28 19:23 ` show_mem() for ia64 discontig takes a really long time on large systems Bob Picco
2006-03-28 19:34 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
2006-03-28 20:09 ` show_mem() for ia64 discontig takes a really long time on large systems Chen, Kenneth W
2006-03-28 20:56 ` Robin Holt
2006-03-28 21:00 ` Robin Holt
2006-03-29  0:18 ` show_mem() for ia64 discontig takes a really long time on large KAMEZAWA Hiroyuki
2006-03-30 17:29 ` show_mem() for ia64 discontig takes a really long time on large systems Jack Steiner
2006-03-30 17:48 ` Chen, Kenneth W
2006-03-30 18:18 ` Luck, Tony
2006-03-30 18:26 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
2006-03-30 18:28 ` show_mem() for ia64 discontig takes a really long time on large systems Luck, Tony
2006-03-30 18:34 ` show_mem() for ia64 discontig takes a really long time on Dave Hansen
2006-03-30 19:24 ` show_mem() for ia64 discontig takes a really long time on large systems Chen, Kenneth W
2006-04-12  7:18 ` Robin Holt
2006-04-12 17:22 ` Chen, Kenneth W
2006-04-12 17:36 ` Bob Picco
2006-04-13  3:02 ` Robin Holt
2006-04-13 16:02 ` Robin Holt
2006-04-13 16:15 ` Bob Picco
