[patch] #ifdef very expensive debug check in page fault path

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch] #ifdef very expensive debug check in page fault path
@ 2008-01-16 18:01 Carsten Otte
  2008-01-16 23:45 ` Nick Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Carsten Otte @ 2008-01-16 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, Nick Piggin, schwidefsky,
	holger.wolf

This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
that verifies that a pfn is valid. This patch increases performance of
the page fault microbenchmark in lmbench by 13% and overall dbench
performance by 7% on s390x.  pfn_valid() is an expensive operation on
s390 that needs a high double digit amount of CPU cycles.
Nick Piggin suggested that pfn_valid() involves an array lookup on
systems with sparsemem, and therefore is an expensive operation there
too.
The check looks like a clear debug thing to me, it should never trigger
on regular kernels. And if a pte is created for an invalid pfn, we'll
find out once the memory gets accessed later on anyway. Please consider
inclusion of this patch into mm.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
--- 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar
 			return NULL;
 	}

+#ifdef CONFIG_DEBUG_VM
 	/*
 	 * Add some anal sanity checks for now. Eventually,
 	 * we should just do "return pfn_to_page(pfn)", but
@@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar
 		print_bad_pte(vma, pte, addr);
 		return NULL;
 	}
+#endif

 	/*
 	 * NOTE! We still have PageReserved() pages in the page 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-16 18:01 [patch] #ifdef very expensive debug check in page fault path Carsten Otte
@ 2008-01-16 23:45 ` Nick Piggin
  2008-01-17  0:10   ` Andrew Morton
  2008-01-17  9:53   ` Martin Schwidefsky
  0 siblings, 2 replies; 12+ messages in thread
From: Nick Piggin @ 2008-01-16 23:45 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Andrew Morton, Linux Memory Management List, schwidefsky,
	holger.wolf, Hugh Dickins, Linus Torvalds

On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote:
> This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
> that verifies that a pfn is valid. This patch increases performance of
> the page fault microbenchmark in lmbench by 13% and overall dbench
> performance by 7% on s390x.  pfn_valid() is an expensive operation on
> s390 that needs a high double digit amount of CPU cycles.
> Nick Piggin suggested that pfn_valid() involves an array lookup on
> systems with sparsemem, and therefore is an expensive operation there
> too.
> The check looks like a clear debug thing to me, it should never trigger
> on regular kernels. And if a pte is created for an invalid pfn, we'll
> find out once the memory gets accessed later on anyway. Please consider
> inclusion of this patch into mm.
> 
> Signed-off-by: Carsten Otte <cotte@de.ibm.com>

Wow, that's a big performance hit for a few instructions ;)
I haven't seen it to be quite so expensive on x86, but it definitely is
not zero cost, especially with NUMA kernels. Thanks for getting those
numbers.

I posted a version which got rid of that big comment block too, but
no feedback as yet.

http://marc.info/?l=linux-arch&m=120046068604222&w=2

The one actual upside of this code is that if there is pte corruption
detected, the failure should be a little more graceful... but there
is also lots of pte corruption that could go undetected and cause much
worse problems anyway so I don't feel it is something that needs to
be turned on in production kernels. It could be a good debugging aid
to mm/ or device driver writers though.

Anyway, again I've cc'ed Hugh, because he nacked this same patch a
while back. So let's try to get him on board before merging anything.

If we get an ack, why not send this upstream for 2.6.24? Those s390
numbers are pretty insane.

> --- 
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar
>  			return NULL;
>  	}
>  
> +#ifdef CONFIG_DEBUG_VM
>  	/*
>  	 * Add some anal sanity checks for now. Eventually,
>  	 * we should just do "return pfn_to_page(pfn)", but
> @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar
>  		print_bad_pte(vma, pte, addr);
>  		return NULL;
>  	}
> +#endif
>  
>  	/*
>  	 * NOTE! We still have PageReserved() pages in the page 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-16 23:45 ` Nick Piggin
@ 2008-01-17  0:10   ` Andrew Morton
  2008-01-17  0:27     ` Nick Piggin
  2008-01-18 20:56     ` Hugh Dickins
  2008-01-17  9:53   ` Martin Schwidefsky
  1 sibling, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2008-01-17  0:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Carsten Otte, Linux Memory Management List, schwidefsky,
	holger.wolf, Hugh Dickins, Linus Torvalds

On Thu, 17 Jan 2008 00:45:40 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote:
> > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
> > that verifies that a pfn is valid. This patch increases performance of
> > the page fault microbenchmark in lmbench by 13% and overall dbench
> > performance by 7% on s390x.  pfn_valid() is an expensive operation on
> > s390 that needs a high double digit amount of CPU cycles.
> > Nick Piggin suggested that pfn_valid() involves an array lookup on
> > systems with sparsemem, and therefore is an expensive operation there
> > too.
> > The check looks like a clear debug thing to me, it should never trigger
> > on regular kernels. And if a pte is created for an invalid pfn, we'll
> > find out once the memory gets accessed later on anyway. Please consider
> > inclusion of this patch into mm.
> > 
> > Signed-off-by: Carsten Otte <cotte@de.ibm.com>
> 
> Wow, that's a big performance hit for a few instructions ;)
> I haven't seen it to be quite so expensive on x86, but it definitely is
> not zero cost, especially with NUMA kernels. Thanks for getting those
> numbers.
> 
> I posted a version which got rid of that big comment block too, but
> no feedback as yet.
> 
> http://marc.info/?l=linux-arch&m=120046068604222&w=2
> 
> The one actual upside of this code is that if there is pte corruption
> detected, the failure should be a little more graceful... but there
> is also lots of pte corruption that could go undetected and cause much
> worse problems anyway so I don't feel it is something that needs to
> be turned on in production kernels. It could be a good debugging aid
> to mm/ or device driver writers though.
> 
> Anyway, again I've cc'ed Hugh, because he nacked this same patch a
> while back. So let's try to get him on board before merging anything.
> 
> If we get an ack, why not send this upstream for 2.6.24? Those s390
> numbers are pretty insane.

I intend to merge this into 2.6.24.

> > --- 
> > Index: linux-2.6/mm/memory.c
> > ===================================================================
> > --- linux-2.6.orig/mm/memory.c
> > +++ linux-2.6/mm/memory.c
> > @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar
> >  			return NULL;
> >  	}
> >  
> > +#ifdef CONFIG_DEBUG_VM
> >  	/*
> >  	 * Add some anal sanity checks for now. Eventually,
> >  	 * we should just do "return pfn_to_page(pfn)", but
> > @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar
> >  		print_bad_pte(vma, pte, addr);
> >  		return NULL;
> >  	}
> > +#endif
> >  
> >  	/*
> >  	 * NOTE! We still have PageReserved() pages in the page 
> > 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-17  0:10   ` Andrew Morton
@ 2008-01-17  0:27     ` Nick Piggin
  2008-01-18 20:56     ` Hugh Dickins
  1 sibling, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2008-01-17  0:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Carsten Otte, Linux Memory Management List, schwidefsky,
	holger.wolf, Hugh Dickins, Linus Torvalds

On Wed, Jan 16, 2008 at 04:10:21PM -0800, Andrew Morton wrote:
> On Thu, 17 Jan 2008 00:45:40 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote:
> > > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
> > > that verifies that a pfn is valid. This patch increases performance of
> > > the page fault microbenchmark in lmbench by 13% and overall dbench
> > > performance by 7% on s390x.  pfn_valid() is an expensive operation on
> > > s390 that needs a high double digit amount of CPU cycles.
> > > Nick Piggin suggested that pfn_valid() involves an array lookup on
> > > systems with sparsemem, and therefore is an expensive operation there
> > > too.
> > > The check looks like a clear debug thing to me, it should never trigger
> > > on regular kernels. And if a pte is created for an invalid pfn, we'll
> > > find out once the memory gets accessed later on anyway. Please consider
> > > inclusion of this patch into mm.
> > > 
> > > Signed-off-by: Carsten Otte <cotte@de.ibm.com>
> > 
> > Wow, that's a big performance hit for a few instructions ;)
> > I haven't seen it to be quite so expensive on x86, but it definitely is
> > not zero cost, especially with NUMA kernels. Thanks for getting those
> > numbers.
> > 
> > I posted a version which got rid of that big comment block too, but
> > no feedback as yet.
> > 
> > http://marc.info/?l=linux-arch&m=120046068604222&w=2
> > 
> > The one actual upside of this code is that if there is pte corruption
> > detected, the failure should be a little more graceful... but there
> > is also lots of pte corruption that could go undetected and cause much
> > worse problems anyway so I don't feel it is something that needs to
> > be turned on in production kernels. It could be a good debugging aid
> > to mm/ or device driver writers though.
> > 
> > Anyway, again I've cc'ed Hugh, because he nacked this same patch a
> > while back. So let's try to get him on board before merging anything.
> > 
> > If we get an ack, why not send this upstream for 2.6.24? Those s390
> > numbers are pretty insane.
> 
> I intend to merge this into 2.6.24.

Good. If it helps,
Acked-by: Nick Piggin <npiggin@suse.de>

> 
> > > --- 
> > > Index: linux-2.6/mm/memory.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/memory.c
> > > +++ linux-2.6/mm/memory.c
> > > @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar
> > >  			return NULL;
> > >  	}
> > >  
> > > +#ifdef CONFIG_DEBUG_VM
> > >  	/*
> > >  	 * Add some anal sanity checks for now. Eventually,
> > >  	 * we should just do "return pfn_to_page(pfn)", but
> > > @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar
> > >  		print_bad_pte(vma, pte, addr);
> > >  		return NULL;
> > >  	}
> > > +#endif
> > >  
> > >  	/*
> > >  	 * NOTE! We still have PageReserved() pages in the page 
> > > 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-17  0:10   ` Andrew Morton
  2008-01-17  0:27     ` Nick Piggin
@ 2008-01-18 20:56     ` Hugh Dickins
  2008-01-21  9:45       ` Carsten Otte
  1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2008-01-18 20:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Carsten Otte, Linux Memory Management List,
	schwidefsky, holger.wolf, Linus Torvalds

On Wed, 16 Jan 2008, Andrew Morton wrote:
> On Thu, 17 Jan 2008 00:45:40 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > The one actual upside of this code is that if there is pte corruption
> > detected, the failure should be a little more graceful... but there
> > is also lots of pte corruption that could go undetected and cause much
> > worse problems anyway so I don't feel it is something that needs to
> > be turned on in production kernels. It could be a good debugging aid
> > to mm/ or device driver writers though.
> > 
> > Anyway, again I've cc'ed Hugh, because he nacked this same patch a
> > while back. So let's try to get him on board before merging anything.

Thanks, Nick.

> > 
> > If we get an ack, why not send this upstream for 2.6.24? Those s390
> > numbers are pretty insane.
> 
> I intend to merge this into 2.6.24.

And so you have already, commit 9723198c219f3546982cb469e5aed26e68399055

I'm very surprised that's considered suitable material for post-rc8.

We know very well that page tables are almost honeytraps for random
corruption, and we've long adhered to the practice of preceding any
pfn_to_page by a pfn_valid check, to make sure we don't veer off the
end of the mem_map (or whatever), causing our page manipulations to
corrupt unrelated memory with less foreseeable consequences.  That
good practice dates from long before some of the checking got
separated out into vm_normal_page.

That said, perhaps I'm over-reacting: I can remember countless rmap.c
BUGs or Eeeks, and countless Bad page states, and quite a number of
swap_free errors, all usually symptoms of similar corruption; yet
not a single instance of that Bad pte message which Nick was good
enough to add a couple of years back.

Hmm, it's my own memory that's bad: a grep through old mailboxes
does show them; though I've not stopped to look, to see whether
perhaps they can all be argued away on some good grounds.

Well: that patch still gets my Nack, but I guess I'm too late.  If
s390 pagetables are better protected than x86 ones, add an s390 ifdef?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-18 20:56     ` Hugh Dickins
@ 2008-01-21  9:45       ` Carsten Otte
  2008-01-22 22:35         ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Carsten Otte @ 2008-01-21  9:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Nick Piggin, Linux Memory Management List,
	mschwid2, Holger Wolf, Linus Torvalds

Hugh Dickins wrote:
> And so you have already, commit 9723198c219f3546982cb469e5aed26e68399055
> 
> I'm very surprised that's considered suitable material for post-rc8.
> 
> We know very well that page tables are almost honeytraps for random
> corruption, and we've long adhered to the practice of preceding any
> pfn_to_page by a pfn_valid check, to make sure we don't veer off the
> end of the mem_map (or whatever), causing our page manipulations to
> corrupt unrelated memory with less foreseeable consequences.  That
> good practice dates from long before some of the checking got
> separated out into vm_normal_page.
> 
> That said, perhaps I'm over-reacting: I can remember countless rmap.c
> BUGs or Eeeks, and countless Bad page states, and quite a number of
> swap_free errors, all usually symptoms of similar corruption; yet
> not a single instance of that Bad pte message which Nick was good
> enough to add a couple of years back.
> 
> Hmm, it's my own memory that's bad: a grep through old mailboxes
> does show them; though I've not stopped to look, to see whether
> perhaps they can all be argued away on some good grounds.
> 
> Well: that patch still gets my Nack, but I guess I'm too late.  If
> s390 pagetables are better protected than x86 ones, add an s390 ifdef?
The alternative would be to just make
#define pfn_valid(pfn) (1)
on s390. That would also get _us_ rid of the check while others do 
benefit. We would trap access to mem_map beyond its limits because we 
don't have a kernel mapping there. For us, it would not silently 
corrupt things but crash proper.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-21  9:45       ` Carsten Otte
@ 2008-01-22 22:35         ` Hugh Dickins
  2008-01-22 23:39           ` Nick Piggin
  2008-01-23  9:14           ` Martin Schwidefsky
  0 siblings, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2008-01-22 22:35 UTC (permalink / raw)
  To: carsteno
  Cc: Andrew Morton, Nick Piggin, Linux Memory Management List,
	mschwid2, Holger Wolf, Linus Torvalds

On Mon, 21 Jan 2008, Carsten Otte wrote:
> Hugh Dickins wrote:
> > 
> > Well: that patch still gets my Nack, but I guess I'm too late.  If
> > s390 pagetables are better protected than x86 ones, add an s390 ifdef?
> 
> The alternative would be to just make
> #define pfn_valid(pfn) (1)
> on s390. That would also get _us_ rid of the check while others do benefit. We
> would trap access to mem_map beyond its limits because we don't have a kernel
> mapping there. For us, it would not silently corrupt things but crash proper.

Whilst I quite like the sound of that, I wonder whether it's safe to
change s390's pfn_valid (rather surprisingly) for all its users.  And
note that nobody but me has voiced any regret at the loss of the check.
My guess is we let it rest for now, and reconsider if a case comes up
later which would have got caught by the check (but the problem is that
such a case is much harder to identify than it was).

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-22 22:35         ` Hugh Dickins
@ 2008-01-22 23:39           ` Nick Piggin
  2008-01-23  0:02             ` Hugh Dickins
  2008-01-23  9:14           ` Martin Schwidefsky
  1 sibling, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2008-01-22 23:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: carsteno, Andrew Morton, Linux Memory Management List, mschwid2,
	Holger Wolf, Linus Torvalds

On Tue, Jan 22, 2008 at 10:35:17PM +0000, Hugh Dickins wrote:
> On Mon, 21 Jan 2008, Carsten Otte wrote:
> > Hugh Dickins wrote:
> > > 
> > > Well: that patch still gets my Nack, but I guess I'm too late.  If
> > > s390 pagetables are better protected than x86 ones, add an s390 ifdef?
> > 
> > The alternative would be to just make
> > #define pfn_valid(pfn) (1)
> > on s390. That would also get _us_ rid of the check while others do benefit. We
> > would trap access to mem_map beyond its limits because we don't have a kernel
> > mapping there. For us, it would not silently corrupt things but crash proper.
> 
> Whilst I quite like the sound of that, I wonder whether it's safe to
> change s390's pfn_valid (rather surprisingly) for all its users.  And
> note that nobody but me has voiced any regret at the loss of the check.

I did want to get rid of the test, but not in a "sneak it in before he
notices" way. So I am disappointed it was merged before you replied.


> My guess is we let it rest for now, and reconsider if a case comes up
> later which would have got caught by the check (but the problem is that
> such a case is much harder to identify than it was).

The only cases I had imagined were repeatable things like a bug in pte
manipulation somewhere, which will hopefully be caught with
CONFIG_DEBUG_VM turned on. 

Are there many other cases where the test is useful? For hardware
failures, I'd say not -- those just tend to waste developers time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-22 23:39           ` Nick Piggin
@ 2008-01-23  0:02             ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2008-01-23  0:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: carsteno, Andrew Morton, Linux Memory Management List, mschwid2,
	Holger Wolf, Linus Torvalds

On Wed, 23 Jan 2008, Nick Piggin wrote:
> 
> I did want to get rid of the test, but not in a "sneak it in before he
> notices" way. So I am disappointed it was merged before you replied.

Not everybody can wait that indefinite interval for a response from me!

(And, by the by, I'm not ignoring the many mails you've addressed
to me or Cc'ed me in the last week or more; but some things are
easier to think about and come to conclusion on than others.
Take it as a compliment that your patches deserve consideration ;)

> > My guess is we let it rest for now, and reconsider if a case comes up
> > later which would have got caught by the check (but the problem is that
> > such a case is much harder to identify than it was).
> 
> The only cases I had imagined were repeatable things like a bug in pte
> manipulation somewhere, which will hopefully be caught with
> CONFIG_DEBUG_VM turned on. 

For things like that, repeatable occurrences from coding bugs,
which should get caught before release: yes I agree, the
CONFIG_DEBUG_VM would be entirely appropriate.

> Are there many other cases where the test is useful? For hardware
> failures, I'd say not -- those just tend to waste developers time.

Bad RAM bitflips etc., or some subsystem corrupting random memory:
those kind of things which so often end up as rmap.c Eeeks or Bad
page states.  Yes, a fair amount of developers time is wasted on
these: which is precisely why they're better caught sooner (by
a pfn_valid test in vm_normal_page) than later (by going on to
corrupt other memory in fictitious "struct page" manipulations).

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-22 22:35         ` Hugh Dickins
  2008-01-22 23:39           ` Nick Piggin
@ 2008-01-23  9:14           ` Martin Schwidefsky
  1 sibling, 0 replies; 12+ messages in thread
From: Martin Schwidefsky @ 2008-01-23  9:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: carsteno, Andrew Morton, Nick Piggin,
	Linux Memory Management List, mschwid2, Holger Wolf,
	Linus Torvalds

On Tue, 2008-01-22 at 22:35 +0000, Hugh Dickins wrote:
> On Mon, 21 Jan 2008, Carsten Otte wrote:
> > Hugh Dickins wrote:
> > > 
> > > Well: that patch still gets my Nack, but I guess I'm too late.  If
> > > s390 pagetables are better protected than x86 ones, add an s390 ifdef?
> > 
> > The alternative would be to just make
> > #define pfn_valid(pfn) (1)
> > on s390. That would also get _us_ rid of the check while others do benefit. We
> > would trap access to mem_map beyond its limits because we don't have a kernel
> > mapping there. For us, it would not silently corrupt things but crash proper.
> 
> Whilst I quite like the sound of that, I wonder whether it's safe to
> change s390's pfn_valid (rather surprisingly) for all its users.  And
> note that nobody but me has voiced any regret at the loss of the check.
> My guess is we let it rest for now, and reconsider if a case comes up
> later which would have got caught by the check (but the problem is that
> such a case is much harder to identify than it was).

Nick has said that pfn_valid as a primitive is supposed to return
whether a pfn has a struct page behind it or not. If you follow the
principle of least surprise it is not really an option to define
pfn_valid as (1).

So far the s390 method of using the kernel address space mapping to
implement pfn_valid has been correct. The new code that allows to have
memory areas without struct page changes things. All memory has to be
added to the kernel address space to be able to do user copy. pfn_valid
returns incorrect results for struct page less memory now. We are
searching for a fast way to implement a correct pfn_valid, it should not
use lists, it should not required a lock and it should not waste vast
amounts of memory. To make things worse there is not really a limit to
the maximum address, the full 64 bit address space can be used. That
makes 2**52 pages.
The best I could come up so far is a page table like scheme that uses 3
indirection levels and a bitfield at the end. Ugly, ugly..

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-16 23:45 ` Nick Piggin
  2008-01-17  0:10   ` Andrew Morton
@ 2008-01-17  9:53   ` Martin Schwidefsky
  2008-01-18  4:09     ` Nick Piggin
  1 sibling, 1 reply; 12+ messages in thread
From: Martin Schwidefsky @ 2008-01-17  9:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Carsten Otte, Andrew Morton, Linux Memory Management List,
	holger.wolf, Hugh Dickins, Linus Torvalds

On Thu, 2008-01-17 at 00:45 +0100, Nick Piggin wrote:
> On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote:
> > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
> > that verifies that a pfn is valid. This patch increases performance of
> > the page fault microbenchmark in lmbench by 13% and overall dbench
> > performance by 7% on s390x.  pfn_valid() is an expensive operation on
> > s390 that needs a high double digit amount of CPU cycles.
> > Nick Piggin suggested that pfn_valid() involves an array lookup on
> > systems with sparsemem, and therefore is an expensive operation there
> > too.
> > The check looks like a clear debug thing to me, it should never trigger
> > on regular kernels. And if a pte is created for an invalid pfn, we'll
> > find out once the memory gets accessed later on anyway. Please consider
> > inclusion of this patch into mm.
> > 
> > Signed-off-by: Carsten Otte <cotte@de.ibm.com>
> 
> Wow, that's a big performance hit for a few instructions ;)
> I haven't seen it to be quite so expensive on x86, but it definitely is
> not zero cost, especially with NUMA kernels. Thanks for getting those
> numbers.

These number have been a surprise. We knew that the LRA instruction we
use in pfn_valid has a cost, but from the cycle count we did not expect
that the difference in the minor fault benchmark would be 13%. Most
probably a cache effect.

I shortly discussed with Carsten what we should do with pfn_valid. One
idea was to make it a nop - always return 1. The current implementation
of pfn_valid uses the kernel address space mapping to decide if a page
frame is valid. All available memory areas that fit into the 4TB kernel
address space get mapped. If a page is mapped pfn_valid returns true.
But what is the background of pfn_valid, what does it protect against?
What is the exact semantics if pfn_valid returns true? From the name
page-frame-number-valid you could argue that it should always return
true if the number is smaller than 2**52. The number is valid, if there
is accessible memory is another question.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] #ifdef very expensive debug check in page fault path
  2008-01-17  9:53   ` Martin Schwidefsky
@ 2008-01-18  4:09     ` Nick Piggin
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2008-01-18  4:09 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Carsten Otte, Andrew Morton, Linux Memory Management List,
	holger.wolf, Hugh Dickins, Linus Torvalds

On Thu, Jan 17, 2008 at 10:53:34AM +0100, Martin Schwidefsky wrote:
> On Thu, 2008-01-17 at 00:45 +0100, Nick Piggin wrote:
> > On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote:
> > > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
> > > that verifies that a pfn is valid. This patch increases performance of
> > > the page fault microbenchmark in lmbench by 13% and overall dbench
> > > performance by 7% on s390x.  pfn_valid() is an expensive operation on
> > > s390 that needs a high double digit amount of CPU cycles.
> > > Nick Piggin suggested that pfn_valid() involves an array lookup on
> > > systems with sparsemem, and therefore is an expensive operation there
> > > too.
> > > The check looks like a clear debug thing to me, it should never trigger
> > > on regular kernels. And if a pte is created for an invalid pfn, we'll
> > > find out once the memory gets accessed later on anyway. Please consider
> > > inclusion of this patch into mm.
> > > 
> > > Signed-off-by: Carsten Otte <cotte@de.ibm.com>
> > 
> > Wow, that's a big performance hit for a few instructions ;)
> > I haven't seen it to be quite so expensive on x86, but it definitely is
> > not zero cost, especially with NUMA kernels. Thanks for getting those
> > numbers.
> 
> These number have been a surprise. We knew that the LRA instruction we
> use in pfn_valid has a cost, but from the cycle count we did not expect
> that the difference in the minor fault benchmark would be 13%. Most
> probably a cache effect.
> 
> I shortly discussed with Carsten what we should do with pfn_valid. One
> idea was to make it a nop - always return 1. The current implementation
> of pfn_valid uses the kernel address space mapping to decide if a page
> frame is valid. All available memory areas that fit into the 4TB kernel
> address space get mapped. If a page is mapped pfn_valid returns true.
> But what is the background of pfn_valid, what does it protect against?
> What is the exact semantics if pfn_valid returns true? From the name
> page-frame-number-valid you could argue that it should always return
> true if the number is smaller than 2**52. The number is valid, if there
> is accessible memory is another question.

It is supposed to return true if there is a valid struct page for the
pfn AFAIK.  s390 can probably get around without implementing it like
that because it is mostly only used in memory mapping setup code
(once we have removed it from vm_normal_page).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-01-23  9:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-16 18:01 [patch] #ifdef very expensive debug check in page fault path Carsten Otte
2008-01-16 23:45 ` Nick Piggin
2008-01-17  0:10   ` Andrew Morton
2008-01-17  0:27     ` Nick Piggin
2008-01-18 20:56     ` Hugh Dickins
2008-01-21  9:45       ` Carsten Otte
2008-01-22 22:35         ` Hugh Dickins
2008-01-22 23:39           ` Nick Piggin
2008-01-23  0:02             ` Hugh Dickins
2008-01-23  9:14           ` Martin Schwidefsky
2008-01-17  9:53   ` Martin Schwidefsky
2008-01-18  4:09     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).