* [patch] #ifdef very expensive debug check in page fault path @ 2008-01-16 18:01 Carsten Otte 2008-01-16 23:45 ` Nick Piggin 0 siblings, 1 reply; 12+ messages in thread From: Carsten Otte @ 2008-01-16 18:01 UTC (permalink / raw) To: Andrew Morton Cc: Linux Memory Management List, Nick Piggin, schwidefsky, holger.wolf This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page that verifies that a pfn is valid. This patch increases performance of the page fault microbenchmark in lmbench by 13% and overall dbench performance by 7% on s390x. pfn_valid() is an expensive operation on s390 that needs a high double digit amount of CPU cycles. Nick Piggin suggested that pfn_valid() involves an array lookup on systems with sparsemem, and therefore is an expensive operation there too. The check looks like a clear debug thing to me, it should never trigger on regular kernels. And if a pte is created for an invalid pfn, we'll find out once the memory gets accessed later on anyway. Please consider inclusion of this patch into mm. Signed-off-by: Carsten Otte <cotte@de.ibm.com> --- Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar return NULL; } +#ifdef CONFIG_DEBUG_VM /* * Add some anal sanity checks for now. Eventually, * we should just do "return pfn_to_page(pfn)", but @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar print_bad_pte(vma, pte, addr); return NULL; } +#endif /* * NOTE! We still have PageReserved() pages in the page -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-16 18:01 [patch] #ifdef very expensive debug check in page fault path Carsten Otte @ 2008-01-16 23:45 ` Nick Piggin 2008-01-17 0:10 ` Andrew Morton 2008-01-17 9:53 ` Martin Schwidefsky 0 siblings, 2 replies; 12+ messages in thread From: Nick Piggin @ 2008-01-16 23:45 UTC (permalink / raw) To: Carsten Otte Cc: Andrew Morton, Linux Memory Management List, schwidefsky, holger.wolf, Hugh Dickins, Linus Torvalds On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote: > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page > that verifies that a pfn is valid. This patch increases performance of > the page fault microbenchmark in lmbench by 13% and overall dbench > performance by 7% on s390x. pfn_valid() is an expensive operation on > s390 that needs a high double digit amount of CPU cycles. > Nick Piggin suggested that pfn_valid() involves an array lookup on > systems with sparsemem, and therefore is an expensive operation there > too. > The check looks like a clear debug thing to me, it should never trigger > on regular kernels. And if a pte is created for an invalid pfn, we'll > find out once the memory gets accessed later on anyway. Please consider > inclusion of this patch into mm. > > Signed-off-by: Carsten Otte <cotte@de.ibm.com> Wow, that's a big performance hit for a few instructions ;) I haven't seen it to be quite so expensive on x86, but it definitely is not zero cost, especially with NUMA kernels. Thanks for getting those numbers. I posted a version which got rid of that big comment block too, but no feedback as yet. http://marc.info/?l=linux-arch&m=120046068604222&w=2 The one actual upside of this code is that if there is pte corruption detected, the failure should be a little more graceful... but there is also lots of pte corruption that could go undetected and cause much worse problems anyway so I don't feel it is something that needs to be turned on in production kernels. It could be a good debugging aid to mm/ or device driver writers though. Anyway, again I've cc'ed Hugh, because he nacked this same patch a while back. So let's try to get him on board before merging anything. If we get an ack, why not send this upstream for 2.6.24? Those s390 numbers are pretty insane. > --- > Index: linux-2.6/mm/memory.c > =================================================================== > --- linux-2.6.orig/mm/memory.c > +++ linux-2.6/mm/memory.c > @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar > return NULL; > } > > +#ifdef CONFIG_DEBUG_VM > /* > * Add some anal sanity checks for now. Eventually, > * we should just do "return pfn_to_page(pfn)", but > @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar > print_bad_pte(vma, pte, addr); > return NULL; > } > +#endif > > /* > * NOTE! We still have PageReserved() pages in the page > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-16 23:45 ` Nick Piggin @ 2008-01-17 0:10 ` Andrew Morton 2008-01-17 0:27 ` Nick Piggin 2008-01-18 20:56 ` Hugh Dickins 2008-01-17 9:53 ` Martin Schwidefsky 1 sibling, 2 replies; 12+ messages in thread From: Andrew Morton @ 2008-01-17 0:10 UTC (permalink / raw) To: Nick Piggin Cc: Carsten Otte, Linux Memory Management List, schwidefsky, holger.wolf, Hugh Dickins, Linus Torvalds On Thu, 17 Jan 2008 00:45:40 +0100 Nick Piggin <npiggin@suse.de> wrote: > On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote: > > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page > > that verifies that a pfn is valid. This patch increases performance of > > the page fault microbenchmark in lmbench by 13% and overall dbench > > performance by 7% on s390x. pfn_valid() is an expensive operation on > > s390 that needs a high double digit amount of CPU cycles. > > Nick Piggin suggested that pfn_valid() involves an array lookup on > > systems with sparsemem, and therefore is an expensive operation there > > too. > > The check looks like a clear debug thing to me, it should never trigger > > on regular kernels. And if a pte is created for an invalid pfn, we'll > > find out once the memory gets accessed later on anyway. Please consider > > inclusion of this patch into mm. > > > > Signed-off-by: Carsten Otte <cotte@de.ibm.com> > > Wow, that's a big performance hit for a few instructions ;) > I haven't seen it to be quite so expensive on x86, but it definitely is > not zero cost, especially with NUMA kernels. Thanks for getting those > numbers. > > I posted a version which got rid of that big comment block too, but > no feedback as yet. > > http://marc.info/?l=linux-arch&m=120046068604222&w=2 > > The one actual upside of this code is that if there is pte corruption > detected, the failure should be a little more graceful... but there > is also lots of pte corruption that could go undetected and cause much > worse problems anyway so I don't feel it is something that needs to > be turned on in production kernels. It could be a good debugging aid > to mm/ or device driver writers though. > > Anyway, again I've cc'ed Hugh, because he nacked this same patch a > while back. So let's try to get him on board before merging anything. > > If we get an ack, why not send this upstream for 2.6.24? Those s390 > numbers are pretty insane. I intend to merge this into 2.6.24. > > --- > > Index: linux-2.6/mm/memory.c > > =================================================================== > > --- linux-2.6.orig/mm/memory.c > > +++ linux-2.6/mm/memory.c > > @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar > > return NULL; > > } > > > > +#ifdef CONFIG_DEBUG_VM > > /* > > * Add some anal sanity checks for now. Eventually, > > * we should just do "return pfn_to_page(pfn)", but > > @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar > > print_bad_pte(vma, pte, addr); > > return NULL; > > } > > +#endif > > > > /* > > * NOTE! We still have PageReserved() pages in the page > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-17 0:10 ` Andrew Morton @ 2008-01-17 0:27 ` Nick Piggin 2008-01-18 20:56 ` Hugh Dickins 1 sibling, 0 replies; 12+ messages in thread From: Nick Piggin @ 2008-01-17 0:27 UTC (permalink / raw) To: Andrew Morton Cc: Carsten Otte, Linux Memory Management List, schwidefsky, holger.wolf, Hugh Dickins, Linus Torvalds On Wed, Jan 16, 2008 at 04:10:21PM -0800, Andrew Morton wrote: > On Thu, 17 Jan 2008 00:45:40 +0100 Nick Piggin <npiggin@suse.de> wrote: > > > On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote: > > > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page > > > that verifies that a pfn is valid. This patch increases performance of > > > the page fault microbenchmark in lmbench by 13% and overall dbench > > > performance by 7% on s390x. pfn_valid() is an expensive operation on > > > s390 that needs a high double digit amount of CPU cycles. > > > Nick Piggin suggested that pfn_valid() involves an array lookup on > > > systems with sparsemem, and therefore is an expensive operation there > > > too. > > > The check looks like a clear debug thing to me, it should never trigger > > > on regular kernels. And if a pte is created for an invalid pfn, we'll > > > find out once the memory gets accessed later on anyway. Please consider > > > inclusion of this patch into mm. > > > > > > Signed-off-by: Carsten Otte <cotte@de.ibm.com> > > > > Wow, that's a big performance hit for a few instructions ;) > > I haven't seen it to be quite so expensive on x86, but it definitely is > > not zero cost, especially with NUMA kernels. Thanks for getting those > > numbers. > > > > I posted a version which got rid of that big comment block too, but > > no feedback as yet. > > > > http://marc.info/?l=linux-arch&m=120046068604222&w=2 > > > > The one actual upside of this code is that if there is pte corruption > > detected, the failure should be a little more graceful... but there > > is also lots of pte corruption that could go undetected and cause much > > worse problems anyway so I don't feel it is something that needs to > > be turned on in production kernels. It could be a good debugging aid > > to mm/ or device driver writers though. > > > > Anyway, again I've cc'ed Hugh, because he nacked this same patch a > > while back. So let's try to get him on board before merging anything. > > > > If we get an ack, why not send this upstream for 2.6.24? Those s390 > > numbers are pretty insane. > > I intend to merge this into 2.6.24. Good. If it helps, Acked-by: Nick Piggin <npiggin@suse.de> > > > > --- > > > Index: linux-2.6/mm/memory.c > > > =================================================================== > > > --- linux-2.6.orig/mm/memory.c > > > +++ linux-2.6/mm/memory.c > > > @@ -392,6 +392,7 @@ struct page *vm_normal_page(struct vm_ar > > > return NULL; > > > } > > > > > > +#ifdef CONFIG_DEBUG_VM > > > /* > > > * Add some anal sanity checks for now. Eventually, > > > * we should just do "return pfn_to_page(pfn)", but > > > @@ -402,6 +403,7 @@ struct page *vm_normal_page(struct vm_ar > > > print_bad_pte(vma, pte, addr); > > > return NULL; > > > } > > > +#endif > > > > > > /* > > > * NOTE! We still have PageReserved() pages in the page > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-17 0:10 ` Andrew Morton 2008-01-17 0:27 ` Nick Piggin @ 2008-01-18 20:56 ` Hugh Dickins 2008-01-21 9:45 ` Carsten Otte 1 sibling, 1 reply; 12+ messages in thread From: Hugh Dickins @ 2008-01-18 20:56 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Carsten Otte, Linux Memory Management List, schwidefsky, holger.wolf, Linus Torvalds On Wed, 16 Jan 2008, Andrew Morton wrote: > On Thu, 17 Jan 2008 00:45:40 +0100 Nick Piggin <npiggin@suse.de> wrote: > > > > The one actual upside of this code is that if there is pte corruption > > detected, the failure should be a little more graceful... but there > > is also lots of pte corruption that could go undetected and cause much > > worse problems anyway so I don't feel it is something that needs to > > be turned on in production kernels. It could be a good debugging aid > > to mm/ or device driver writers though. > > > > Anyway, again I've cc'ed Hugh, because he nacked this same patch a > > while back. So let's try to get him on board before merging anything. Thanks, Nick. > > > > If we get an ack, why not send this upstream for 2.6.24? Those s390 > > numbers are pretty insane. > > I intend to merge this into 2.6.24. And so you have already, commit 9723198c219f3546982cb469e5aed26e68399055 I'm very surprised that's considered suitable material for post-rc8. We know very well that page tables are almost honeytraps for random corruption, and we've long adhered to the practice of preceding any pfn_to_page by a pfn_valid check, to make sure we don't veer off the end of the mem_map (or whatever), causing our page manipulations to corrupt unrelated memory with less foreseeable consequences. That good practice dates from long before some of the checking got separated out into vm_normal_page. That said, perhaps I'm over-reacting: I can remember countless rmap.c BUGs or Eeeks, and countless Bad page states, and quite a number of swap_free errors, all usually symptoms of similar corruption; yet not a single instance of that Bad pte message which Nick was good enough to add a couple of years back. Hmm, it's my own memory that's bad: a grep through old mailboxes does show them; though I've not stopped to look, to see whether perhaps they can all be argued away on some good grounds. Well: that patch still gets my Nack, but I guess I'm too late. If s390 pagetables are better protected than x86 ones, add an s390 ifdef? Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-18 20:56 ` Hugh Dickins @ 2008-01-21 9:45 ` Carsten Otte 2008-01-22 22:35 ` Hugh Dickins 0 siblings, 1 reply; 12+ messages in thread From: Carsten Otte @ 2008-01-21 9:45 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Nick Piggin, Linux Memory Management List, mschwid2, Holger Wolf, Linus Torvalds Hugh Dickins wrote: > And so you have already, commit 9723198c219f3546982cb469e5aed26e68399055 > > I'm very surprised that's considered suitable material for post-rc8. > > We know very well that page tables are almost honeytraps for random > corruption, and we've long adhered to the practice of preceding any > pfn_to_page by a pfn_valid check, to make sure we don't veer off the > end of the mem_map (or whatever), causing our page manipulations to > corrupt unrelated memory with less foreseeable consequences. That > good practice dates from long before some of the checking got > separated out into vm_normal_page. > > That said, perhaps I'm over-reacting: I can remember countless rmap.c > BUGs or Eeeks, and countless Bad page states, and quite a number of > swap_free errors, all usually symptoms of similar corruption; yet > not a single instance of that Bad pte message which Nick was good > enough to add a couple of years back. > > Hmm, it's my own memory that's bad: a grep through old mailboxes > does show them; though I've not stopped to look, to see whether > perhaps they can all be argued away on some good grounds. > > Well: that patch still gets my Nack, but I guess I'm too late. If > s390 pagetables are better protected than x86 ones, add an s390 ifdef? The alternative would be to just make #define pfn_valid(pfn) (1) on s390. That would also get _us_ rid of the check while others do benefit. We would trap access to mem_map beyond its limits because we don't have a kernel mapping there. For us, it would not silently corrupt things but crash proper. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-21 9:45 ` Carsten Otte @ 2008-01-22 22:35 ` Hugh Dickins 2008-01-22 23:39 ` Nick Piggin 2008-01-23 9:14 ` Martin Schwidefsky 0 siblings, 2 replies; 12+ messages in thread From: Hugh Dickins @ 2008-01-22 22:35 UTC (permalink / raw) To: carsteno Cc: Andrew Morton, Nick Piggin, Linux Memory Management List, mschwid2, Holger Wolf, Linus Torvalds On Mon, 21 Jan 2008, Carsten Otte wrote: > Hugh Dickins wrote: > > > > Well: that patch still gets my Nack, but I guess I'm too late. If > > s390 pagetables are better protected than x86 ones, add an s390 ifdef? > > The alternative would be to just make > #define pfn_valid(pfn) (1) > on s390. That would also get _us_ rid of the check while others do benefit. We > would trap access to mem_map beyond its limits because we don't have a kernel > mapping there. For us, it would not silently corrupt things but crash proper. Whilst I quite like the sound of that, I wonder whether it's safe to change s390's pfn_valid (rather surprisingly) for all its users. And note that nobody but me has voiced any regret at the loss of the check. My guess is we let it rest for now, and reconsider if a case comes up later which would have got caught by the check (but the problem is that such a case is much harder to identify than it was). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-22 22:35 ` Hugh Dickins @ 2008-01-22 23:39 ` Nick Piggin 2008-01-23 0:02 ` Hugh Dickins 2008-01-23 9:14 ` Martin Schwidefsky 1 sibling, 1 reply; 12+ messages in thread From: Nick Piggin @ 2008-01-22 23:39 UTC (permalink / raw) To: Hugh Dickins Cc: carsteno, Andrew Morton, Linux Memory Management List, mschwid2, Holger Wolf, Linus Torvalds On Tue, Jan 22, 2008 at 10:35:17PM +0000, Hugh Dickins wrote: > On Mon, 21 Jan 2008, Carsten Otte wrote: > > Hugh Dickins wrote: > > > > > > Well: that patch still gets my Nack, but I guess I'm too late. If > > > s390 pagetables are better protected than x86 ones, add an s390 ifdef? > > > > The alternative would be to just make > > #define pfn_valid(pfn) (1) > > on s390. That would also get _us_ rid of the check while others do benefit. We > > would trap access to mem_map beyond its limits because we don't have a kernel > > mapping there. For us, it would not silently corrupt things but crash proper. > > Whilst I quite like the sound of that, I wonder whether it's safe to > change s390's pfn_valid (rather surprisingly) for all its users. And > note that nobody but me has voiced any regret at the loss of the check. I did want to get rid of the test, but not in a "sneak it in before he notices" way. So I am disappointed it was merged before you replied. > My guess is we let it rest for now, and reconsider if a case comes up > later which would have got caught by the check (but the problem is that > such a case is much harder to identify than it was). The only cases I had imagined were repeatable things like a bug in pte manipulation somewhere, which will hopefully be caught with CONFIG_DEBUG_VM turned on. Are there many other cases where the test is useful? For hardware failures, I'd say not -- those just tend to waste developers time. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-22 23:39 ` Nick Piggin @ 2008-01-23 0:02 ` Hugh Dickins 0 siblings, 0 replies; 12+ messages in thread From: Hugh Dickins @ 2008-01-23 0:02 UTC (permalink / raw) To: Nick Piggin Cc: carsteno, Andrew Morton, Linux Memory Management List, mschwid2, Holger Wolf, Linus Torvalds On Wed, 23 Jan 2008, Nick Piggin wrote: > > I did want to get rid of the test, but not in a "sneak it in before he > notices" way. So I am disappointed it was merged before you replied. Not everybody can wait that indefinite interval for a response from me! (And, by the by, I'm not ignoring the many mails you've addressed to me or Cc'ed me in the last week or more; but some things are easier to think about and come to conclusion on than others. Take it as a compliment that your patches deserve consideration ;) > > My guess is we let it rest for now, and reconsider if a case comes up > > later which would have got caught by the check (but the problem is that > > such a case is much harder to identify than it was). > > The only cases I had imagined were repeatable things like a bug in pte > manipulation somewhere, which will hopefully be caught with > CONFIG_DEBUG_VM turned on. For things like that, repeatable occurrences from coding bugs, which should get caught before release: yes I agree, the CONFIG_DEBUG_VM would be entirely appropriate. > Are there many other cases where the test is useful? For hardware > failures, I'd say not -- those just tend to waste developers time. Bad RAM bitflips etc., or some subsystem corrupting random memory: those kind of things which so often end up as rmap.c Eeeks or Bad page states. Yes, a fair amount of developers time is wasted on these: which is precisely why they're better caught sooner (by a pfn_valid test in vm_normal_page) than later (by going on to corrupt other memory in fictitious "struct page" manipulations). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-22 22:35 ` Hugh Dickins 2008-01-22 23:39 ` Nick Piggin @ 2008-01-23 9:14 ` Martin Schwidefsky 1 sibling, 0 replies; 12+ messages in thread From: Martin Schwidefsky @ 2008-01-23 9:14 UTC (permalink / raw) To: Hugh Dickins Cc: carsteno, Andrew Morton, Nick Piggin, Linux Memory Management List, mschwid2, Holger Wolf, Linus Torvalds On Tue, 2008-01-22 at 22:35 +0000, Hugh Dickins wrote: > On Mon, 21 Jan 2008, Carsten Otte wrote: > > Hugh Dickins wrote: > > > > > > Well: that patch still gets my Nack, but I guess I'm too late. If > > > s390 pagetables are better protected than x86 ones, add an s390 ifdef? > > > > The alternative would be to just make > > #define pfn_valid(pfn) (1) > > on s390. That would also get _us_ rid of the check while others do benefit. We > > would trap access to mem_map beyond its limits because we don't have a kernel > > mapping there. For us, it would not silently corrupt things but crash proper. > > Whilst I quite like the sound of that, I wonder whether it's safe to > change s390's pfn_valid (rather surprisingly) for all its users. And > note that nobody but me has voiced any regret at the loss of the check. > My guess is we let it rest for now, and reconsider if a case comes up > later which would have got caught by the check (but the problem is that > such a case is much harder to identify than it was). Nick has said that pfn_valid as a primitive is supposed to return whether a pfn has a struct page behind it or not. If you follow the principle of least surprise it is not really an option to define pfn_valid as (1). So far the s390 method of using the kernel address space mapping to implement pfn_valid has been correct. The new code that allows to have memory areas without struct page changes things. All memory has to be added to the kernel address space to be able to do user copy. pfn_valid returns incorrect results for struct page less memory now. We are searching for a fast way to implement a correct pfn_valid, it should not use lists, it should not required a lock and it should not waste vast amounts of memory. To make things worse there is not really a limit to the maximum address, the full 64 bit address space can be used. That makes 2**52 pages. The best I could come up so far is a page table like scheme that uses 3 indirection levels and a bitfield at the end. Ugly, ugly.. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-16 23:45 ` Nick Piggin 2008-01-17 0:10 ` Andrew Morton @ 2008-01-17 9:53 ` Martin Schwidefsky 2008-01-18 4:09 ` Nick Piggin 1 sibling, 1 reply; 12+ messages in thread From: Martin Schwidefsky @ 2008-01-17 9:53 UTC (permalink / raw) To: Nick Piggin Cc: Carsten Otte, Andrew Morton, Linux Memory Management List, holger.wolf, Hugh Dickins, Linus Torvalds On Thu, 2008-01-17 at 00:45 +0100, Nick Piggin wrote: > On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote: > > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page > > that verifies that a pfn is valid. This patch increases performance of > > the page fault microbenchmark in lmbench by 13% and overall dbench > > performance by 7% on s390x. pfn_valid() is an expensive operation on > > s390 that needs a high double digit amount of CPU cycles. > > Nick Piggin suggested that pfn_valid() involves an array lookup on > > systems with sparsemem, and therefore is an expensive operation there > > too. > > The check looks like a clear debug thing to me, it should never trigger > > on regular kernels. And if a pte is created for an invalid pfn, we'll > > find out once the memory gets accessed later on anyway. Please consider > > inclusion of this patch into mm. > > > > Signed-off-by: Carsten Otte <cotte@de.ibm.com> > > Wow, that's a big performance hit for a few instructions ;) > I haven't seen it to be quite so expensive on x86, but it definitely is > not zero cost, especially with NUMA kernels. Thanks for getting those > numbers. These number have been a surprise. We knew that the LRA instruction we use in pfn_valid has a cost, but from the cycle count we did not expect that the difference in the minor fault benchmark would be 13%. Most probably a cache effect. I shortly discussed with Carsten what we should do with pfn_valid. One idea was to make it a nop - always return 1. The current implementation of pfn_valid uses the kernel address space mapping to decide if a page frame is valid. All available memory areas that fit into the 4TB kernel address space get mapped. If a page is mapped pfn_valid returns true. But what is the background of pfn_valid, what does it protect against? What is the exact semantics if pfn_valid returns true? From the name page-frame-number-valid you could argue that it should always return true if the number is smaller than 2**52. The number is valid, if there is accessible memory is another question. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch] #ifdef very expensive debug check in page fault path 2008-01-17 9:53 ` Martin Schwidefsky @ 2008-01-18 4:09 ` Nick Piggin 0 siblings, 0 replies; 12+ messages in thread From: Nick Piggin @ 2008-01-18 4:09 UTC (permalink / raw) To: Martin Schwidefsky Cc: Carsten Otte, Andrew Morton, Linux Memory Management List, holger.wolf, Hugh Dickins, Linus Torvalds On Thu, Jan 17, 2008 at 10:53:34AM +0100, Martin Schwidefsky wrote: > On Thu, 2008-01-17 at 00:45 +0100, Nick Piggin wrote: > > On Wed, Jan 16, 2008 at 07:01:28PM +0100, Carsten Otte wrote: > > > This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page > > > that verifies that a pfn is valid. This patch increases performance of > > > the page fault microbenchmark in lmbench by 13% and overall dbench > > > performance by 7% on s390x. pfn_valid() is an expensive operation on > > > s390 that needs a high double digit amount of CPU cycles. > > > Nick Piggin suggested that pfn_valid() involves an array lookup on > > > systems with sparsemem, and therefore is an expensive operation there > > > too. > > > The check looks like a clear debug thing to me, it should never trigger > > > on regular kernels. And if a pte is created for an invalid pfn, we'll > > > find out once the memory gets accessed later on anyway. Please consider > > > inclusion of this patch into mm. > > > > > > Signed-off-by: Carsten Otte <cotte@de.ibm.com> > > > > Wow, that's a big performance hit for a few instructions ;) > > I haven't seen it to be quite so expensive on x86, but it definitely is > > not zero cost, especially with NUMA kernels. Thanks for getting those > > numbers. > > These number have been a surprise. We knew that the LRA instruction we > use in pfn_valid has a cost, but from the cycle count we did not expect > that the difference in the minor fault benchmark would be 13%. Most > probably a cache effect. > > I shortly discussed with Carsten what we should do with pfn_valid. One > idea was to make it a nop - always return 1. The current implementation > of pfn_valid uses the kernel address space mapping to decide if a page > frame is valid. All available memory areas that fit into the 4TB kernel > address space get mapped. If a page is mapped pfn_valid returns true. > But what is the background of pfn_valid, what does it protect against? > What is the exact semantics if pfn_valid returns true? From the name > page-frame-number-valid you could argue that it should always return > true if the number is smaller than 2**52. The number is valid, if there > is accessible memory is another question. It is supposed to return true if there is a valid struct page for the pfn AFAIK. s390 can probably get around without implementing it like that because it is mostly only used in memory mapping setup code (once we have removed it from vm_normal_page). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-01-23 9:14 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-01-16 18:01 [patch] #ifdef very expensive debug check in page fault path Carsten Otte 2008-01-16 23:45 ` Nick Piggin 2008-01-17 0:10 ` Andrew Morton 2008-01-17 0:27 ` Nick Piggin 2008-01-18 20:56 ` Hugh Dickins 2008-01-21 9:45 ` Carsten Otte 2008-01-22 22:35 ` Hugh Dickins 2008-01-22 23:39 ` Nick Piggin 2008-01-23 0:02 ` Hugh Dickins 2008-01-23 9:14 ` Martin Schwidefsky 2008-01-17 9:53 ` Martin Schwidefsky 2008-01-18 4:09 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).