* vmalloc performance
From: Steven Whitehouse @ 2010-04-12 16:27 UTC
To: linux-mm, linux-kernel

Hi,

I've noticed that vmalloc seems to be rather slow. I wrote a test kernel
module to track down what was going wrong. The kernel module does one
million vmalloc/touch memory/vfree operations in a loop and prints out
how long that takes.

The source of the test kernel module can be found as an attachment to
this bz: https://bugzilla.redhat.com/show_bug.cgi?id=581459

When this module is run on my x86_64, 8 core, 12GB machine, on an
otherwise idle system, I get the following results:

vmalloc took 148798983 us
vmalloc took 151664529 us
vmalloc took 152416398 us
vmalloc took 151837733 us

After applying the two line patch (see the same bz), which disables the
delayed removal of the structures (apparently intended to improve
performance in the SMP case by reducing TLB flushes across cpus), I get
the following results:

vmalloc took 15363634 us
vmalloc took 15358026 us
vmalloc took 15240955 us
vmalloc took 15402302 us

So that's a speed up of around 10x, which isn't too bad. The question is
whether it is possible to come to a compromise which retains the
benefits of the delayed TLB flushing code but reduces the overhead for
other users. My two line patch basically disables the delay by forcing a
purge on each and every vfree.

What is the correct way to fix this, I wonder?

Steve.
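[For reference, the core of such a test module is essentially the loop
below. This is only a minimal sketch written from the description above,
not the actual attachment from the bugzilla; the iteration count,
allocation size and symbol names are illustrative.]

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/time.h>

static int __init vmalloc_test_init(void)
{
	struct timeval start, end;
	unsigned long i;
	long usec;
	void *p;

	do_gettimeofday(&start);
	for (i = 0; i < 1000000; i++) {
		p = vmalloc(PAGE_SIZE);
		if (!p)
			return -ENOMEM;
		memset(p, 0, PAGE_SIZE);	/* touch the memory */
		vfree(p);
	}
	do_gettimeofday(&end);

	usec = (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_usec - start.tv_usec);
	printk(KERN_INFO "vmalloc took %ld us\n", usec);
	return 0;
}

static void __exit vmalloc_test_exit(void)
{
}

module_init(vmalloc_test_init);
module_exit(vmalloc_test_exit);
MODULE_LICENSE("GPL");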
* Re: vmalloc performance
From: Steven Whitehouse @ 2010-04-14 12:49 UTC
To: linux-mm; Cc: linux-kernel

Since this didn't attract much interest the first time around, and at
the risk of appearing to be talking to myself, here is the patch from
the bugzilla to better illustrate the issue:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ae00746..63c8178 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -605,8 +605,7 @@ static void free_unmap_vmap_area_noflush(struct vmap_area *va)
 {
 	va->flags |= VM_LAZY_FREE;
 	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
-	if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
-		try_purge_vmap_area_lazy();
+	try_purge_vmap_area_lazy();
 }

 /*

Steve.

On Mon, 2010-04-12 at 17:27 +0100, Steven Whitehouse wrote:
> Hi,
>
> I've noticed that vmalloc seems to be rather slow. I wrote a test kernel
> module to track down what was going wrong. The kernel module does one
> million vmalloc/touch memory/vfree operations in a loop and prints out
> how long that takes.
> [...]
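[As I read it, the test being removed above is what batches the lazy
purge: normally try_purge_vmap_area_lazy() only runs once vmap_lazy_nr
has grown beyond lazy_max_pages(), so the kernel-range TLB flush it does
is amortised over many frees. With the test gone, the purge and its TLB
flush run on every single vfree, which is why this two-liner is a way of
illustrating the cost rather than a proposed fix.]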
* Re: vmalloc performance
From: Steven Whitehouse @ 2010-04-14 14:24 UTC
To: linux-mm; Cc: linux-kernel

Hi,

Also, what lock should be protecting this code:

	va->flags |= VM_LAZY_FREE;
	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);

in free_unmap_vmap_area_noflush()? It seems that if
__purge_vmap_area_lazy() runs between the two statements above, the
number of pages contained in vmap_lazy_nr will be incorrect. Maybe the
two statements should just be reversed? I can't see any reason that the
flag assignment would be atomic either. In recent tests, including the
patch below, the following has been reported to me:

Apr 13 17:19:57 bigi kernel: ------------[ cut here ]------------
Apr 13 17:19:57 bigi kernel: kernel BUG at mm/vmalloc.c:559!
Apr 13 17:19:57 bigi kernel: invalid opcode: 0000 [#1] SMP

etc. as the result of a vfree(), and I think that is probably the reason
for it. I'll try and verify whether that really is the issue, but it
looks highly probable at the moment,

Steve.

On Wed, 2010-04-14 at 13:49 +0100, Steven Whitehouse wrote:
> Since this didn't attract much interest the first time around, and at
> the risk of appearing to be talking to myself, here is the patch from
> the bugzilla to better illustrate the issue:
> [...]
* Re: vmalloc performance
From: Minchan Kim @ 2010-04-14 15:12 UTC
To: Steven Whitehouse; Cc: linux-mm, linux-kernel

On Wed, Apr 14, 2010 at 11:24 PM, Steven Whitehouse <steve@chygwyn.com> wrote:
> Hi,
>
> Also, what lock should be protecting this code:
>
>	va->flags |= VM_LAZY_FREE;
>	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
>
> in free_unmap_vmap_area_noflush()? It seems that if
> __purge_vmap_area_lazy runs between the two statements above, the
> number of pages contained in vmap_lazy_nr will be incorrect.
> [...]

It was already fixed.

https://patchwork.kernel.org/patch/89783/

Thanks.

-- 
Kind regards,
Minchan Kim
* Re: vmalloc performance
From: Minchan Kim @ 2010-04-14 15:13 UTC
To: Steven Whitehouse; Cc: linux-mm, linux-kernel, Nick Piggin

Cced Nick. He's Mr. Vmalloc.

On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> Since this didn't attract much interest the first time around, and at
> the risk of appearing to be talking to myself, here is the patch from
> the bugzilla to better illustrate the issue:
> [...]
>
> What is the correct way to fix this, I wonder?
>
> Steve.

-- 
Kind regards,
Minchan Kim
* Re: vmalloc performance
From: Minchan Kim @ 2010-04-14 16:35 UTC
To: Steven Whitehouse; Cc: linux-mm, linux-kernel, Nick Piggin

On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote:
> Cced Nick. He's Mr. Vmalloc.
>
> On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> > Since this didn't attract much interest the first time around, and at
> > the risk of appearing to be talking to myself, here is the patch from
> > the bugzilla to better illustrate the issue:
> > [...]

In my case (2 core, 2G memory system), it is 50300661 vs 11569357 us, so
the patch improves things about 4 times.

That would result from the larger value of lazy_max_pages. It prevents
many vmap_areas from being freed, so alloc_vmap_area takes a long time to
find a new vmap_area (ie, the rbtree lookup).

How about calling purge_vmap_area_lazy in the middle of the loop in
alloc_vmap_area if the rbtree lookup gets too long?

BTW, Steve, is this a real issue or just a test? I doubt such a vmalloc
bomb workload is real.

-- 
Kind regards,
Minchan Kim
* Re: vmalloc performance
From: Steven Whitehouse @ 2010-04-15 8:33 UTC
To: Minchan Kim; Cc: linux-mm, linux-kernel, Nick Piggin

Hi,

On Thu, 2010-04-15 at 01:35 +0900, Minchan Kim wrote:
> On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote:
> > Cced Nick. He's Mr. Vmalloc.
> > [...]
>
> In my case (2 core, 2G memory system), it is 50300661 vs 11569357 us, so
> the patch improves things about 4 times.
>
Looking at the code, it seems that the limit, against which my patch
removes a test, scales according to the number of cpu cores. So with
more cores, I'd expect the difference to be greater. I have a feeling
that the original reporter had a greater number than the 8 of my test
machine.

> That would result from the larger value of lazy_max_pages. It prevents
> many vmap_areas from being freed, so alloc_vmap_area takes a long time
> to find a new vmap_area (ie, the rbtree lookup).
>
> How about calling purge_vmap_area_lazy in the middle of the loop in
> alloc_vmap_area if the rbtree lookup gets too long?
>
That may be a good solution - I'm happy to test any patches, but my
worry is that any change here might result in a regression in whatever
workload the lazy purge code was originally designed to improve. Is
there any way to test that, I wonder?

> BTW, Steve, is this a real issue or just a test? I doubt such a vmalloc
> bomb workload is real.

Well, the answer is both yes and no :-) This is how I came across the
issue. I received a report that GFS2 performance had regressed in recent
kernels in relation to a test which basically fires lots of requests at
it via NFS. The reporter of this problem gave me two bits of
information: firstly, that by eliminating all readdir calls from the
test, the regression is never seen; and secondly, that oprofile showed
three functions related to vmalloc (rb_next, find_vmap_area and
alloc_vmap_area, in that order) taking between them about 60% of the
total cpu time.

Now, between the two kernel versions being tested probably not a single
line of GFS2 readdir code has changed, since that code has been stable
for a fair while now. So my attention turned to vmalloc; even though it
would be unusual for a filesystem to be limited by cpu, it did seem odd
that it was so high in the oprofile results. I should also mention at
this point that the backing device for the fs is a very high performance
disk array, which increases the chances of cpu being a limiting factor.

Anyway, having looked briefly at the vmalloc code, I spotted that there
was a cache of objects which might have an effect, so I wrote the test
kernel module in the bz to test the two line patch, just to see what
effect it had.

Since I got a good speed up, I sent the patch to the reporter, who was
able to get further on the NFS/GFS2 tests before running into the oops.
I hadn't spotted that there had been a fix for that bug in the meantime
though, so I'll get that applied. Thanks for pointing it out.

We'll try and get some more testing done in order to prove whether the
regression we are seeing in GFS2 readdir performance is entirely due to
this factor, or only partially. I think it does have a measurable effect
though, even if it is not the whole story,

Steve.
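[For context, the limit referred to here is lazy_max_pages() in
mm/vmalloc.c. In kernels of this vintage it looks roughly like the
snippet below (quoted from memory, so treat the details as approximate):
the purge threshold grows with the log of the number of online cpus,
which is why a bigger machine lets far more stale vmap_areas pile up in
the rbtree before a purge happens.]

/* mm/vmalloc.c (approximate) */
static unsigned long lazy_max_pages(void)
{
	unsigned int log;

	log = fls(num_online_cpus());

	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}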
* Re: vmalloc performance
From: Minchan Kim @ 2010-04-15 16:51 UTC
To: Steven Whitehouse; Cc: linux-mm, linux-kernel, Nick Piggin

On Thu, 2010-04-15 at 09:33 +0100, Steven Whitehouse wrote:
> Hi,
>
> Looking at the code, it seems that the limit, against which my patch
> removes a test, scales according to the number of cpu cores. So with
> more cores, I'd expect the difference to be greater. I have a feeling
> that the original reporter had a greater number than the 8 of my test
> machine.
> [...]
>
> We'll try and get some more testing done in order to prove whether the
> regression we are seeing in GFS2 readdir performance is entirely due to
> this factor, or only partially. I think it does have a measurable effect
> though, even if it is not the whole story,
>
> Steve.

Thanks for the explanation. It seems to be a real issue.

I tested to see the effect of flushing during the rbtree search.

Before I applied your patch, the time was 50300661 us.
After your patch, 11569357 us.
After my debug patch, 6104875 us.

I tested it while changing the threshold value:

threshold	time
1000		13892809
500		 9062110
200		 6714172
100		 6104875
50		 6758316

And perf shows smp_call_function at a very low percentage.

In my case, 100 is best.

I have no server machine, so I can't test the TLB effect.
Could you measure it with this patch? Maybe you can see the TLB effect
with perf record, as Nick did (you can refer to commit db64fe02). If you
can't see sn_send_IPI_phys and smp_call_function on IA64, maybe the TLB
issue isn't a big problem.

P.S. I am not a full time developer, just a hobbyist, so I can't write
patches and test in the office. Please understand the lazy response.
:) Here is just for debug patch. diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8686b0f..ef6beb2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -933,6 +933,9 @@ static struct ctl_table kern_table[] = { { } }; +extern unsigned long max_lookup_count; +extern unsigned long threshold_lookup_count; + static struct ctl_table vm_table[] = { { .procname = "overcommit_memory", @@ -1251,6 +1254,22 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = scan_unevictable_handler, }, + { + .procname = "max_lookup_count", + .data = &max_lookup_count, + .maxlen = sizeof(max_lookup_count), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + }, + + { + .procname = "threshold_lookup_count", + .data = &threshold_lookup_count, + .maxlen = sizeof(threshold_lookup_count), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + }, + #ifdef CONFIG_MEMORY_FAILURE { .procname = "memory_failure_early_kill", diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 7abf423..95a1390 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -319,6 +319,9 @@ static void __insert_vmap_area(struct vmap_area *va) static void purge_vmap_area_lazy(void); +unsigned long lookup_count; +unsigned long max_lookup_count; +unsigned long threshold_lookup_count = 100000; /* * Allocate a region of KVA of the specified size and alignment, within the * vstart and vend. @@ -332,6 +335,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, struct rb_node *n; unsigned long addr; int purged = 0; + int nlookup = 0; BUG_ON(!size); BUG_ON(size & ~PAGE_MASK); @@ -344,6 +348,10 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, retry: addr = ALIGN(vstart, align); + if (lookup_count > threshold_lookup_count) { + purge_vmap_area_lazy(); + lookup_count = 0; + } spin_lock(&vmap_area_lock); if (addr + size - 1 < addr) goto overflow; @@ -364,6 +372,7 @@ retry: first = tmp; n = n->rb_right; } + nlookup++; } while (n); if (!first) @@ -371,6 +380,7 @@ retry: if (first->va_end < addr) { n = rb_next(&first->rb_node); + nlookup++; if (n) first = rb_entry(n, struct vmap_area, rb_node); else @@ -387,6 +397,8 @@ retry: first = rb_entry(n, struct vmap_area, rb_node); else goto found; + nlookup++; + } } found: @@ -396,6 +408,7 @@ overflow: if (!purged) { purge_vmap_area_lazy(); purged = 1; + nlookup = 0; goto retry; } if (printk_ratelimit()) @@ -412,6 +425,9 @@ overflow: va->va_end = addr + size; va->flags = 0; __insert_vmap_area(va); + if (nlookup > max_lookup_count) + max_lookup_count = nlookup; + lookup_count = nlookup; spin_unlock(&vmap_area_lock); return va; -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
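[With the debug patch above applied, the two knobs it registers appear as
/proc/sys/vm/threshold_lookup_count and /proc/sys/vm/max_lookup_count, so
the purge threshold can be changed at runtime and the longest rbtree
search seen so far can be read back; the threshold figures in the table
above and in the replies below were presumably gathered that way.]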
* Re: vmalloc performance
From: Steven Whitehouse @ 2010-04-16 14:10 UTC
To: Minchan Kim; Cc: linux-mm, linux-kernel, Nick Piggin

Hi,

On Fri, 2010-04-16 at 01:51 +0900, Minchan Kim wrote:
[snip]
> Thanks for the explanation. It seems to be a real issue.
>
> I tested to see the effect of flushing during the rbtree search.
>
> Before I applied your patch, the time was 50300661 us.
> After your patch, 11569357 us.
> After my debug patch, 6104875 us.
>
> I tested it while changing the threshold value:
>
> threshold	time
> 1000		13892809
> 500		 9062110
> 200		 6714172
> 100		 6104875
> 50		 6758316

My results show:

threshold	time
100000		139309948
1000		 13555878
500		 10069801
200		  7813667
100		 18523172
50		 18546256

> And perf shows smp_call_function at a very low percentage.
>
> In my case, 100 is best.
>
Looks like 200 for me.

I think you meant to use the non _minmax version of proc_dointvec too?
Although it doesn't make any difference for this basic test.

The original reporter also has 8 cpu cores, I've discovered. In his case
they are divided across 4 physical cpus, whereas mine are divided across
2, but I think that makes no real difference in this case.

I'll try and get some further test results ready shortly. Many thanks
for all your efforts in tracking this down,

Steve.
* Re: vmalloc performance
From: Minchan Kim @ 2010-04-18 15:14 UTC
To: Steven Whitehouse; Cc: linux-mm, linux-kernel, Nick Piggin

On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote:
> Hi,
>
> My results show:
>
> threshold	time
> 100000		139309948
> 1000		 13555878
> 500		 10069801
> 200		  7813667
> 100		 18523172
> 50		 18546256
>
> Looks like 200 for me.
>
> I think you meant to use the non _minmax version of proc_dointvec too?

Yes. My fault :)

> Although it doesn't make any difference for this basic test.
>
> The original reporter also has 8 cpu cores, I've discovered.
> [...]
>
> I'll try and get some further test results ready shortly. Many thanks
> for all your efforts in tracking this down,
>
> Steve.

I vote for a "free area cache".

I tested the patch below on my machine. The results are as follows:

1) vanilla
		elapsed time	# of rbtree searches
vmalloc took	49121724 us	5535
vmalloc took	50675245 us	5535
vmalloc took	48987711 us	5535
vmalloc took	54232479 us	5535
vmalloc took	50258117 us	5535
vmalloc took	49424859 us	5535

2) Steven's patch
		elapsed time	# of rbtree searches
vmalloc took	11363341 us	62
vmalloc took	12798868 us	62
vmalloc took	13247942 us	62
vmalloc took	11434647 us	62
vmalloc took	13221733 us	62
vmalloc took	12134019 us	62

3) my patch (vmap cache)
		elapsed time	# of rbtree searches
vmalloc took	5159893 us	8
vmalloc took	5124434 us	8
vmalloc took	5123291 us	8
vmalloc took	5145396 us	12
vmalloc took	5163605 us	8
vmalloc took	5945663 us	8

My version is about 9 times faster than vanilla.

Steve, could you measure this patch with your test?
(Sorry, you may have to apply the patch by hand, because it is based on
mmotm-2010-04-05-16-09.)

Nick, what do you think about the "free area cache" approach?

In this version, I don't consider the last hole or backward cache
movement (which would be like mmap's cached_hole_size). That's because I
want to flush the freed vmap_areas intentionally if we reach vend. It
makes flushes more frequent than before, but that's the trade-off; and
vmalloc isn't as performance critical as mmap, so I think it's enough.

If you don't oppose it, I will repost a formal patch without the
debug-related code.
--- kernel/sysctl.c | 9 +++++++++ mm/vmalloc.c | 55 +++++++++++++++++++++++++++++++++++++++++-------------- 2 files changed, 50 insertions(+), 14 deletions(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8686b0f..20d7bfd 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -933,7 +933,16 @@ static struct ctl_table kern_table[] = { { } }; +extern unsigned long max_lookup_count; + static struct ctl_table vm_table[] = { + { + .procname = "max_lookup_count", + .data = &max_lookup_count, + .maxlen = sizeof(max_lookup_count), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { .procname = "overcommit_memory", .data = &sysctl_overcommit_memory, diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ae00746..dac3223 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -263,6 +263,7 @@ struct vmap_area { static DEFINE_SPINLOCK(vmap_area_lock); static struct rb_root vmap_area_root = RB_ROOT; +static struct rb_node *free_vmap_cache; static LIST_HEAD(vmap_area_list); static unsigned long vmap_area_pcpu_hole; @@ -319,6 +320,7 @@ static void __insert_vmap_area(struct vmap_area *va) static void purge_vmap_area_lazy(void); +unsigned long max_lookup_count; /* * Allocate a region of KVA of the specified size and alignment, within the * vstart and vend. @@ -332,6 +334,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, struct rb_node *n; unsigned long addr; int purged = 0; + int lookup_cache = 0; + struct vmap_area *first; + unsigned int nlookup = 0; BUG_ON(!size); BUG_ON(size & ~PAGE_MASK); @@ -342,35 +347,50 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, return ERR_PTR(-ENOMEM); retry: + first = NULL; addr = ALIGN(vstart, align); spin_lock(&vmap_area_lock); if (addr + size - 1 < addr) goto overflow; - /* XXX: could have a last_hole cache */ n = vmap_area_root.rb_node; - if (n) { - struct vmap_area *first = NULL; + if (free_vmap_cache && !purged) { + struct vmap_area *cache; + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + if (cache->va_start >= addr && cache->va_end < vend) { + lookup_cache = 1; + n = free_vmap_cache; + } + } - do { - struct vmap_area *tmp; - tmp = rb_entry(n, struct vmap_area, rb_node); - if (tmp->va_end >= addr) { - if (!first && tmp->va_start < addr + size) + if (n) { + if (!lookup_cache) { + do { + struct vmap_area *tmp; + tmp = rb_entry(n, struct vmap_area, rb_node); + if (tmp->va_end >= addr) { + if (!first && tmp->va_start < addr + size) + first = tmp; + n = n->rb_left; + } else { first = tmp; - n = n->rb_left; - } else { - first = tmp; - n = n->rb_right; - } - } while (n); + n = n->rb_right; + } + nlookup++; + } while (n); + } + else { + first = rb_entry(n, struct vmap_area, rb_node); + addr = first->va_start; + } if (!first) goto found; if (first->va_end < addr) { n = rb_next(&first->rb_node); + nlookup++; if (n) first = rb_entry(n, struct vmap_area, rb_node); else @@ -383,6 +403,7 @@ retry: goto overflow; n = rb_next(&first->rb_node); + nlookup++; if (n) first = rb_entry(n, struct vmap_area, rb_node); else @@ -396,6 +417,7 @@ overflow: if (!purged) { purge_vmap_area_lazy(); purged = 1; + lookup_cache = 0; goto retry; } if (printk_ratelimit()) @@ -412,6 +434,9 @@ overflow: va->va_end = addr + size; va->flags = 0; __insert_vmap_area(va); + free_vmap_cache = &va->rb_node; + if (max_lookup_count < nlookup) + max_lookup_count = nlookup; spin_unlock(&vmap_area_lock); return va; @@ -426,7 +451,9 @@ static void rcu_free_va(struct rcu_head *head) static void __free_vmap_area(struct vmap_area *va) { + struct rb_node *prev; 
BUG_ON(RB_EMPTY_NODE(&va->rb_node)); + free_vmap_cache = rb_prev(&va->rb_node); rb_erase(&va->rb_node, &vmap_area_root); RB_CLEAR_NODE(&va->rb_node); list_del_rcu(&va->list); -- 1.7.0.5 -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: vmalloc performance
From: Steven Whitehouse @ 2010-04-19 12:58 UTC
To: Minchan Kim; Cc: linux-mm, linux-kernel, Nick Piggin

Hi,

On Mon, 2010-04-19 at 00:14 +0900, Minchan Kim wrote:
> On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote:
> > [...]
>
> I vote for a "free area cache".

My results with this patch are:

vmalloc took 5419238 us
vmalloc took 5432874 us
vmalloc took 5425568 us
vmalloc took 5423867 us

So that's about a third of the time it took with my original patch, so
very much going in the right direction :-)

I did get a compile warning:

  CC      mm/vmalloc.o
mm/vmalloc.c: In function ‘__free_vmap_area’:
mm/vmalloc.c:454: warning: unused variable ‘prev’

....harmless, but it should be fixed before the final version,

Steve.
* Re: vmalloc performance
From: Minchan Kim @ 2010-04-19 14:12 UTC
To: Steven Whitehouse; Cc: linux-mm, linux-kernel, Nick Piggin

On Mon, Apr 19, 2010 at 9:58 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> My results with this patch are:
>
> vmalloc took 5419238 us
> vmalloc took 5432874 us
> vmalloc took 5425568 us
> vmalloc took 5423867 us
>
> So that's about a third of the time it took with my original patch, so
> very much going in the right direction :-)

Good. :)

> I did get a compile warning:
>
>   CC      mm/vmalloc.o
> mm/vmalloc.c: In function ‘__free_vmap_area’:
> mm/vmalloc.c:454: warning: unused variable ‘prev’
>
> ....harmless, but it should be fixed before the final version,

Of course. It isn't a formal patch, just something to show the concept. :)

Thanks for spending your precious time on this. :)
As Nick commented, I have further work to do.
Maybe Nick can do it faster than me.
Anyway, I hope it solves your problem.

Thanks, Steven.

-- 
Kind regards,
Minchan Kim
* Re: vmalloc performance
From: Steven Whitehouse @ 2010-04-29 13:43 UTC
To: Minchan Kim; Cc: linux-mm, linux-kernel, Nick Piggin

Hi,

On Mon, 2010-04-19 at 23:12 +0900, Minchan Kim wrote:
> On Mon, Apr 19, 2010 at 9:58 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> > [...]
>
> Of course. It isn't a formal patch, just something to show the concept. :)
>
> Thanks for spending your precious time on this. :)
> As Nick commented, I have further work to do.
> Maybe Nick can do it faster than me.
> Anyway, I hope it solves your problem.
>
> Thanks, Steven.

Your latest patch has now been run through the GFS2 tests which
originally triggered my investigation. It seems to solve the problem
completely. Many thanks for your efforts in helping us find and fix the
problem. The next question is what remains to be done in order to get
the patch into a form suitable for upstream merge?

Steve.
* [PATCH] cache last free vmap_area to avoid restarting beginning 2010-04-29 13:43 ` Steven Whitehouse @ 2010-05-02 17:29 ` Minchan Kim 2010-05-05 12:48 ` Steven Whitehouse 0 siblings, 1 reply; 30+ messages in thread From: Minchan Kim @ 2010-05-02 17:29 UTC (permalink / raw) To: Steven Whitehouse, Andrew Morton; +Cc: linux-mm, linux-kernel, Nick Piggin On Thu, 2010-04-29 at 14:43 +0100, Steven Whitehouse wrote: > Hi, > > On Mon, 2010-04-19 at 23:12 +0900, Minchan Kim wrote: > > On Mon, Apr 19, 2010 at 9:58 PM, Steven Whitehouse <swhiteho@redhat.com> wrote: > > > Hi, > > > > > > On Mon, 2010-04-19 at 00:14 +0900, Minchan Kim wrote: > > >> On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote: > > >> > Hi, > > >> > > > >> > On Fri, 2010-04-16 at 01:51 +0900, Minchan Kim wrote: > > >> > [snip] > > >> > > Thanks for the explanation. It seems to be real issue. > > >> > > > > >> > > I tested to see effect with flush during rb tree search. > > >> > > > > >> > > Before I applied your patch, the time is 50300661 us. > > >> > > After your patch, 11569357 us. > > >> > > After my debug patch, 6104875 us. > > >> > > > > >> > > I tested it as changing threshold value. > > >> > > > > >> > > threshold time > > >> > > 1000 13892809 > > >> > > 500 9062110 > > >> > > 200 6714172 > > >> > > 100 6104875 > > >> > > 50 6758316 > > >> > > > > >> > My results show: > > >> > > > >> > threshold time > > >> > 100000 139309948 > > >> > 1000 13555878 > > >> > 500 10069801 > > >> > 200 7813667 > > >> > 100 18523172 > > >> > 50 18546256 > > >> > > > >> > > And perf shows smp_call_function is very low percentage. > > >> > > > > >> > > In my cases, 100 is best. > > >> > > > > >> > Looks like 200 for me. > > >> > > > >> > I think you meant to use the non _minmax version of proc_dointvec too? > > >> > > >> Yes. My fault :) > > >> > > >> > Although it doesn't make any difference for this basic test. > > >> > > > >> > The original reporter also has 8 cpu cores I've discovered. In his case > > >> > divided by 4 cpus where as mine are divided by 2 cpus, but I think that > > >> > makes no real difference in this case. > > >> > > > >> > I'll try and get some further test results ready shortly. Many thanks > > >> > for all your efforts in tracking this down, > > >> > > > >> > Steve. > > >> > > >> I voted "free area cache". > > > My results with this patch are: > > > > > > vmalloc took 5419238 us > > > vmalloc took 5432874 us > > > vmalloc took 5425568 us > > > vmalloc took 5423867 us > > > > > > So thats about a third of the time it took with my original patch, so > > > very much going in the right direction :-) > > > > Good. :) > > > > > > > > I did get a compile warning: > > > CC mm/vmalloc.o > > > mm/vmalloc.c: In function a??__free_vmap_areaa??: > > > mm/vmalloc.c:454: warning: unused variable a??preva?? > > > > > > ....harmless, but it should be fixed before the final version, > > > > Of course. It's not formal patch but for showing concept . :) > > > > Thanks for consuming precious your time. :) > > As Nick comments, I have to do further work. > > Maybe Nick could do it faster than me. > > Anyway, I hope it can solve your problem. > > > > Thanks, Steven. > > > > > > > > Steve. > > > > > > > > Your latest patch has now been run though the GFS2 tests which > originally triggered my investigation. It seems to solve the problem > completely. Maybe thanks for your efforts in helping us find and fix the > problem. The next question is what remains to be done in order to get > the patch into a form suitable for upstream merge? 
> > Steve. > Hi, Steven. Sorry for lazy response. I wanted to submit the patch which implement Nick's request whole. And unfortunately, I am so busy now. But if it's urgent, I want to submit this one firstly and at next version, maybe I will submit remained TODO things after middle of May. I think this patch can't make regression other usages. Nick. What do you think about? == CUT_HERE == >From c93437583b5ff476fcfe13901898f981baa672d8 Mon Sep 17 00:00:00 2001 From: Minchan Kim <minchan.kim@gmail.com> Date: Mon, 3 May 2010 01:43:30 +0900 Subject: [PATCH] cache last free vmap_area to avoid restarting beginning. Steven Whitehouse reported that GFS2 had a regression about vmalloc. He measured some test module to compare vmalloc speed on the two cases. 1. lazy TLB flush 2. disable lazy TLB flush by hard coding 1) vmalloc took 148798983 us vmalloc took 151664529 us vmalloc took 152416398 us vmalloc took 151837733 us 2) vmalloc took 15363634 us vmalloc took 15358026 us vmalloc took 15240955 us vmalloc took 15402302 us You can refer test module and Steven's patch with https://bugzilla.redhat.com/show_bug.cgi?id=581459. The cause is that lazy TLB flush can delay release vmap_area. OTOH, To find free vmap_area is always started from beginnig of rbnode. So before lazy TLB flush happens, searching free vmap_area could take long time. Steven's experiment can do 9 times faster than old. But Always disable lazy TLB flush is not good. This patch caches next free vmap_area to accelerate. In my test case, following as. The result is following as. 1) vanilla elapsed time # search of rbtree vmalloc took 49121724 us 5535 vmalloc took 50675245 us 5535 vmalloc took 48987711 us 5535 vmalloc took 54232479 us 5535 vmalloc took 50258117 us 5535 vmalloc took 49424859 us 5535 3) Steven's patch elapsed time # search of rbtree vmalloc took 11363341 us 62 vmalloc took 12798868 us 62 vmalloc took 13247942 us 62 vmalloc took 11434647 us 62 vmalloc took 13221733 us 62 vmalloc took 12134019 us 62 2) my patch(vmap cache) elapsed time # search of rbtree vmalloc took 5159893 us 8 vmalloc took 5124434 us 8 vmalloc took 5123291 us 8 vmalloc took 5145396 us 12 vmalloc took 5163605 us 8 vmalloc took 5945663 us 8 Nick commented some advise. " - invalidating the cache in the case of vstart being decreased. - Don't unconditionally reset the cache to the last vm area freed, because you might have a higher area freed after a lower area. Only reset if the freed area is lower. - Do keep a cached hole size, so smaller lookups can restart a full search. - refactoring rbtree search code to manage alloc_vmap_area complexity " Now, it's on my TODO list. Cc: Nick Piggin <npiggin@suse.de> Reported-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Tested-by: Steven Whitehouse <swhiteho@redhat.com> --- mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++-------------- 1 files changed, 35 insertions(+), 14 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ae00746..56f09ec 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -263,6 +263,7 @@ struct vmap_area { static DEFINE_SPINLOCK(vmap_area_lock); static struct rb_root vmap_area_root = RB_ROOT; +static struct rb_node *free_vmap_cache; static LIST_HEAD(vmap_area_list); static unsigned long vmap_area_pcpu_hole; @@ -319,6 +320,7 @@ static void __insert_vmap_area(struct vmap_area *va) static void purge_vmap_area_lazy(void); +unsigned long max_lookup_count; /* * Allocate a region of KVA of the specified size and alignment, within the * vstart and vend. 
@@ -332,6 +334,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, struct rb_node *n; unsigned long addr; int purged = 0; + int lookup_cache = 0; + struct vmap_area *first; BUG_ON(!size); BUG_ON(size & ~PAGE_MASK); @@ -342,29 +346,42 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, return ERR_PTR(-ENOMEM); retry: + first = NULL; addr = ALIGN(vstart, align); spin_lock(&vmap_area_lock); if (addr + size - 1 < addr) goto overflow; - /* XXX: could have a last_hole cache */ n = vmap_area_root.rb_node; - if (n) { - struct vmap_area *first = NULL; + if (free_vmap_cache && !purged) { + struct vmap_area *cache; + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + if (cache->va_start >= addr && cache->va_end < vend) { + lookup_cache = 1; + n = free_vmap_cache; + } + } - do { - struct vmap_area *tmp; - tmp = rb_entry(n, struct vmap_area, rb_node); - if (tmp->va_end >= addr) { - if (!first && tmp->va_start < addr + size) + if (n) { + if (!lookup_cache) { + do { + struct vmap_area *tmp; + tmp = rb_entry(n, struct vmap_area, rb_node); + if (tmp->va_end >= addr) { + if (!first && tmp->va_start < addr + size) + first = tmp; + n = n->rb_left; + } else { first = tmp; - n = n->rb_left; - } else { - first = tmp; - n = n->rb_right; - } - } while (n); + n = n->rb_right; + } + } while (n); + } + else { + first = rb_entry(n, struct vmap_area, rb_node); + addr = first->va_start; + } if (!first) goto found; @@ -396,6 +413,7 @@ overflow: if (!purged) { purge_vmap_area_lazy(); purged = 1; + lookup_cache = 0; goto retry; } if (printk_ratelimit()) @@ -412,6 +430,7 @@ overflow: va->va_end = addr + size; va->flags = 0; __insert_vmap_area(va); + free_vmap_cache = &va->rb_node; spin_unlock(&vmap_area_lock); return va; @@ -426,7 +445,9 @@ static void rcu_free_va(struct rcu_head *head) static void __free_vmap_area(struct vmap_area *va) { + struct rb_node *prev; BUG_ON(RB_EMPTY_NODE(&va->rb_node)); + free_vmap_cache = rb_prev(&va->rb_node); rb_erase(&va->rb_node, &vmap_area_root); RB_CLEAR_NODE(&va->rb_node); list_del_rcu(&va->list); -- 1.7.0.5 -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 30+ messages in thread
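[Editorial aside] The workload being timed in the commit message above is just a tight vmalloc/touch/vfree loop. For readers without access to the bugzilla attachment, a minimal kernel-module sketch of that kind of timing test might look like the following; the iteration count, the allocation size and the use of do_gettimeofday() are assumptions made purely for illustration, not the actual test module from the report.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/time.h>

#define NR_LOOPS	1000000UL	/* assumed iteration count */
#define TEST_SIZE	(16 * 1024)	/* assumed allocation size */

static int __init vmalloc_timing_init(void)
{
	struct timeval start, end;
	unsigned long i, j;
	long long usecs;

	do_gettimeofday(&start);
	for (i = 0; i < NR_LOOPS; i++) {
		char *p = vmalloc(TEST_SIZE);

		if (!p)
			return -ENOMEM;
		for (j = 0; j < TEST_SIZE; j += PAGE_SIZE)
			p[j] = 0xff;	/* touch each page of the new mapping */
		vfree(p);
	}
	do_gettimeofday(&end);

	usecs = (end.tv_sec - start.tv_sec) * 1000000LL +
		(end.tv_usec - start.tv_usec);
	printk(KERN_INFO "vmalloc took %lld us\n", usecs);

	return 0;
}

static void __exit vmalloc_timing_exit(void)
{
}

module_init(vmalloc_timing_init);
module_exit(vmalloc_timing_exit);
MODULE_LICENSE("GPL");

With lazy purging enabled, the cost of a loop like this shows up as ever longer searches in alloc_vmap_area(), since freed-but-not-yet-flushed areas accumulate in the rbtree; that is the behaviour the caching patches in this thread attack.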
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-02 17:29 ` [PATCH] cache last free vmap_area to avoid restarting beginning Minchan Kim @ 2010-05-05 12:48 ` Steven Whitehouse 2010-05-05 16:16 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Steven Whitehouse @ 2010-05-05 12:48 UTC (permalink / raw) To: Minchan Kim; +Cc: Andrew Morton, linux-mm, linux-kernel, Nick Piggin Hi, On Mon, 2010-05-03 at 02:29 +0900, Minchan Kim wrote: > Hi, Steven. > > Sorry for lazy response. > I wanted to submit the patch which implement Nick's request whole. > And unfortunately, I am so busy now. > But if it's urgent, I want to submit this one firstly and > at next version, maybe I will submit remained TODO things > after middle of May. > > I think this patch can't make regression other usages. > Nick. What do you think about? > I guess the question is whether the remaining items are essential for correct functioning of this patch, or whether they are "it would be nice if" items. I suspect that they are the latter (I'm not a VM expert, but from the brief descriptions it looks like that to me) in which case I'd suggest send the currently existing patch first and the following up with the remaining changes later. We have got a nice speed up with your current patch and so far as I'm aware not introduced any new bugs or regressions with it. Nick, does that sound ok? Steve. > == CUT_HERE == > >From c93437583b5ff476fcfe13901898f981baa672d8 Mon Sep 17 00:00:00 2001 > From: Minchan Kim <minchan.kim@gmail.com> > Date: Mon, 3 May 2010 01:43:30 +0900 > Subject: [PATCH] cache last free vmap_area to avoid restarting beginning. > > Steven Whitehouse reported that GFS2 had a regression about vmalloc. > He measured some test module to compare vmalloc speed on the two cases. > > 1. lazy TLB flush > 2. disable lazy TLB flush by hard coding > > 1) > vmalloc took 148798983 us > vmalloc took 151664529 us > vmalloc took 152416398 us > vmalloc took 151837733 us > > 2) > vmalloc took 15363634 us > vmalloc took 15358026 us > vmalloc took 15240955 us > vmalloc took 15402302 us > > You can refer test module and Steven's patch > with https://bugzilla.redhat.com/show_bug.cgi?id=581459. > > The cause is that lazy TLB flush can delay release vmap_area. > OTOH, To find free vmap_area is always started from beginnig of rbnode. > So before lazy TLB flush happens, searching free vmap_area could take > long time. > > Steven's experiment can do 9 times faster than old. > But Always disable lazy TLB flush is not good. > > This patch caches next free vmap_area to accelerate. > In my test case, following as. > > The result is following as. > > 1) vanilla > elapsed time # search of rbtree > vmalloc took 49121724 us 5535 > vmalloc took 50675245 us 5535 > vmalloc took 48987711 us 5535 > vmalloc took 54232479 us 5535 > vmalloc took 50258117 us 5535 > vmalloc took 49424859 us 5535 > > 3) Steven's patch > > elapsed time # search of rbtree > vmalloc took 11363341 us 62 > vmalloc took 12798868 us 62 > vmalloc took 13247942 us 62 > vmalloc took 11434647 us 62 > vmalloc took 13221733 us 62 > vmalloc took 12134019 us 62 > > 2) my patch(vmap cache) > elapsed time # search of rbtree > vmalloc took 5159893 us 8 > vmalloc took 5124434 us 8 > vmalloc took 5123291 us 8 > vmalloc took 5145396 us 12 > vmalloc took 5163605 us 8 > vmalloc took 5945663 us 8 > > Nick commented some advise. > " > - invalidating the cache in the case of vstart being decreased. 
> - Don't unconditionally reset the cache to the last vm area freed, > because you might have a higher area freed after a lower area. Only > reset if the freed area is lower. > - Do keep a cached hole size, so smaller lookups can restart a full > search. > - refactoring rbtree search code to manage alloc_vmap_area complexity > " > > Now, it's on my TODO list. > > Cc: Nick Piggin <npiggin@suse.de> > Reported-by: Steven Whitehouse <swhiteho@redhat.com> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > Tested-by: Steven Whitehouse <swhiteho@redhat.com> > --- > mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++-------------- > 1 files changed, 35 insertions(+), 14 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index ae00746..56f09ec 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -263,6 +263,7 @@ struct vmap_area { > > static DEFINE_SPINLOCK(vmap_area_lock); > static struct rb_root vmap_area_root = RB_ROOT; > +static struct rb_node *free_vmap_cache; > static LIST_HEAD(vmap_area_list); > static unsigned long vmap_area_pcpu_hole; > > @@ -319,6 +320,7 @@ static void __insert_vmap_area(struct vmap_area *va) > > static void purge_vmap_area_lazy(void); > > +unsigned long max_lookup_count; > /* > * Allocate a region of KVA of the specified size and alignment, within the > * vstart and vend. > @@ -332,6 +334,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > struct rb_node *n; > unsigned long addr; > int purged = 0; > + int lookup_cache = 0; > + struct vmap_area *first; > > BUG_ON(!size); > BUG_ON(size & ~PAGE_MASK); > @@ -342,29 +346,42 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > return ERR_PTR(-ENOMEM); > > retry: > + first = NULL; > addr = ALIGN(vstart, align); > > spin_lock(&vmap_area_lock); > if (addr + size - 1 < addr) > goto overflow; > > - /* XXX: could have a last_hole cache */ > n = vmap_area_root.rb_node; > - if (n) { > - struct vmap_area *first = NULL; > + if (free_vmap_cache && !purged) { > + struct vmap_area *cache; > + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + if (cache->va_start >= addr && cache->va_end < vend) { > + lookup_cache = 1; > + n = free_vmap_cache; > + } > + } > > - do { > - struct vmap_area *tmp; > - tmp = rb_entry(n, struct vmap_area, rb_node); > - if (tmp->va_end >= addr) { > - if (!first && tmp->va_start < addr + size) > + if (n) { > + if (!lookup_cache) { > + do { > + struct vmap_area *tmp; > + tmp = rb_entry(n, struct vmap_area, rb_node); > + if (tmp->va_end >= addr) { > + if (!first && tmp->va_start < addr + size) > + first = tmp; > + n = n->rb_left; > + } else { > first = tmp; > - n = n->rb_left; > - } else { > - first = tmp; > - n = n->rb_right; > - } > - } while (n); > + n = n->rb_right; > + } > + } while (n); > + } > + else { > + first = rb_entry(n, struct vmap_area, rb_node); > + addr = first->va_start; > + } > > if (!first) > goto found; > @@ -396,6 +413,7 @@ overflow: > if (!purged) { > purge_vmap_area_lazy(); > purged = 1; > + lookup_cache = 0; > goto retry; > } > if (printk_ratelimit()) > @@ -412,6 +430,7 @@ overflow: > va->va_end = addr + size; > va->flags = 0; > __insert_vmap_area(va); > + free_vmap_cache = &va->rb_node; > spin_unlock(&vmap_area_lock); > > return va; > @@ -426,7 +445,9 @@ static void rcu_free_va(struct rcu_head *head) > > static void __free_vmap_area(struct vmap_area *va) > { > + struct rb_node *prev; > BUG_ON(RB_EMPTY_NODE(&va->rb_node)); > + free_vmap_cache = rb_prev(&va->rb_node); > rb_erase(&va->rb_node, &vmap_area_root); > 
RB_CLEAR_NODE(&va->rb_node); > list_del_rcu(&va->list); > -- > 1.7.0.5 > > > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-05 12:48 ` Steven Whitehouse @ 2010-05-05 16:16 ` Nick Piggin 2010-05-17 12:42 ` Steven Whitehouse ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Nick Piggin @ 2010-05-05 16:16 UTC (permalink / raw) To: Steven Whitehouse; +Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > Hi, > > On Mon, 2010-05-03 at 02:29 +0900, Minchan Kim wrote: > > Hi, Steven. > > > > Sorry for lazy response. > > I wanted to submit the patch which implement Nick's request whole. > > And unfortunately, I am so busy now. > > But if it's urgent, I want to submit this one firstly and > > at next version, maybe I will submit remained TODO things > > after middle of May. > > > > I think this patch can't make regression other usages. > > Nick. What do you think about? > > > I guess the question is whether the remaining items are essential for > correct functioning of this patch, or whether they are "it would be nice > if" items. I suspect that they are the latter (I'm not a VM expert, but > from the brief descriptions it looks like that to me) in which case I'd > suggest send the currently existing patch first and the following up > with the remaining changes later. > > We have got a nice speed up with your current patch and so far as I'm > aware not introduced any new bugs or regressions with it. > > Nick, does that sound ok? Just got around to looking at it again. I definitely agree we need to fix the regression, however I'm concerned about introducing other possible problems while doing that. The following patch should (modulo bugs, but it's somewhat tested) give no difference in the allocation patterns, so won't introduce virtual memory layout changes. Any chance you could test it? 
--- mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++-------------- 1 files changed, 35 insertions(+), 14 deletions(-) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c +++ linux-2.6/mm/vmalloc.c @@ -262,8 +262,13 @@ struct vmap_area { }; static DEFINE_SPINLOCK(vmap_area_lock); -static struct rb_root vmap_area_root = RB_ROOT; static LIST_HEAD(vmap_area_list); +static struct rb_root vmap_area_root = RB_ROOT; + +static struct rb_node *free_vmap_cache; +static unsigned long cached_hole_size; +static unsigned long cached_start; + static unsigned long vmap_area_pcpu_hole; static struct vmap_area *__find_vmap_area(unsigned long addr) @@ -332,6 +337,7 @@ static struct vmap_area *alloc_vmap_area struct rb_node *n; unsigned long addr; int purged = 0; + struct vmap_area *first; BUG_ON(!size); BUG_ON(size & ~PAGE_MASK); @@ -348,11 +354,23 @@ retry: if (addr + size - 1 < addr) goto overflow; - /* XXX: could have a last_hole cache */ - n = vmap_area_root.rb_node; - if (n) { - struct vmap_area *first = NULL; + if (size <= cached_hole_size || addr < cached_start || !free_vmap_cache) { + cached_hole_size = 0; + cached_start = addr; + free_vmap_cache = NULL; + } + /* find starting point for our search */ + if (free_vmap_cache) { + first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + addr = ALIGN(first->va_end + PAGE_SIZE, align); + + } else { + n = vmap_area_root.rb_node; + if (!n) + goto found; + + first = NULL; do { struct vmap_area *tmp; tmp = rb_entry(n, struct vmap_area, rb_node); @@ -369,26 +387,36 @@ retry: if (!first) goto found; - if (first->va_end < addr) { + if (first->va_start < addr) { + BUG_ON(first->va_end < addr); n = rb_next(&first->rb_node); + addr = ALIGN(first->va_end + PAGE_SIZE, align); if (n) first = rb_entry(n, struct vmap_area, rb_node); else goto found; } + BUG_ON(first->va_start < addr); + if (addr + cached_hole_size < first->va_start) + cached_hole_size = first->va_start - addr; + } - while (addr + size > first->va_start && addr + size <= vend) { - addr = ALIGN(first->va_end + PAGE_SIZE, align); - if (addr + size - 1 < addr) - goto overflow; + /* from the starting point, walk areas until a suitable hole is found */ - n = rb_next(&first->rb_node); - if (n) - first = rb_entry(n, struct vmap_area, rb_node); - else - goto found; - } + while (addr + size > first->va_start && addr + size <= vend) { + if (addr + cached_hole_size < first->va_start) + cached_hole_size = first->va_start - addr; + addr = ALIGN(first->va_end + PAGE_SIZE, align); + if (addr + size - 1 < addr) + goto overflow; + + n = rb_next(&first->rb_node); + if (n) + first = rb_entry(n, struct vmap_area, rb_node); + else + goto found; } + found: if (addr + size > vend) { overflow: @@ -412,6 +440,7 @@ overflow: va->va_end = addr + size; va->flags = 0; __insert_vmap_area(va); + free_vmap_cache = &va->rb_node; spin_unlock(&vmap_area_lock); return va; @@ -427,6 +456,21 @@ static void rcu_free_va(struct rcu_head static void __free_vmap_area(struct vmap_area *va) { BUG_ON(RB_EMPTY_NODE(&va->rb_node)); + + if (free_vmap_cache) { + if (va->va_end < cached_start) { + cached_hole_size = 0; + cached_start = 0; + free_vmap_cache = NULL; + } else { + struct vmap_area *cache; + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + if (va->va_start <= cache->va_start) { + free_vmap_cache = rb_prev(&va->rb_node); + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + } + } + } rb_erase(&va->rb_node, &vmap_area_root); 
RB_CLEAR_NODE(&va->rb_node); list_del_rcu(&va->list); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-05 16:16 ` Nick Piggin @ 2010-05-17 12:42 ` Steven Whitehouse 2010-05-18 13:44 ` Steven Whitehouse 2010-05-19 13:54 ` Steven Whitehouse 2010-05-22 9:53 ` Minchan Kim 2 siblings, 1 reply; 30+ messages in thread From: Steven Whitehouse @ 2010-05-17 12:42 UTC (permalink / raw) To: Nick Piggin; +Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel Hi, On Thu, 2010-05-06 at 02:16 +1000, Nick Piggin wrote: > On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > > Hi, > > > > On Mon, 2010-05-03 at 02:29 +0900, Minchan Kim wrote: > > > Hi, Steven. > > > > > > Sorry for lazy response. > > > I wanted to submit the patch which implement Nick's request whole. > > > And unfortunately, I am so busy now. > > > But if it's urgent, I want to submit this one firstly and > > > at next version, maybe I will submit remained TODO things > > > after middle of May. > > > > > > I think this patch can't make regression other usages. > > > Nick. What do you think about? > > > > > I guess the question is whether the remaining items are essential for > > correct functioning of this patch, or whether they are "it would be nice > > if" items. I suspect that they are the latter (I'm not a VM expert, but > > from the brief descriptions it looks like that to me) in which case I'd > > suggest send the currently existing patch first and the following up > > with the remaining changes later. > > > > We have got a nice speed up with your current patch and so far as I'm > > aware not introduced any new bugs or regressions with it. > > > > Nick, does that sound ok? > > Just got around to looking at it again. I definitely agree we need to > fix the regression, however I'm concerned about introducing other > possible problems while doing that. > > The following patch should (modulo bugs, but it's somewhat tested) give > no difference in the allocation patterns, so won't introduce virtual > memory layout changes. > > Any chance you could test it? > Apologies for the delay. I tried the patch on my test box and it worked perfectly ok. When the original test was tried which triggered the investigation in the first place, it failed to boot. Since that box is remote and with limited remote console access, all I've been able to find out is "it didn't work" which isn't very helpful. I'm currently trying to figure out how we can work out whats wrong. It isn't at all certain that it is an issue with this patch - it could be almost anything :( Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-17 12:42 ` Steven Whitehouse @ 2010-05-18 13:44 ` Steven Whitehouse 0 siblings, 0 replies; 30+ messages in thread From: Steven Whitehouse @ 2010-05-18 13:44 UTC (permalink / raw) To: Nick Piggin; +Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel Hi, On Mon, 2010-05-17 at 13:42 +0100, Steven Whitehouse wrote: > Hi, > > On Thu, 2010-05-06 at 02:16 +1000, Nick Piggin wrote: > > On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > > > Hi, > > > > > > On Mon, 2010-05-03 at 02:29 +0900, Minchan Kim wrote: > > > > Hi, Steven. > > > > > > > > Sorry for lazy response. > > > > I wanted to submit the patch which implement Nick's request whole. > > > > And unfortunately, I am so busy now. > > > > But if it's urgent, I want to submit this one firstly and > > > > at next version, maybe I will submit remained TODO things > > > > after middle of May. > > > > > > > > I think this patch can't make regression other usages. > > > > Nick. What do you think about? > > > > > > > I guess the question is whether the remaining items are essential for > > > correct functioning of this patch, or whether they are "it would be nice > > > if" items. I suspect that they are the latter (I'm not a VM expert, but > > > from the brief descriptions it looks like that to me) in which case I'd > > > suggest send the currently existing patch first and the following up > > > with the remaining changes later. > > > > > > We have got a nice speed up with your current patch and so far as I'm > > > aware not introduced any new bugs or regressions with it. > > > > > > Nick, does that sound ok? > > > > Just got around to looking at it again. I definitely agree we need to > > fix the regression, however I'm concerned about introducing other > > possible problems while doing that. > > > > The following patch should (modulo bugs, but it's somewhat tested) give > > no difference in the allocation patterns, so won't introduce virtual > > memory layout changes. > > > > Any chance you could test it? > > > > Apologies for the delay. I tried the patch on my test box and it worked > perfectly ok. When the original test was tried which triggered the > investigation in the first place, it failed to boot. Since that box is > remote and with limited remote console access, all I've been able to > find out is "it didn't work" which isn't very helpful. > > I'm currently trying to figure out how we can work out whats wrong. It > isn't at all certain that it is an issue with this patch - it could be > almost anything :( > > Steve. > > Further tests show that exactly the same kernel, without that single patch works ok, and but that with the patch we get the crash on boot. We are trying to arrange for better console access to the test box (which is remote) and will report back if we manage that and capture any output, Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-05 16:16 ` Nick Piggin 2010-05-17 12:42 ` Steven Whitehouse @ 2010-05-19 13:54 ` Steven Whitehouse 2010-05-19 13:56 ` Nick Piggin 2010-05-25 8:43 ` Nick Piggin 2010-05-22 9:53 ` Minchan Kim 2 siblings, 2 replies; 30+ messages in thread From: Steven Whitehouse @ 2010-05-19 13:54 UTC (permalink / raw) To: Nick Piggin; +Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel Hi, On Thu, 2010-05-06 at 02:16 +1000, Nick Piggin wrote: > On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > > Hi, > > > > On Mon, 2010-05-03 at 02:29 +0900, Minchan Kim wrote: > > > Hi, Steven. > > > > > > Sorry for lazy response. > > > I wanted to submit the patch which implement Nick's request whole. > > > And unfortunately, I am so busy now. > > > But if it's urgent, I want to submit this one firstly and > > > at next version, maybe I will submit remained TODO things > > > after middle of May. > > > > > > I think this patch can't make regression other usages. > > > Nick. What do you think about? > > > > > I guess the question is whether the remaining items are essential for > > correct functioning of this patch, or whether they are "it would be nice > > if" items. I suspect that they are the latter (I'm not a VM expert, but > > from the brief descriptions it looks like that to me) in which case I'd > > suggest send the currently existing patch first and the following up > > with the remaining changes later. > > > > We have got a nice speed up with your current patch and so far as I'm > > aware not introduced any new bugs or regressions with it. > > > > Nick, does that sound ok? > > Just got around to looking at it again. I definitely agree we need to > fix the regression, however I'm concerned about introducing other > possible problems while doing that. > > The following patch should (modulo bugs, but it's somewhat tested) give > no difference in the allocation patterns, so won't introduce virtual > memory layout changes. > > Any chance you could test it? > At last some further info on the failed boot during testing. The messages look like this: dracut: Starting plymouth daemon G------------[ cut here ]------------ kernel BUG at mm/vmalloc.c:391! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent CPU 7 Modules linked in: Pid: 193, comm: modprobe Tainted: G W 2.6.32-23.el6.bz583026.patches2.3.7.x86_64 #1 ProLiant DL580 G3 RIP: 0010:[<ffffffff8113c161>] [<ffffffff8113c161>] alloc_vmap_area+0x431/0x440 RSP: 0018:ffff8803dae3bcf8 EFLAGS: 00010287 RAX: ffffc9001232e000 RBX: 0000000000004000 RCX: 0000000000000000 RDX: ffffffffa0000000 RSI: ffff8803db66fdc0 RDI: ffffffff81b6d0a0 RBP: ffff8803dae3bd88 R08: 000000000000000a R09: 00000000000000d0 R10: ffff8803db6b6e40 R11: 0000000000000040 R12: 0000000000000001 R13: ffffffffff000000 R14: ffffffffffffffff R15: ffffffffa0000000 FS: 00007f5872189700(0000) GS:ffff88002c2e0000(0000) knlGS:0000000000000000 and the code around that point is: static struct vmap_area *alloc_vmap_area(unsigned long size, unsigned long align, unsigned long vstart, unsigned long vend, int node, gfp_t gfp_mask) { ... if (!first) goto found; if (first->va_start < addr) { 391> BUG_ON(first->va_end < addr); n = rb_next(&first->rb_node); addr = ALIGN(first->va_end + PAGE_SIZE, align); if (n) first = rb_entry(n, struct vmap_area, rb_node); else goto found; } so that seems to pinpoint the line on which the problem occurred. Let us know if you'd like us to do some more testing. 
I think we have the console access issue fixed now. Many thanks for all your help in this so far, Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-19 13:54 ` Steven Whitehouse @ 2010-05-19 13:56 ` Nick Piggin 2010-05-25 8:43 ` Nick Piggin 1 sibling, 0 replies; 30+ messages in thread From: Nick Piggin @ 2010-05-19 13:56 UTC (permalink / raw) To: Steven Whitehouse; +Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel On Wed, May 19, 2010 at 02:54:54PM +0100, Steven Whitehouse wrote: > On Thu, 2010-05-06 at 02:16 +1000, Nick Piggin wrote: > > On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > > Any chance you could test it? > > > > At last some further info on the failed boot during testing. The > messages look like this: > > dracut: Starting plymouth daemon > G------------[ cut here ]------------ > kernel BUG at mm/vmalloc.c:391! > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent > CPU 7 > Modules linked in: > Pid: 193, comm: modprobe Tainted: G W 2.6.32-23.el6.bz583026.patches2.3.7.x86_64 #1 ProLiant DL580 G3 > RIP: 0010:[<ffffffff8113c161>] [<ffffffff8113c161>] alloc_vmap_area+0x431/0x440 > RSP: 0018:ffff8803dae3bcf8 EFLAGS: 00010287 > RAX: ffffc9001232e000 RBX: 0000000000004000 RCX: 0000000000000000 > RDX: ffffffffa0000000 RSI: ffff8803db66fdc0 RDI: ffffffff81b6d0a0 > RBP: ffff8803dae3bd88 R08: 000000000000000a R09: 00000000000000d0 > R10: ffff8803db6b6e40 R11: 0000000000000040 R12: 0000000000000001 > R13: ffffffffff000000 R14: ffffffffffffffff R15: ffffffffa0000000 > FS: 00007f5872189700(0000) GS:ffff88002c2e0000(0000) knlGS:0000000000000000 > > and the code around that point is: > > static struct vmap_area *alloc_vmap_area(unsigned long size, > unsigned long align, > unsigned long vstart, unsigned long vend, > int node, gfp_t gfp_mask) > { > > ... > > if (!first) > goto found; > > if (first->va_start < addr) { > 391> BUG_ON(first->va_end < addr); > n = rb_next(&first->rb_node); > addr = ALIGN(first->va_end + PAGE_SIZE, align); > if (n) > first = rb_entry(n, struct vmap_area, rb_node); > else > goto found; > } > > > so that seems to pinpoint the line on which the problem occurred. Let us > know if you'd like us to do some more testing. I think we have the > console access issue fixed now. Many thanks for all you help in this > so far, Thanks for testing it out. Hmm, I thought I'd shaken out these bugs -- I put the code in a userspace test harness and hammered it pretty hard, but I must have overlooked something or you're triggering a really specific sequence. Let me get back to you if I cannot trigger anything here. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-19 13:54 ` Steven Whitehouse 2010-05-19 13:56 ` Nick Piggin @ 2010-05-25 8:43 ` Nick Piggin 2010-05-25 15:00 ` Minchan Kim 1 sibling, 1 reply; 30+ messages in thread From: Nick Piggin @ 2010-05-25 8:43 UTC (permalink / raw) To: Steven Whitehouse; +Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel On Wed, May 19, 2010 at 02:54:54PM +0100, Steven Whitehouse wrote: > so that seems to pinpoint the line on which the problem occurred. Let us > know if you'd like us to do some more testing. I think we have the > console access issue fixed now. Many thanks for all your help in this > so far, Sorry for the delay. Ended up requiring a bit of surgery and several bug fixes. I added a lot more test cases to my userspace tester, and found several bugs including the one you hit. Most of them were due to changing vstart,vend or changing requested alignment. I can't guarantee it's going to work for you (it boots here, but the last version booted as well). But I think it's in much better shape. It is very careful to reproduce exactly the same allocation behaviour, so the effectiveness of the cache can be reduced if sizes, alignments, or start,end ranges are very frequently changing. But I'd hope that for most vmap heavy workloads, they should cache quite well. We could look at doing smarter things if it isn't effective enough. -- Provide a free area cache for the vmalloc virtual address allocator, based on the approach taken in the user virtual memory allocator. This reduces the number of rbtree operations and linear traversals over the vmap extents to find a free area. The lazy vmap flushing makes this problem worse because freed but not yet flushed vmaps tend to build up in the address space between flushes. Steven noticed a performance problem with GFS2. Results are as follows...
mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++-------------- 1 files changed, 35 insertions(+), 14 deletions(-) Index: linux-2.6/mm/vmalloc.c =================================================================== --- linux-2.6.orig/mm/vmalloc.c +++ linux-2.6/mm/vmalloc.c @@ -262,8 +262,14 @@ struct vmap_area { }; static DEFINE_SPINLOCK(vmap_area_lock); -static struct rb_root vmap_area_root = RB_ROOT; static LIST_HEAD(vmap_area_list); +static struct rb_root vmap_area_root = RB_ROOT; + +static struct rb_node *free_vmap_cache; +static unsigned long cached_hole_size; +static unsigned long cached_start; +static unsigned long cached_align; + static unsigned long vmap_area_pcpu_hole; static struct vmap_area *__find_vmap_area(unsigned long addr) @@ -332,27 +338,52 @@ static struct vmap_area *alloc_vmap_area struct rb_node *n; unsigned long addr; int purged = 0; + struct vmap_area *first; BUG_ON(!size); BUG_ON(size & ~PAGE_MASK); + BUG_ON(!is_power_of_2(align)); va = kmalloc_node(sizeof(struct vmap_area), gfp_mask & GFP_RECLAIM_MASK, node); if (unlikely(!va)) return ERR_PTR(-ENOMEM); + spin_lock(&vmap_area_lock); retry: - addr = ALIGN(vstart, align); + /* invalidate cache if we have more permissive parameters */ + if (!free_vmap_cache || + size <= cached_hole_size || + vstart < cached_start || + align < cached_align) { + cached_hole_size = 0; + free_vmap_cache = NULL; + } + /* record if we encounter less permissive parameters */ + cached_start = vstart; + cached_align = align; + + /* find starting point for our search */ + if (free_vmap_cache) { + first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + addr = ALIGN(first->va_end + PAGE_SIZE, align); + if (addr < vstart) { + free_vmap_cache = NULL; + goto retry; + } + if (addr + size - 1 < addr) + goto overflow; - spin_lock(&vmap_area_lock); - if (addr + size - 1 < addr) - goto overflow; + } else { + addr = ALIGN(vstart, align); + if (addr + size - 1 < addr) + goto overflow; - /* XXX: could have a last_hole cache */ - n = vmap_area_root.rb_node; - if (n) { - struct vmap_area *first = NULL; + n = vmap_area_root.rb_node; + if (!n) + goto found; + first = NULL; do { struct vmap_area *tmp; tmp = rb_entry(n, struct vmap_area, rb_node); @@ -369,26 +400,37 @@ retry: if (!first) goto found; - if (first->va_end < addr) { - n = rb_next(&first->rb_node); - if (n) - first = rb_entry(n, struct vmap_area, rb_node); - else - goto found; - } - - while (addr + size > first->va_start && addr + size <= vend) { - addr = ALIGN(first->va_end + PAGE_SIZE, align); + if (first->va_start < addr) { + addr = ALIGN(max(first->va_end + PAGE_SIZE, addr), align); if (addr + size - 1 < addr) goto overflow; - n = rb_next(&first->rb_node); if (n) first = rb_entry(n, struct vmap_area, rb_node); else goto found; } + BUG_ON(first->va_start < addr); + if (addr + cached_hole_size < first->va_start) + cached_hole_size = first->va_start - addr; + } + + /* from the starting point, walk areas until a suitable hole is found */ + + while (addr + size > first->va_start && addr + size <= vend) { + if (addr + cached_hole_size < first->va_start) + cached_hole_size = first->va_start - addr; + addr = ALIGN(first->va_end + PAGE_SIZE, align); + if (addr + size - 1 < addr) + goto overflow; + + n = rb_next(&first->rb_node); + if (n) + first = rb_entry(n, struct vmap_area, rb_node); + else + goto found; } + found: if (addr + size > vend) { overflow: @@ -406,14 +448,17 @@ overflow: return ERR_PTR(-EBUSY); } - BUG_ON(addr & (align-1)); - va->va_start = addr; va->va_end = addr + size; va->flags = 
0; __insert_vmap_area(va); + free_vmap_cache = &va->rb_node; spin_unlock(&vmap_area_lock); + BUG_ON(va->va_start & (align-1)); + BUG_ON(va->va_start < vstart); + BUG_ON(va->va_end > vend); + return va; } @@ -427,6 +472,19 @@ static void rcu_free_va(struct rcu_head static void __free_vmap_area(struct vmap_area *va) { BUG_ON(RB_EMPTY_NODE(&va->rb_node)); + + if (free_vmap_cache) { + if (va->va_end < cached_start) { + free_vmap_cache = NULL; + } else { + struct vmap_area *cache; + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + if (va->va_start <= cache->va_start) { + free_vmap_cache = rb_prev(&va->rb_node); + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); + } + } + } rb_erase(&va->rb_node, &vmap_area_root); RB_CLEAR_NODE(&va->rb_node); list_del_rcu(&va->list); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-25 8:43 ` Nick Piggin @ 2010-05-25 15:00 ` Minchan Kim 2010-05-25 15:48 ` Steven Whitehouse 0 siblings, 1 reply; 30+ messages in thread From: Minchan Kim @ 2010-05-25 15:00 UTC (permalink / raw) To: Nick Piggin; +Cc: Steven Whitehouse, Andrew Morton, linux-mm, linux-kernel On Tue, May 25, 2010 at 06:43:23PM +1000, Nick Piggin wrote: > On Wed, May 19, 2010 at 02:54:54PM +0100, Steven Whitehouse wrote: > > so that seems to pinpoint the line on which the problem occurred. Let us > > know if you'd like us to do some more testing. I think we have the > > console access issue fixed now. Many thanks for all you help in this > > so far, > > Sorry for the delay. Ended up requiring a bit of surgery and several bug > fixes. I added a lot more test cases to my userspace tester, and found > several bugs including the one you hit. > > Most of them were due to changing vstart,vend or changing requested > alignment. > > I can't guarantee it's going to work for you (it boots here, but the > last version booted as well). But I think it's in much better shape. > > It is very careful to reproduce exactly the same allocation behaviour, > so the effectiveness of the cache can be reduced if sizes, alignments, > or start,end ranges are very frequently changing. But I'd hope that > for most vmap heavy workloads, they should cache quite well. We could > look at doing smarter things if it isn't effective enough. > > -- > Provide a free area cache for the vmalloc virtual address allocator, based > on the approach taken in the user virtual memory allocator. > > This reduces the number of rbtree operations and linear traversals over > the vmap extents to find a free area. The lazy vmap flushing makes this problem > worse because because freed but not yet flushed vmaps tend to build up in > the address space between flushes. > > Steven noticed a performance problem with GFS2. Results are as follows... > > > > mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++-------------- > 1 files changed, 35 insertions(+), 14 deletions(-) > > Index: linux-2.6/mm/vmalloc.c > =================================================================== > --- linux-2.6.orig/mm/vmalloc.c > +++ linux-2.6/mm/vmalloc.c > @@ -262,8 +262,14 @@ struct vmap_area { > }; > > static DEFINE_SPINLOCK(vmap_area_lock); > -static struct rb_root vmap_area_root = RB_ROOT; > static LIST_HEAD(vmap_area_list); > +static struct rb_root vmap_area_root = RB_ROOT; > + > +static struct rb_node *free_vmap_cache; > +static unsigned long cached_hole_size; > +static unsigned long cached_start; > +static unsigned long cached_align; > + > static unsigned long vmap_area_pcpu_hole; > > static struct vmap_area *__find_vmap_area(unsigned long addr) > @@ -332,27 +338,52 @@ static struct vmap_area *alloc_vmap_area > struct rb_node *n; > unsigned long addr; > int purged = 0; > + struct vmap_area *first; > > BUG_ON(!size); > BUG_ON(size & ~PAGE_MASK); > + BUG_ON(!is_power_of_2(align)); > > va = kmalloc_node(sizeof(struct vmap_area), > gfp_mask & GFP_RECLAIM_MASK, node); > if (unlikely(!va)) > return ERR_PTR(-ENOMEM); > > + spin_lock(&vmap_area_lock); vmap_area_lock is unbalnced with last spin_unlock in case of overflow. Maybe you hold the lock after retry and you release the lock before retry. 
> retry: > - addr = ALIGN(vstart, align); > + /* invalidate cache if we have more permissive parameters */ > + if (!free_vmap_cache || > + size <= cached_hole_size || > + vstart < cached_start || > + align < cached_align) { > + cached_hole_size = 0; > + free_vmap_cache = NULL; > + } > + /* record if we encounter less permissive parameters */ > + cached_start = vstart; > + cached_align = align; > + > + /* find starting point for our search */ > + if (free_vmap_cache) { > + first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > + if (addr < vstart) { > + free_vmap_cache = NULL; > + goto retry; > + } > + if (addr + size - 1 < addr) > + goto overflow; > > - spin_lock(&vmap_area_lock); > - if (addr + size - 1 < addr) > - goto overflow; > + } else { > + addr = ALIGN(vstart, align); > + if (addr + size - 1 < addr) > + goto overflow; > > - /* XXX: could have a last_hole cache */ > - n = vmap_area_root.rb_node; > - if (n) { > - struct vmap_area *first = NULL; > + n = vmap_area_root.rb_node; > + if (!n) > + goto found; > > + first = NULL; > do { > struct vmap_area *tmp; > tmp = rb_entry(n, struct vmap_area, rb_node); > @@ -369,26 +400,37 @@ retry: > if (!first) > goto found; > > - if (first->va_end < addr) { > - n = rb_next(&first->rb_node); > - if (n) > - first = rb_entry(n, struct vmap_area, rb_node); > - else > - goto found; > - } > - > - while (addr + size > first->va_start && addr + size <= vend) { > - addr = ALIGN(first->va_end + PAGE_SIZE, align); > + if (first->va_start < addr) { > + addr = ALIGN(max(first->va_end + PAGE_SIZE, addr), align); Frankly speaking, I don't see the benefit which you mentiond that it makes subsequent logic simpler. For me, I like old code which compares va_end. In case of spanning, old code has the problem? I think old code has no problem and looks good than current one. 
> if (addr + size - 1 < addr) > goto overflow; > - > n = rb_next(&first->rb_node); > if (n) > first = rb_entry(n, struct vmap_area, rb_node); > else > goto found; > } > + BUG_ON(first->va_start < addr); > + if (addr + cached_hole_size < first->va_start) > + cached_hole_size = first->va_start - addr; > + } > + > + /* from the starting point, walk areas until a suitable hole is found */ Unnecessary empty line :) > + > + while (addr + size > first->va_start && addr + size <= vend) { > + if (addr + cached_hole_size < first->va_start) > + cached_hole_size = first->va_start - addr; > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > + if (addr + size - 1 < addr) > + goto overflow; > + > + n = rb_next(&first->rb_node); > + if (n) > + first = rb_entry(n, struct vmap_area, rb_node); > + else > + goto found; > } > + > found: > if (addr + size > vend) { > overflow: > @@ -406,14 +448,17 @@ overflow: > return ERR_PTR(-EBUSY); > } > > - BUG_ON(addr & (align-1)); > - > va->va_start = addr; > va->va_end = addr + size; > va->flags = 0; > __insert_vmap_area(va); > + free_vmap_cache = &va->rb_node; > spin_unlock(&vmap_area_lock); > > + BUG_ON(va->va_start & (align-1)); > + BUG_ON(va->va_start < vstart); > + BUG_ON(va->va_end > vend); > + > return va; > } > > @@ -427,6 +472,19 @@ static void rcu_free_va(struct rcu_head > static void __free_vmap_area(struct vmap_area *va) > { > BUG_ON(RB_EMPTY_NODE(&va->rb_node)); > + > + if (free_vmap_cache) { > + if (va->va_end < cached_start) { > + free_vmap_cache = NULL; > + } else { > + struct vmap_area *cache; > + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + if (va->va_start <= cache->va_start) { > + free_vmap_cache = rb_prev(&va->rb_node); > + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + } > + } > + } > rb_erase(&va->rb_node, &vmap_area_root); > RB_CLEAR_NODE(&va->rb_node); > list_del_rcu(&va->list); Anyway, I am looking forard to seeing Steven's experiment. If test has no problem, I will remake refactoring patch based on your patch. :) Thanks, Nick. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-25 15:00 ` Minchan Kim @ 2010-05-25 15:48 ` Steven Whitehouse 0 siblings, 0 replies; 30+ messages in thread From: Steven Whitehouse @ 2010-05-25 15:48 UTC (permalink / raw) To: Minchan Kim; +Cc: Nick Piggin, Andrew Morton, linux-mm, linux-kernel Hi, On Wed, 2010-05-26 at 00:00 +0900, Minchan Kim wrote: > Anyway, I am looking forard to seeing Steven's experiment. > If test has no problem, I will remake refactoring patch based on your patch. :) > > Thanks, Nick. I gather that it might be a couple of days before our tester can run the tests as he is busy with something else at the moment. I'll get back to you as soon as I can. Apologies for the delay in testing, Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-05 16:16 ` Nick Piggin 2010-05-17 12:42 ` Steven Whitehouse 2010-05-19 13:54 ` Steven Whitehouse @ 2010-05-22 9:53 ` Minchan Kim 2010-05-24 6:23 ` Nick Piggin 2 siblings, 1 reply; 30+ messages in thread From: Minchan Kim @ 2010-05-22 9:53 UTC (permalink / raw) To: Nick Piggin; +Cc: Steven Whitehouse, Andrew Morton, linux-mm, linux-kernel Hi, Nick. Sorry for late review. On Thu, 2010-05-06 at 02:16 +1000, Nick Piggin wrote: > On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > > Hi, > > > > On Mon, 2010-05-03 at 02:29 +0900, Minchan Kim wrote: > > > Hi, Steven. > > > > > > Sorry for lazy response. > > > I wanted to submit the patch which implement Nick's request whole. > > > And unfortunately, I am so busy now. > > > But if it's urgent, I want to submit this one firstly and > > > at next version, maybe I will submit remained TODO things > > > after middle of May. > > > > > > I think this patch can't make regression other usages. > > > Nick. What do you think about? > > > > > I guess the question is whether the remaining items are essential for > > correct functioning of this patch, or whether they are "it would be nice > > if" items. I suspect that they are the latter (I'm not a VM expert, but > > from the brief descriptions it looks like that to me) in which case I'd > > suggest send the currently existing patch first and the following up > > with the remaining changes later. > > > > We have got a nice speed up with your current patch and so far as I'm > > aware not introduced any new bugs or regressions with it. > > > > Nick, does that sound ok? > > Just got around to looking at it again. I definitely agree we need to > fix the regression, however I'm concerned about introducing other > possible problems while doing that. > > The following patch should (modulo bugs, but it's somewhat tested) give > no difference in the allocation patterns, so won't introduce virtual > memory layout changes. > > Any chance you could test it? > > --- > mm/vmalloc.c | 49 +++++++++++++++++++++++++++++++++++-------------- > 1 files changed, 35 insertions(+), 14 deletions(-) > > Index: linux-2.6/mm/vmalloc.c > =================================================================== > --- linux-2.6.orig/mm/vmalloc.c > +++ linux-2.6/mm/vmalloc.c > @@ -262,8 +262,13 @@ struct vmap_area { > }; > > static DEFINE_SPINLOCK(vmap_area_lock); > -static struct rb_root vmap_area_root = RB_ROOT; > static LIST_HEAD(vmap_area_list); > +static struct rb_root vmap_area_root = RB_ROOT; > + > +static struct rb_node *free_vmap_cache; > +static unsigned long cached_hole_size; > +static unsigned long cached_start; > + > static unsigned long vmap_area_pcpu_hole; > > static struct vmap_area *__find_vmap_area(unsigned long addr) > @@ -332,6 +337,7 @@ static struct vmap_area *alloc_vmap_area > struct rb_node *n; > unsigned long addr; > int purged = 0; > + struct vmap_area *first; > > BUG_ON(!size); > BUG_ON(size & ~PAGE_MASK); > @@ -348,11 +354,23 @@ retry: > if (addr + size - 1 < addr) > goto overflow; > > - /* XXX: could have a last_hole cache */ > - n = vmap_area_root.rb_node; > - if (n) { > - struct vmap_area *first = NULL; > + if (size <= cached_hole_size || addr < cached_start || !free_vmap_cache) { Do we need !free_vmap_cache check? In __free_vmap_area, we already reset whole of variables when free_vmap_cache = NULL. 
> + cached_hole_size = 0; > + cached_start = addr; > + free_vmap_cache = NULL; > + } > > + /* find starting point for our search */ > + if (free_vmap_cache) { > + first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > + > + } else { > + n = vmap_area_root.rb_node; > + if (!n) > + goto found; > + > + first = NULL; > do { > struct vmap_area *tmp; > tmp = rb_entry(n, struct vmap_area, rb_node); > @@ -369,26 +387,36 @@ retry: > if (!first) > goto found; > > - if (first->va_end < addr) { > + if (first->va_start < addr) { I can't understand your intention. Why do you change va_end with va_start? > + BUG_ON(first->va_end < addr); And Why do you put this BUG_ON in here? Could you elaborate on logic? > n = rb_next(&first->rb_node); > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > if (n) > first = rb_entry(n, struct vmap_area, rb_node); > else > goto found; > } > + BUG_ON(first->va_start < addr); Ditto. > + if (addr + cached_hole_size < first->va_start) > + cached_hole_size = first->va_start - addr; > + } > > - while (addr + size > first->va_start && addr + size <= vend) { > - addr = ALIGN(first->va_end + PAGE_SIZE, align); > - if (addr + size - 1 < addr) > - goto overflow; > + /* from the starting point, walk areas until a suitable hole is found */ > > - n = rb_next(&first->rb_node); > - if (n) > - first = rb_entry(n, struct vmap_area, rb_node); > - else > - goto found; > - } > + while (addr + size > first->va_start && addr + size <= vend) { > + if (addr + cached_hole_size < first->va_start) > + cached_hole_size = first->va_start - addr; > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > + if (addr + size - 1 < addr) > + goto overflow; > + > + n = rb_next(&first->rb_node); > + if (n) > + first = rb_entry(n, struct vmap_area, rb_node); > + else > + goto found; > } > + > found: > if (addr + size > vend) { > overflow: > @@ -412,6 +440,7 @@ overflow: > va->va_end = addr + size; > va->flags = 0; > __insert_vmap_area(va); > + free_vmap_cache = &va->rb_node; > spin_unlock(&vmap_area_lock); > > return va; > @@ -427,6 +456,21 @@ static void rcu_free_va(struct rcu_head > static void __free_vmap_area(struct vmap_area *va) > { > BUG_ON(RB_EMPTY_NODE(&va->rb_node)); > + > + if (free_vmap_cache) { > + if (va->va_end < cached_start) { > + cached_hole_size = 0; > + cached_start = 0; > + free_vmap_cache = NULL; > + } else { > + struct vmap_area *cache; > + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + if (va->va_start <= cache->va_start) { > + free_vmap_cache = rb_prev(&va->rb_node); > + cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > + } > + } > + } > rb_erase(&va->rb_node, &vmap_area_root); > RB_CLEAR_NODE(&va->rb_node); > list_del_rcu(&va->list); Hmm. I will send refactoring version soon. If you don't mind, let's discuss in there. :) -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] cache last free vmap_area to avoid restarting beginning 2010-05-22 9:53 ` Minchan Kim @ 2010-05-24 6:23 ` Nick Piggin 0 siblings, 0 replies; 30+ messages in thread From: Nick Piggin @ 2010-05-24 6:23 UTC (permalink / raw) To: Minchan Kim; +Cc: Steven Whitehouse, Andrew Morton, linux-mm, linux-kernel On Sat, May 22, 2010 at 06:53:53PM +0900, Minchan Kim wrote: > Hi, Nick. > Sorry for late review. No problem, thanks for reviewing. > On Thu, 2010-05-06 at 02:16 +1000, Nick Piggin wrote: > > On Wed, May 05, 2010 at 01:48:48PM +0100, Steven Whitehouse wrote: > > @@ -348,11 +354,23 @@ retry: > > if (addr + size - 1 < addr) > > goto overflow; > > > > - /* XXX: could have a last_hole cache */ > > - n = vmap_area_root.rb_node; > > - if (n) { > > - struct vmap_area *first = NULL; > > + if (size <= cached_hole_size || addr < cached_start || !free_vmap_cache) { > > Do we need !free_vmap_cache check? > In __free_vmap_area, we already reset whole of variables when free_vmap_cache = NULL. You're right. > > + cached_hole_size = 0; > > + cached_start = addr; > > + free_vmap_cache = NULL; > > + } > > > > + /* find starting point for our search */ > > + if (free_vmap_cache) { > > + first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); > > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > > + > > + } else { > > + n = vmap_area_root.rb_node; > > + if (!n) > > + goto found; > > + > > + first = NULL; > > do { > > struct vmap_area *tmp; > > tmp = rb_entry(n, struct vmap_area, rb_node); > > @@ -369,26 +387,36 @@ retry: > > if (!first) > > goto found; > > > > - if (first->va_end < addr) { > > + if (first->va_start < addr) { > > I can't understand your intention. > Why do you change va_end with va_start? Because we don't want an area which is spanning the start address. And it makes subsequent logic simpler. > > + BUG_ON(first->va_end < addr); > > And Why do you put this BUG_ON in here? > Could you elaborate on logic? It seems this is wrong, so I've removed it. This is the BUG that Steven hit, but there is another bug in there that my stress tester wasn't triggering. > > > n = rb_next(&first->rb_node); > > + addr = ALIGN(first->va_end + PAGE_SIZE, align); > > if (n) > > first = rb_entry(n, struct vmap_area, rb_node); > > else > > goto found; > > } > > + BUG_ON(first->va_start < addr); > > Ditto. Don't want an area spanning start address, as above. > > rb_erase(&va->rb_node, &vmap_area_root); > > RB_CLEAR_NODE(&va->rb_node); > > list_del_rcu(&va->list); > > Hmm. I will send refactoring version soon. > If you don't mind, let's discuss in there. :) I just reworked the initial patch a little bit and fixed a bug in it, if we could instead do the refactoring on top of it, that would save me having to rediff? I'll post it shortly. Thanks, Nick -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: vmalloc performance 2010-04-18 15:14 ` Minchan Kim 2010-04-19 12:58 ` Steven Whitehouse @ 2010-04-19 13:38 ` Nick Piggin 2010-04-19 14:09 ` Minchan Kim 1 sibling, 1 reply; 30+ messages in thread From: Nick Piggin @ 2010-04-19 13:38 UTC (permalink / raw) To: Minchan Kim; +Cc: Steven Whitehouse, linux-mm, linux-kernel On Mon, Apr 19, 2010 at 12:14:09AM +0900, Minchan Kim wrote: > On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote: > Nick, What do you think about "free area cache" approach? Thanks, yep something like this is what I had in mind. Looks like you have some really nice speed improvements which is great. > In this version, I don't consider last hole and backward cache movement which is > like mmap's cached_hole_size > That's because I want to flush vmap_areas freed intentionally if we meet vend. > It makes flush frequent than old but it's trade-off. In addition, vmalloc isn't > critical compared to mmap about performance. So I think that's enough. > > If you don't opposed, I will repost formal patch without code related to debug. I think I would prefer to be a little smarter about using lower addresses first. I know the lazy TLB flushing works against this, but that is an important speed tradeoff, wheras there is not really any downside to trying hard to allocate low areas first. Keeping virtual addresses dense helps with locality of reference of page tables, for one. So I would like to see: - invalidating the cache in the case of vstart being decreased. - Don't unconditionally reset the cache to the last vm area freed, because you might have a higher area freed after a lower area. Only reset if the freed area is lower. - Do keep a cached hole size, so smaller lookups can restart a full search. Probably also at this point, moving some of the rbtree code (like the search code) into functions would manage the alloc_vmap_area complexity. Maybe do this one first if you're going to write a patchset. What do you think? Care to have a go? :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
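[Editorial aside] The requirements Nick lists in the mail above are easier to see in a self-contained model than in the rbtree code itself. Below is a userspace toy sketch (a sorted linked list standing in for the rbtree of vmap_areas) of the allocation side of such a free-area cache: resume the first-fit walk from the last inserted area unless the request is small enough that a hole already skipped below the cache might fit it, in which case fall back to a full search. It is illustrative only; it deliberately omits alignment, freeing, and the cache-invalidation rules for decreasing vstart and lower frees described above.

#include <stdio.h>
#include <stdlib.h>

struct area {
	unsigned long start, end;
	struct area *next;		/* list kept sorted by start address */
};

static struct area *areas;
static struct area *free_cache;		/* last area inserted */
static unsigned long cached_hole_size;	/* biggest hole known below the cache */

static unsigned long alloc_area(unsigned long size,
				unsigned long vstart, unsigned long vend)
{
	struct area *prev = NULL, *cur = areas, *new;
	unsigned long addr = vstart;

	if (free_cache && size > cached_hole_size) {
		/* No hole below the cache can fit this: resume after it. */
		prev = free_cache;
		cur = free_cache->next;
		addr = free_cache->end;
	} else {
		/* A suitable hole may exist below the cache: full search. */
		free_cache = NULL;
		cached_hole_size = 0;
	}

	for (; cur; prev = cur, cur = cur->next) {
		if (addr + size <= cur->start)
			break;		/* the hole before 'cur' is big enough */
		if (cur->start > addr && cur->start - addr > cached_hole_size)
			cached_hole_size = cur->start - addr;
		addr = cur->end;
	}
	if (addr + size > vend)
		return 0;		/* out of space */

	new = malloc(sizeof(*new));
	if (!new)
		return 0;
	new->start = addr;
	new->end = addr + size;
	new->next = cur;
	if (prev)
		prev->next = new;
	else
		areas = new;
	free_cache = new;		/* next search can resume from here */
	return addr;
}

int main(void)
{
	int i;

	for (i = 0; i < 8; i++)
		printf("area %d at %#lx\n", i,
		       alloc_area(0x1000, 0x1000, 0x100000));
	return 0;
}

In this toy, a stream of same-sized allocations never rescans existing areas, which is the effect the vmalloc patches above are after; the cached_hole_size test is what lets occasional smaller requests still find space that the cached position has already walked past.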
* Re: vmalloc performance 2010-04-19 13:38 ` vmalloc performance Nick Piggin @ 2010-04-19 14:09 ` Minchan Kim 0 siblings, 0 replies; 30+ messages in thread From: Minchan Kim @ 2010-04-19 14:09 UTC (permalink / raw) To: Nick Piggin; +Cc: Steven Whitehouse, linux-mm, linux-kernel On Mon, Apr 19, 2010 at 10:38 PM, Nick Piggin <npiggin@suse.de> wrote: > On Mon, Apr 19, 2010 at 12:14:09AM +0900, Minchan Kim wrote: >> On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote: >> Nick, What do you think about "free area cache" approach? > > Thanks, yep something like this is what I had in mind. Looks like you > have some really nice speed improvements which is great. > > >> In this version, I don't consider last hole and backward cache movement which is >> like mmap's cached_hole_size >> That's because I want to flush vmap_areas freed intentionally if we meet vend. >> It makes flush frequent than old but it's trade-off. In addition, vmalloc isn't >> critical compared to mmap about performance. So I think that's enough. >> >> If you don't opposed, I will repost formal patch without code related to debug. > > I think I would prefer to be a little smarter about using lower > addresses first. I know the lazy TLB flushing works against this, but > that is an important speed tradeoff, wheras there is not really any > downside to trying hard to allocate low areas first. Keeping virtual > addresses dense helps with locality of reference of page tables, for > one. > > So I would like to see: > - invalidating the cache in the case of vstart being decreased. > - Don't unconditionally reset the cache to the last vm area freed, > because you might have a higher area freed after a lower area. Only > reset if the freed area is lower. > - Do keep a cached hole size, so smaller lookups can restart a full > search. Firstly, I considered it which is used by mmap. But I thought it might be overkill since vmalloc space isn't large compared to mmaped addresses. I should have thought about locality of reference of page tables. ;-) > Probably also at this point, moving some of the rbtree code (like the > search code) into functions would manage the alloc_vmap_area complexity. > Maybe do this one first if you're going to write a patchset. > > What do you think? Care to have a go? :) Good. I will add your requirements to TODO list. But don't wait me. If you care to have a go, RUN!!! I am looking forward to seeing your awesome patches. :) Thanks for careful review, Nick. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: vmalloc performance 2010-04-15 8:33 ` Steven Whitehouse 2010-04-15 16:51 ` Minchan Kim @ 2010-04-16 6:12 ` Nick Piggin 2010-04-16 7:20 ` Minchan Kim 2010-04-16 8:50 ` Steven Whitehouse 1 sibling, 2 replies; 30+ messages in thread From: Nick Piggin @ 2010-04-16 6:12 UTC (permalink / raw) To: Steven Whitehouse; +Cc: Minchan Kim, linux-mm, linux-kernel On Thu, Apr 15, 2010 at 09:33:08AM +0100, Steven Whitehouse wrote: > Hi, > > On Thu, 2010-04-15 at 01:35 +0900, Minchan Kim wrote: > > On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote: > > > On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho@redhat.com> wrote: > > > >> When this module is run on my x86_64, 8 core, 12 Gb machine, then on an > > > >> otherwise idle system I get the following results: > > > >> > > > >> vmalloc took 148798983 us > > > >> vmalloc took 151664529 us > > > >> vmalloc took 152416398 us > > > >> vmalloc took 151837733 us > > > >> > > > >> After applying the two line patch (see the same bz) which disabled the > > > >> delayed removal of the structures, which appears to be intended to > > > >> improve performance in the smp case by reducing TLB flushes across cpus, > > > >> I get the following results: > > > >> > > > >> vmalloc took 15363634 us > > > >> vmalloc took 15358026 us > > > >> vmalloc took 15240955 us > > > >> vmalloc took 15402302 us > > > > > > > >> > > > >> So thats a speed up of around 10x, which isn't too bad. The question is > > > >> whether it is possible to come to a compromise where it is possible to > > > >> retain the benefits of the delayed TLB flushing code, but reduce the > > > >> overhead for other users. My two line patch basically disables the delay > > > >> by forcing a removal on each and every vfree. > > > >> > > > >> What is the correct way to fix this I wonder? > > > >> > > > >> Steve. > > > >> > > > > In my case(2 core, mem 2G system), 50300661 vs 11569357. > > It improves 4 times. > > > Looking at the code, it seems that the limit, against which my patch > removes a test, scales according to the number of cpu cores. So with > more cores, I'd expect the difference to be greater. I have a feeling > that the original reporter had a greater number than the 8 of my test > machine. > > > It would result from larger number of lazy_max_pages. > > It would prevent many vmap_area freed. > > So alloc_vmap_area takes long time to find new vmap_area. (ie, lookup > > rbtree) > > > > How about calling purge_vmap_area_lazy at the middle of loop in > > alloc_vmap_area if rbtree lookup were long? > > > That may be a good solution - I'm happy to test any patches but my worry > is that any change here might result in a regression in whatever > workload the lazy purge code was originally designed to improve. Is > there any way to test that I wonder? Ah this is interesting. What we could do is have a "free area cache" like the user virtual memory allocator has, which basically avoids restarting the search from scratch. Or we could perhaps go one better and do a more sophisticated free space allocator. Bigger systems will indeed get hurt by increasing flushes so I'd prefer to avoid that. But that's not a good justification for a slowdown for small systems. What good is having cake if you can't also eat it? :) > > BTW, Steve. Is is real issue or some test? > > I doubt such vmalloc bomb workload is real. > Well the answer is both yes and no :-) So this is how I came across the > issue.
I received a report that GFS2 performance had regressed in recent > kernels in relation to a test which basically fires lots of requests at > it via NFS. The reporter of this problem gave me two bits of > information: firstly that by eliminating all readdir calls from the > test, the regression is never seen and secondly that oprofile showed > that two functions related to vmalloc (rb_next, find_vmap_area, > alloc_vmap_area in that order) were taking between them about 60% of the > total cpu time. Thanks for tracking this down. I didn't realize GFS2 used vmalloc extensively. How large are typical vmalloc requests here, can you tell me? There is a per-cpu virtual memory allocator that is more scalable than the global one, and would help avoid these problems too. XFS is using it at the moment, but we are looking for some more users of the API so as to get more testing coverage. I was considering moving vmalloc over to use it (vm_map_ram). It's still probably a good idea to improve the global allocator regression first, but that might get you even more performance. Thanks, Nick -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
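For readers following the thread, the per-cpu interface Nick refers to here is vm_map_ram()/vm_unmap_ram(). The fragment below is only a sketch of how a caller might use it for a short-lived buffer in place of vmalloc()/vfree(); it is not code from GFS2 or XFS, and the helper names are invented.

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Map nr_pages freshly allocated pages; small requests are served from
 * per-cpu vmap blocks rather than the global vmap_area rbtree. */
static void *scratch_map(struct page **pages, unsigned int nr_pages)
{
        void *addr;
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                pages[i] = alloc_page(GFP_KERNEL);
                if (!pages[i])
                        goto out_free;
        }
        addr = vm_map_ram(pages, nr_pages, -1, PAGE_KERNEL);   /* -1: any NUMA node */
        if (addr)
                return addr;

out_free:
        while (i--)
                __free_page(pages[i]);
        return NULL;
}

static void scratch_unmap(void *addr, struct page **pages, unsigned int nr_pages)
{
        unsigned int i;

        vm_unmap_ram(addr, nr_pages);
        for (i = 0; i < nr_pages; i++)
                __free_page(pages[i]);
}

The unmap side still flushes lazily, but the allocation fast path for small mappings stays per cpu, avoiding the global rbtree walk that dominates the oprofile numbers quoted above.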
* Re: vmalloc performance 2010-04-16 6:12 ` Nick Piggin @ 2010-04-16 7:20 ` Minchan Kim 0 siblings, 0 replies; 30+ messages in thread From: Minchan Kim @ 2010-04-16 7:20 UTC (permalink / raw) To: Nick Piggin; +Cc: Steven Whitehouse, linux-mm, linux-kernel On Fri, Apr 16, 2010 at 3:12 PM, Nick Piggin <npiggin@suse.de> wrote: > On Thu, Apr 15, 2010 at 09:33:08AM +0100, Steven Whitehouse wrote: >> Hi, >> >> On Thu, 2010-04-15 at 01:35 +0900, Minchan Kim wrote: >> > On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote: >> > > On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho@redhat.com> wrote: >> > > >> When this module is run on my x86_64, 8 core, 12 Gb machine, then on an >> > > >> otherwise idle system I get the following results: >> > > >> >> > > >> vmalloc took 148798983 us >> > > >> vmalloc took 151664529 us >> > > >> vmalloc took 152416398 us >> > > >> vmalloc took 151837733 us >> > > >> >> > > >> After applying the two line patch (see the same bz) which disabled the >> > > >> delayed removal of the structures, which appears to be intended to >> > > >> improve performance in the smp case by reducing TLB flushes across cpus, >> > > >> I get the following results: >> > > >> >> > > >> vmalloc took 15363634 us >> > > >> vmalloc took 15358026 us >> > > >> vmalloc took 15240955 us >> > > >> vmalloc took 15402302 us >> > >> > >> > > >> >> > > >> So thats a speed up of around 10x, which isn't too bad. The question is >> > > >> whether it is possible to come to a compromise where it is possible to >> > > >> retain the benefits of the delayed TLB flushing code, but reduce the >> > > >> overhead for other users. My two line patch basically disables the delay >> > > >> by forcing a removal on each and every vfree. >> > > >> >> > > >> What is the correct way to fix this I wonder? >> > > >> >> > > >> Steve. >> > > >> >> > >> > In my case(2 core, mem 2G system), 50300661 vs 11569357. >> > It improves 4 times. >> > >> Looking at the code, it seems that the limit, against which my patch >> removes a test, scales according to the number of cpu cores. So with >> more cores, I'd expect the difference to be greater. I have a feeling >> that the original reporter had a greater number than the 8 of my test >> machine. >> >> > It would result from larger number of lazy_max_pages. >> > It would prevent many vmap_area freed. >> > So alloc_vmap_area takes long time to find new vmap_area. (ie, lookup >> > rbtree) >> > >> > How about calling purge_vmap_area_lazy at the middle of loop in >> > alloc_vmap_area if rbtree lookup were long? >> > >> That may be a good solution - I'm happy to test any patches but my worry >> is that any change here might result in a regression in whatever >> workload the lazy purge code was originally designed to improve. Is >> there any way to test that I wonder? > > Ah this is interesting. What we could do is have a "free area cache" > like the user virtual memory allocator has, which basically avoids > restarting the search from scratch. > > Or we could perhaps go one better and do a more sophisticated free space > allocator. AFAIR, this is the first report of a vmalloc performance regression; I am not sure whether others have suffered from it and simply didn't report it. Anyway, on the strength of a first report, implementing a complicated allocator seems rather overkill, I think. So I vote for a free_area_cache. Starting the lookup from the last cached point means we reach the end of the space (overflow) sooner, and that results in a flush. I think that's good in that it doesn't depend on the system's resources.
And it could improve the search time compared with starting from scratch, unless we are very unlucky. > > Bigger systems will indeed get hurt by increasing flushes so I'd prefer > to avoid that. But that's not a good justification for a slowdown for > small systems. What good is having cake if you can't also eat it? :) Indeed. :) -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
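The trade-off described here, searching from a cached point and falling back to a purge plus a full restart when the search runs into vend, can be pictured with the sketch below, continuing the invented names from the earlier sketch. find_space_from() and cached_position() are assumptions made for the illustration; purge_vmap_area_lazy() is the existing function already named in this thread.

/* Illustrative allocation path only: resume from the cached position and
 * purge the lazily freed areas once if the search overflows vend. */
static unsigned long alloc_va_range(unsigned long size,
                                    unsigned long vstart, unsigned long vend)
{
        unsigned long addr;
        bool purged = false;

retry:
        if (may_use_cache(size, vstart))
                addr = find_space_from(cached_position(), size, vend);
        else
                addr = find_space_from(vstart, size, vend);

        if (addr + size > vend) {
                if (!purged) {
                        /* Reclaim lazily freed areas, drop the cache, and
                         * search once more from the start of the space. */
                        purge_vmap_area_lazy();
                        free_vmap_cache = NULL;
                        cached_hole_size = 0;
                        purged = true;
                        goto retry;
                }
                return 0;       /* genuinely out of vmalloc space */
        }
        return addr;
}

Because the lazy threshold scales with the number of CPUs, a big machine accumulates many more dead areas between purges; starting from a cached position keeps the common case cheap, at the cost of reaching vend (and therefore purging) somewhat sooner than a search that always begins at vstart.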
* Re: vmalloc performance 2010-04-16 6:12 ` Nick Piggin 2010-04-16 7:20 ` Minchan Kim @ 2010-04-16 8:50 ` Steven Whitehouse 1 sibling, 0 replies; 30+ messages in thread From: Steven Whitehouse @ 2010-04-16 8:50 UTC (permalink / raw) To: Nick Piggin; +Cc: Minchan Kim, linux-mm, linux-kernel Hi, On Fri, 2010-04-16 at 16:12 +1000, Nick Piggin wrote: > On Thu, Apr 15, 2010 at 09:33:08AM +0100, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2010-04-15 at 01:35 +0900, Minchan Kim wrote: > > > On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote: > > > > On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho@redhat.com> wrote: > > > > >> When this module is run on my x86_64, 8 core, 12 Gb machine, then on an > > > > >> otherwise idle system I get the following results: > > > > >> > > > > >> vmalloc took 148798983 us > > > > >> vmalloc took 151664529 us > > > > >> vmalloc took 152416398 us > > > > >> vmalloc took 151837733 us > > > > >> > > > > >> After applying the two line patch (see the same bz) which disabled the > > > > >> delayed removal of the structures, which appears to be intended to > > > > >> improve performance in the smp case by reducing TLB flushes across cpus, > > > > >> I get the following results: > > > > >> > > > > >> vmalloc took 15363634 us > > > > >> vmalloc took 15358026 us > > > > >> vmalloc took 15240955 us > > > > >> vmalloc took 15402302 us > > > > > > > > > > >> > > > > >> So thats a speed up of around 10x, which isn't too bad. The question is > > > > >> whether it is possible to come to a compromise where it is possible to > > > > >> retain the benefits of the delayed TLB flushing code, but reduce the > > > > >> overhead for other users. My two line patch basically disables the delay > > > > >> by forcing a removal on each and every vfree. > > > > >> > > > > >> What is the correct way to fix this I wonder? > > > > >> > > > > >> Steve. > > > > >> > > > > > > In my case(2 core, mem 2G system), 50300661 vs 11569357. > > > It improves 4 times. > > > > > Looking at the code, it seems that the limit, against which my patch > > removes a test, scales according to the number of cpu cores. So with > > more cores, I'd expect the difference to be greater. I have a feeling > > that the original reporter had a greater number than the 8 of my test > > machine. > > > > > It would result from larger number of lazy_max_pages. > > > It would prevent many vmap_area freed. > > > So alloc_vmap_area takes long time to find new vmap_area. (ie, lookup > > > rbtree) > > > > > > How about calling purge_vmap_area_lazy at the middle of loop in > > > alloc_vmap_area if rbtree lookup were long? > > > > > That may be a good solution - I'm happy to test any patches but my worry > > is that any change here might result in a regression in whatever > > workload the lazy purge code was originally designed to improve. Is > > there any way to test that I wonder? > > Ah this is interesting. What we could do is have a "free area cache" > like the user virtual memory allocator has, which basically avoids > restarting the search from scratch. > > Or we could perhaps go one better and do a more sophisticated free space > allocator. > > Bigger systems will indeed get hurt by increasing flushes so I'd prefer > to avoid that. But that's not a good justification for a slowdown for > small systems. What good is having cake if you can't also eat it? :) > I'm all for cake, particularly if it's lemon cake :-) > > > > BTW, Steve. Is is real issue or some test? > > > I doubt such vmalloc bomb workload is real.
> > > > Well the answer is both yes and no :-) So this is how I came across the > > issue. I received a report that GFS2 performance had regressed in recent > > kernels in relation to a test which basically fires lots of requests at > > it via NFS. The reporter of this problem gave me two bits of > > information: firstly that by eliminating all readdir calls from the > > test, the regression is never seen and secondly that oprofile showed > > that two functions related to vmalloc (rb_next, find_vmap_area, > > alloc_vmap_area in that order) were taking between them about 60% of the > > total cpu time. > > Thanks for tracking this down. I didn't realize GFS2 used vmalloc > extensively. How large are typical vmalloc requests here, can you > tell me? There is a per-cpu virtual memory allocator that is more > scalable than the global one, and would help avoid these problems > too. > > XFS is using it at the moment, but we are looking for some more > users of the API so as to get more testing coverage. I was > considering moving vmalloc over to use it (vm_map_ram). > > It's still probably a good idea to improve the global allocator > regression first, but that might get you even more performance. > > Thanks, > Nick > Well, I wouldn't say extensively... it's used just once in readdir. Even then we only use it for larger directories. We use it for two things, basically as a temporary buffer to record pointers to all the "leaf blocks" in one hash chain, and also as a temporary buffer to record pointers to all the directory entries in the same hash chain. The only reason that it's used to keep track of the pointers to the leaf blocks themselves is simply that it was easier than having two separate allocations. The reason that we need a list of pointers to hash entries is so that we can feed the resulting buffer to sort() in order to put the entries into hash order. Sorting into hash order isn't really the optimal way to return the entries in readdir() but due to the slightly odd way in which directories expand as entries are added, it is the only ordering which allows us to be certain of not listing entries twice or missing entries if insertions are made by one process while another process is making successive calls to readdir(). The per-cpu virtual memory allocator, though, sounds like a better fit for GFS2's needs here, so we should look into using that in future I think. As for the size of the allocations, that depends entirely on the directory size. It could be anything from a single page to a couple of dozen or more. For the test which the original reporter was running, I suspect that it would be multiple pages, but probably less than 10. If a readdir spans multiple hash chains, then it's possible that there will be two or more calls to vmalloc/vfree per readdir. However since readdir calls tend to use buffers based on the inode's optimal I/O size, it's pretty unlikely that this will happen very often, and even then it's only likely to span two hash chains at most. Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 30+ messages in thread
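To picture the readdir pattern Steve describes, a short-lived vmalloc()ed array of entry references per hash chain, sorted with sort() and then freed, here is a rough sketch. The structure and function names are invented for the illustration and are not GFS2 code.

#include <linux/errno.h>
#include <linux/sort.h>
#include <linux/types.h>
#include <linux/vmalloc.h>

/* Hypothetical stand-in for a reference to one directory entry. */
struct dent_ref {
        u32 hash;
        const void *dent;       /* points back into a leaf block */
};

static int dent_ref_cmp(const void *a, const void *b)
{
        const struct dent_ref *ra = a, *rb = b;

        if (ra->hash == rb->hash)
                return 0;
        return ra->hash < rb->hash ? -1 : 1;
}

/* One buffer per hash chain per readdir call: typically a page or a few,
 * occasionally a couple of dozen, and it lives only for this call. */
static int emit_chain_in_hash_order(unsigned int nr_dents)
{
        struct dent_ref *refs;

        refs = vmalloc(nr_dents * sizeof(*refs));
        if (!refs)
                return -ENOMEM;
        /* ... walk the leaf blocks of the chain and fill refs[] ... */
        sort(refs, nr_dents, sizeof(*refs), dent_ref_cmp, NULL);
        /* ... emit refs[] to the caller in hash order ... */
        vfree(refs);
        return 0;
}

So every readdir over a large directory pays at least one vmalloc()/vfree() round trip, which is why alloc_vmap_area() and the lazy purge machinery show up so prominently in the NFS test; a buffer of this size would also fit comfortably within the small-mapping path of the vm_map_ram() interface discussed earlier in the thread.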
end of thread, other threads:[~2010-05-25 15:44 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-12 16:27 vmalloc performance Steven Whitehouse 2010-04-14 12:49 ` Steven Whitehouse 2010-04-14 14:24 ` Steven Whitehouse 2010-04-14 15:12 ` Minchan Kim 2010-04-14 15:13 ` Minchan Kim 2010-04-14 16:35 ` Minchan Kim 2010-04-15 8:33 ` Steven Whitehouse 2010-04-15 16:51 ` Minchan Kim 2010-04-16 14:10 ` Steven Whitehouse 2010-04-18 15:14 ` Minchan Kim 2010-04-19 12:58 ` Steven Whitehouse 2010-04-19 14:12 ` Minchan Kim 2010-04-29 13:43 ` Steven Whitehouse 2010-05-02 17:29 ` [PATCH] cache last free vmap_area to avoid restarting beginning Minchan Kim 2010-05-05 12:48 ` Steven Whitehouse 2010-05-05 16:16 ` Nick Piggin 2010-05-17 12:42 ` Steven Whitehouse 2010-05-18 13:44 ` Steven Whitehouse 2010-05-19 13:54 ` Steven Whitehouse 2010-05-19 13:56 ` Nick Piggin 2010-05-25 8:43 ` Nick Piggin 2010-05-25 15:00 ` Minchan Kim 2010-05-25 15:48 ` Steven Whitehouse 2010-05-22 9:53 ` Minchan Kim 2010-05-24 6:23 ` Nick Piggin 2010-04-19 13:38 ` vmalloc performance Nick Piggin 2010-04-19 14:09 ` Minchan Kim 2010-04-16 6:12 ` Nick Piggin 2010-04-16 7:20 ` Minchan Kim 2010-04-16 8:50 ` Steven Whitehouse