* mmap() scalability in the presence of the MAP_POPULATE flag
@ 2013-01-02 16:50 Roman Dubtsov
  2013-01-03  0:09 ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread

From: Roman Dubtsov @ 2013-01-02 16:50 UTC (permalink / raw)
To: linux-kernel

Concurrent mmap() calls from the same process are serialized by taking
mm->mmap_sem for write. This means that operations which do not alter
vmas, such as populating the pages, are also performed serially.
Anecdotal data from two machines I have access to shows that populating
pages by touching them in a loop outside of mmap() improves performance
of a synthetic micro-benchmark by ~40% in the worst case. A crude patch
that modifies vm_mmap_pgoff() to call make_pages_present() outside of
do_mmap_pgoff(), after upping the semaphore, when MAP_POPULATE is
present brings an identical performance improvement.

Is there any interest in fixing this, or are concurrent mmap() calls
from the same process too much of a corner case to worry about?

Regards,
Roma

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-02 16:50 mmap() scalability in the presence of the MAP_POPULATE flag Roman Dubtsov
@ 2013-01-03  0:09 ` Michel Lespinasse
  2013-01-03 17:09   ` Roman Dubtsov
  0 siblings, 1 reply; 7+ messages in thread

From: Michel Lespinasse @ 2013-01-03  0:09 UTC (permalink / raw)
To: Roman Dubtsov
Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton, Hugh Dickins

On Wed, Jan 2, 2013 at 8:50 AM, Roman Dubtsov <dubtsov@gmail.com> wrote:
> Concurrent mmap() calls from the same process are serialized by taking
> mm->mmap_sem for write. This means that operations which do not alter
> vmas, such as populating the pages, are also performed serially.
> Anecdotal data from two machines I have access to shows that populating
> pages by touching them in a loop outside of mmap() improves performance
> of a synthetic micro-benchmark by ~40% in the worst case. A crude patch
> that modifies vm_mmap_pgoff() to call make_pages_present() outside of
> do_mmap_pgoff(), after upping the semaphore, when MAP_POPULATE is
> present brings an identical performance improvement.
>
> Is there any interest in fixing this, or are concurrent mmap() calls
> from the same process too much of a corner case to worry about?

Funny this comes up again. I actually have a patch series that is
supposed to do that:
[PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held

However, the patches are still pending, didn't get much review
(probably not enough for Andrew to take them at this point), and I
think everyone forgot about them during the winter break.

Care to have a look at that thread and see if it works for you?

(caveat: you will possibly also need "[PATCH 10/9] mm: make
do_mmap_pgoff return populate as a size in bytes, not as a bool" to
make the series actually work for you)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-03  0:09 ` Michel Lespinasse
@ 2013-01-03 17:09   ` Roman Dubtsov
  2013-01-04 11:57     ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread

From: Roman Dubtsov @ 2013-01-03 17:09 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton, Hugh Dickins

On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > Is there any interest in fixing this, or are concurrent mmap() calls
> > from the same process too much of a corner case to worry about?
>
> Funny this comes up again. I actually have a patch series that is
> supposed to do that:
> [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
>
> However, the patches are still pending, didn't get much review
> (probably not enough for Andrew to take them at this point), and I
> think everyone forgot about them during the winter break.
>
> Care to have a look at that thread and see if it works for you?
>
> (caveat: you will possibly also need "[PATCH 10/9] mm: make
> do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> make the series actually work for you)

I applied the patches on top of 3.7.1. Here are the results for 4
threads concurrently mmap()-ing 10 64MB buffers in a loop without
munmap()-s. The data is from a Nehalem i7-920 single-socket 4-core CPU.
I've also added the older data I have for 3.6.11 (patched and not) for
reference.

3.6.11 vanilla, do not populate:                 0.001 seconds
3.6.11 vanilla, populate via a loop:             0.216 seconds
3.6.11 vanilla, populate via MAP_POPULATE:       0.358 seconds

3.6.11 + crude patch, do not populate:           0.002 seconds
3.6.11 + crude patch, populate via loop:         0.215 seconds
3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds

3.7.1 vanilla, do not populate:                  0.001 seconds
3.7.1 vanilla, populate via a loop:              0.216 seconds
3.7.1 vanilla, populate via MAP_POPULATE:        0.411 seconds

3.7.1 + patch series, do not populate:           0.001 seconds
3.7.1 + patch series, populate via loop:         0.216 seconds
3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds

So, the patch series mentioned above does improve performance, but as
far as I can read the benchmarking data, there is still some
performance left on the table.

Regards,
Roma

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-03 17:09 ` Roman Dubtsov
@ 2013-01-04 11:57   ` Michel Lespinasse
  2013-01-05  6:40     ` Roman Dubtsov
  0 siblings, 1 reply; 7+ messages in thread

From: Michel Lespinasse @ 2013-01-04 11:57 UTC (permalink / raw)
To: Roman Dubtsov
Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton, Hugh Dickins

On Fri, Jan 04, 2013 at 12:09:37AM +0700, Roman Dubtsov wrote:
> On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > > Is there any interest in fixing this, or are concurrent mmap() calls
> > > from the same process too much of a corner case to worry about?
> >
> > Funny this comes up again. I actually have a patch series that is
> > supposed to do that:
> > [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
> >
> > However, the patches are still pending, didn't get much review
> > (probably not enough for Andrew to take them at this point), and I
> > think everyone forgot about them during the winter break.
> >
> > Care to have a look at that thread and see if it works for you?
> >
> > (caveat: you will possibly also need "[PATCH 10/9] mm: make
> > do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> > make the series actually work for you)
>
> I applied the patches on top of 3.7.1. Here are the results for 4
> threads concurrently mmap()-ing 10 64MB buffers in a loop without
> munmap()-s. The data is from a Nehalem i7-920 single-socket 4-core CPU.
> I've also added the older data I have for 3.6.11 (patched and not) for
> reference.
>
> 3.6.11 vanilla, do not populate:                 0.001 seconds
> 3.6.11 vanilla, populate via a loop:             0.216 seconds
> 3.6.11 vanilla, populate via MAP_POPULATE:       0.358 seconds
>
> 3.6.11 + crude patch, do not populate:           0.002 seconds
> 3.6.11 + crude patch, populate via loop:         0.215 seconds
> 3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds
>
> 3.7.1 vanilla, do not populate:                  0.001 seconds
> 3.7.1 vanilla, populate via a loop:              0.216 seconds
> 3.7.1 vanilla, populate via MAP_POPULATE:        0.411 seconds
>
> 3.7.1 + patch series, do not populate:           0.001 seconds
> 3.7.1 + patch series, populate via loop:         0.216 seconds
> 3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds
>
> So, the patch series mentioned above does improve performance, but as
> far as I can read the benchmarking data, there is still some
> performance left on the table.

Interesting. I expect you are using anon memory, so it's likely that
mm_populate() holds the mmap_sem read side for the entire duration of
the 64MB populate.

Just curious, does the following help?

diff --git a/mm/memory.c b/mm/memory.c
index e4ab66b94bb8..f65a4b3b2141 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1627,6 +1627,12 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
 		stack_guard_page_end(vma, addr+PAGE_SIZE);
 }
 
+/* not upstreamable as is, just for the sake of testing */
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	return (sem->count < 0);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -1854,6 +1860,11 @@ next_page:
 			i++;
 			start += PAGE_SIZE;
 			nr_pages--;
+			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
+				up_read(&mm->mmap_sem);
+				*nonblocking = 0;
+				return i;
+			}
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;

Linus didn't like rwsem_is_contended() when I implemented the mlock
side of this a couple years ago, but maybe we can change his mind now.

If this doesn't help, could you please send me your test case? I
think you described enough of it that I would be able to reproduce it
given some time, but it's just easier if you send me a short C file :)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-04 11:57 ` Michel Lespinasse
@ 2013-01-05  6:40   ` Roman Dubtsov
  2013-01-05  7:43     ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread

From: Roman Dubtsov @ 2013-01-05  6:40 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 4223 bytes --]

On Fri, 2013-01-04 at 03:57 -0800, Michel Lespinasse wrote:
> On Fri, Jan 04, 2013 at 12:09:37AM +0700, Roman Dubtsov wrote:
> > On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > > > Is there any interest in fixing this, or are concurrent mmap() calls
> > > > from the same process too much of a corner case to worry about?
> > >
> > > Funny this comes up again. I actually have a patch series that is
> > > supposed to do that:
> > > [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
> > >
> > > However, the patches are still pending, didn't get much review
> > > (probably not enough for Andrew to take them at this point), and I
> > > think everyone forgot about them during the winter break.
> > >
> > > Care to have a look at that thread and see if it works for you?
> > >
> > > (caveat: you will possibly also need "[PATCH 10/9] mm: make
> > > do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> > > make the series actually work for you)
> >
> > I applied the patches on top of 3.7.1. Here are the results for 4
> > threads concurrently mmap()-ing 10 64MB buffers in a loop without
> > munmap()-s. The data is from a Nehalem i7-920 single-socket 4-core CPU.
> > I've also added the older data I have for 3.6.11 (patched and not) for
> > reference.
> >
> > 3.6.11 vanilla, do not populate:                 0.001 seconds
> > 3.6.11 vanilla, populate via a loop:             0.216 seconds
> > 3.6.11 vanilla, populate via MAP_POPULATE:       0.358 seconds
> >
> > 3.6.11 + crude patch, do not populate:           0.002 seconds
> > 3.6.11 + crude patch, populate via loop:         0.215 seconds
> > 3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds
> >
> > 3.7.1 vanilla, do not populate:                  0.001 seconds
> > 3.7.1 vanilla, populate via a loop:              0.216 seconds
> > 3.7.1 vanilla, populate via MAP_POPULATE:        0.411 seconds
> >
> > 3.7.1 + patch series, do not populate:           0.001 seconds
> > 3.7.1 + patch series, populate via loop:         0.216 seconds
> > 3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds
> >
> > So, the patch series mentioned above does improve performance, but as
> > far as I can read the benchmarking data, there is still some
> > performance left on the table.
>
> Interesting. I expect you are using anon memory, so it's likely that
> mm_populate() holds the mmap_sem read side for the entire duration of
> the 64MB populate.
>
> Just curious, does the following help?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e4ab66b94bb8..f65a4b3b2141 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1627,6 +1627,12 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
>  		stack_guard_page_end(vma, addr+PAGE_SIZE);
>  }
>  
> +/* not upstreamable as is, just for the sake of testing */
> +static inline int rwsem_is_contended(struct rw_semaphore *sem)
> +{
> +	return (sem->count < 0);
> +}
> +
>  /**
>   * __get_user_pages() - pin user pages in memory
>   * @tsk:	task_struct of target task
> @@ -1854,6 +1860,11 @@ next_page:
>  			i++;
>  			start += PAGE_SIZE;
>  			nr_pages--;
> +			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
> +				up_read(&mm->mmap_sem);
> +				*nonblocking = 0;
> +				return i;
> +			}
>  		} while (nr_pages && start < vma->vm_end);
>  	} while (nr_pages);
>  	return i;
>
> Linus didn't like rwsem_is_contended() when I implemented the mlock
> side of this a couple years ago, but maybe we can change his mind now.
>
> If this doesn't help, could you please send me your test case? I
> think you described enough of it that I would be able to reproduce it
> given some time, but it's just easier if you send me a short C file :)

It does not, the results are more or less the same. I've attached my
test case. It does map anonymous memory. It also uses OpenMP for
threading because I'm lazy, so it requires passing -fopenmp to gcc, and
the number of threads it runs is defined via the OMP_NUM_THREADS
environment variable. There are also two macros that influence the
test's behavior:

- POPULATE_VIA_LOOP -- makes the test populate memory using a loop
- POPULATE_VIA_MMAP -- makes the test populate memory via MAP_POPULATE

If none of the macros are defined, the test does not populate memory.
[-- Attachment #2: t.c --]
[-- Type: text/x-csrc, Size: 792 bytes --]

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include "omp.h"

#ifndef BUF_SIZE
#define BUF_SIZE (64 * 1024 * 1024)
#endif

#ifndef PAGE_SIZE
#define PAGE_SIZE (4 * 1024)
#endif

#ifdef POPULATE_VIA_MMAP
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE)
#else
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE)
#endif

int main(int argc, char **argv)
{
#pragma omp parallel
	{
	}

	double t0 = omp_get_wtime();
#pragma omp parallel
	{
		int i;
		for (i = 0; i < 10; i++) {
			char *p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
				       MMAP_FLAGS, -1, 0);
#ifdef POPULATE_VIA_LOOP
			size_t j;
			for (j = 0; j < BUF_SIZE; j += PAGE_SIZE)
				p[j] = 0;
#endif
		}
	}
	double t1 = omp_get_wtime();

	printf("%d\t%f\n", omp_get_max_threads(), t1 - t0);

	return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-05  6:40 ` Roman Dubtsov
@ 2013-01-05  7:43   ` Michel Lespinasse
  2013-01-31  0:31     ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread

From: Michel Lespinasse @ 2013-01-05  7:43 UTC (permalink / raw)
To: Roman Dubtsov
Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton, Hugh Dickins

On Fri, Jan 4, 2013 at 10:40 PM, Roman Dubtsov <dubtsov@gmail.com> wrote:
> On Fri, 2013-01-04 at 03:57 -0800, Michel Lespinasse wrote:
>> If this doesn't help, could you please send me your test case? I
>> think you described enough of it that I would be able to reproduce it
>> given some time, but it's just easier if you send me a short C file :)
>
> It does not, the results are more or less the same. I've attached my
> test case. It does map anonymous memory. It also uses OpenMP for
> threading because I'm lazy, so it requires passing -fopenmp to gcc, and
> the number of threads it runs is defined via the OMP_NUM_THREADS
> environment variable. There are also two macros that influence the
> test's behavior:
>
> - POPULATE_VIA_LOOP -- makes the test populate memory using a loop
> - POPULATE_VIA_MMAP -- makes the test populate memory via MAP_POPULATE
>
> If none of the macros are defined, the test does not populate memory.

Heh, very interesting. As it turns out, the problem gets MUCH worse as
the number of threads increases. We are populating the anon mapping
with huge pages.

In the POPULATE_VIA_LOOP case, we are just taking a page fault every
2MB and filling it up with a zeroed huge page - most of the runtime
comes from clearing the huge page.

In the POPULATE_VIA_MMAP case, follow_page() is called at 4KB
increments, and it takes the mm->page_table_lock 511 times out of 512
(that is, every time it falls within a huge page that has just been
populated). So all OMP_NUM_THREADS threads are constantly bouncing on
the mm->page_table_lock and getting terrible performance as a result.

Thanks for the report. I don't have a patch just now, but this does
seem very solvable.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-05  7:43 ` Michel Lespinasse
@ 2013-01-31  0:31   ` Michel Lespinasse
  0 siblings, 0 replies; 7+ messages in thread

From: Michel Lespinasse @ 2013-01-31  0:31 UTC (permalink / raw)
To: Roman Dubtsov
Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton, Hugh Dickins

Hi Roman,

On Fri, Jan 4, 2013 at 11:43 PM, Michel Lespinasse <walken@google.com> wrote:
> On Fri, Jan 4, 2013 at 10:40 PM, Roman Dubtsov <dubtsov@gmail.com> wrote:
>> - POPULATE_VIA_LOOP -- makes the test populate memory using a loop
>> - POPULATE_VIA_MMAP -- makes the test populate memory via MAP_POPULATE
>
> Heh, very interesting. As it turns out, the problem gets MUCH worse as
> the number of threads increases.
>
> In the POPULATE_VIA_MMAP case, follow_page() is called at 4KB
> increments, and it takes the mm->page_table_lock 511 times out of 512
> (that is, every time it falls within a huge page that has just been
> populated). So all OMP_NUM_THREADS threads are constantly bouncing on
> the mm->page_table_lock and getting terrible performance as a result.

FYI, the patchset I just sent out ("fixes for large mm_populate() and
munlock() operations") takes care of this and brings POPULATE_VIA_MMAP
performance up to where it should be.

Thanks a lot for the report, as I hadn't noticed this issue before :)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads: [~2013-01-31  0:31 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2013-01-02 16:50 mmap() scalability in the presence of the MAP_POPULATE flag Roman Dubtsov
2013-01-03  0:09 ` Michel Lespinasse
2013-01-03 17:09   ` Roman Dubtsov
2013-01-04 11:57     ` Michel Lespinasse
2013-01-05  6:40       ` Roman Dubtsov
2013-01-05  7:43         ` Michel Lespinasse
2013-01-31  0:31           ` Michel Lespinasse