public inbox for linux-kernel@vger.kernel.org
* mmap() scalability in the presence of the MAP_POPULATE flag
@ 2013-01-02 16:50 Roman Dubtsov
  2013-01-03  0:09 ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread
From: Roman Dubtsov @ 2013-01-02 16:50 UTC (permalink / raw)
  To: linux-kernel

Concurrent mmap() calls from the same process are serialized by downing
mm->mmap_sem for write. This means that operations that do not alter
vmas, such as populating the pages, are also performed serially.
Anecdotal data from two machines I have access to shows that populating
the pages by touching them in a loop outside of mmap() improves the
performance of a synthetic micro-benchmark by ~40% in the worst case. A
crude patch that modifies vm_mmap_pgoff() to call make_pages_present()
outside of do_mmap_pgoff(), after upping the semaphore when MAP_POPULATE
is present, brings an identical performance improvement.

Is there any interest in fixing this, or are concurrent mmap() calls
from the same process too much of a corner case to worry about?

Regards,
Roma 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-02 16:50 mmap() scalability in the presence of the MAP_POPULATE flag Roman Dubtsov
@ 2013-01-03  0:09 ` Michel Lespinasse
  2013-01-03 17:09   ` Roman Dubtsov
  0 siblings, 1 reply; 7+ messages in thread
From: Michel Lespinasse @ 2013-01-03  0:09 UTC (permalink / raw)
  To: Roman Dubtsov
  Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton,
	Hugh Dickins

On Wed, Jan 2, 2013 at 8:50 AM, Roman Dubtsov <dubtsov@gmail.com> wrote:
> Concurrent mmap() calls from the same process are serialized by downing
> mm->mmap_sem for write. This means that operations that do not alter
> vmas, such as populating the pages, are also performed serially.
> Anecdotal data from two machines I have access to shows that populating
> the pages by touching them in a loop outside of mmap() improves the
> performance of a synthetic micro-benchmark by ~40% in the worst case. A
> crude patch that modifies vm_mmap_pgoff() to call make_pages_present()
> outside of do_mmap_pgoff(), after upping the semaphore when MAP_POPULATE
> is present, brings an identical performance improvement.
>
> Is there any interest in fixing this, or are concurrent mmap() calls
> from the same process too much of a corner case to worry about?

Funny this comes up again. I actually have a patch series that is
supposed to do that:
[PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held

However, the patches are still pending, didn't get much review
(probably not enough for Andrew to take them at this point), and I
think everyone forgot about them during the winter break.

Care to have a look at that thread and see if it works for you?

(caveat: you will possibly also need "[PATCH 10/9] mm: make
do_mmap_pgoff return populate as a size in bytes, not as a bool" to
make the series actually work for you)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-03  0:09 ` Michel Lespinasse
@ 2013-01-03 17:09   ` Roman Dubtsov
  2013-01-04 11:57     ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread
From: Roman Dubtsov @ 2013-01-03 17:09 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton,
	Hugh Dickins

On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > Is there any interest in fixing this, or are concurrent mmap() calls
> > from the same process too much of a corner case to worry about?
> 
> Funny this comes up again. I actually have a patch series that is
> supposed to do that:
> [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
> 
> However, the patches are still pending, didn't get much review
> (probably not enough for Andrew to take them at this point), and I
> think everyone forgot about them during the winter break.
> 
> Care to have a look at that thread and see if it works for you?
> 
> (caveat: you will possibly also need "[PATCH 10/9] mm: make
> do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> make the series actually work for you)

I applied the patches on top of 3.7.1. Here are the results for 4
threads concurrently mmap()-ing 10 64MB buffers in a loop, without
munmap()-s. The data is from a Nehalem i7-920 single-socket 4-core CPU.
I've also added the older data I have for 3.6.11 (patched and unpatched)
for reference.

3.6.11 vanilla, do not populate: 0.001 seconds
3.6.11 vanilla, populate via a loop: 0.216 seconds
3.6.11 vanilla, populate via MAP_POPULATE: 0.358 seconds 

3.6.11 + crude patch, do not populate: 0.002 seconds
3.6.11 + crude patch, populate via loop: 0.215 seconds
3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds

3.7.1 vanilla, do not populate: 0.001 seconds
3.7.1 vanilla, populate via a loop: 0.216 seconds
3.7.1 vanilla, populate via MAP_POPULATE: 0.411 seconds

3.7.1 + patch series, do not populate: 0.001 seconds
3.7.1 + patch series, populate via loop: 0.216 seconds
3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds

So, the patch series mentioned above does improve performance, but as
far as I can read the benchmarking data, there's still some performance
left on the table.

Regards,
Roma


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-03 17:09   ` Roman Dubtsov
@ 2013-01-04 11:57     ` Michel Lespinasse
  2013-01-05  6:40       ` Roman Dubtsov
  0 siblings, 1 reply; 7+ messages in thread
From: Michel Lespinasse @ 2013-01-04 11:57 UTC (permalink / raw)
  To: Roman Dubtsov
  Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton,
	Hugh Dickins

On Fri, Jan 04, 2013 at 12:09:37AM +0700, Roman Dubtsov wrote:
> On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > > Is there any interest in fixing this, or are concurrent mmap() calls
> > > from the same process too much of a corner case to worry about?
> > 
> > Funny this comes up again. I actually have a patch series that is
> > supposed to do that:
> > [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
> > 
> > However, the patches are still pending, didn't get much review
> > (probably not enough for Andrew to take them at this point), and I
> > think everyone forgot about them during the winter break.
> > 
> > Care to have a look at that thread and see if it works for you?
> > 
> > (caveat: you will possibly also need "[PATCH 10/9] mm: make
> > do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> > make the series actually work for you)
> 
> I applied the patches on top of 3.7.1. Here are the results for 4
> threads concurrently mmap()-ing 10 64MB buffers in a loop, without
> munmap()-s. The data is from a Nehalem i7-920 single-socket 4-core CPU.
> I've also added the older data I have for 3.6.11 (patched and unpatched)
> for reference.
> 
> 3.6.11 vanilla, do not populate: 0.001 seconds
> 3.6.11 vanilla, populate via a loop: 0.216 seconds
> 3.6.11 vanilla, populate via MAP_POPULATE: 0.358 seconds 
> 
> 3.6.11 + crude patch, do not populate: 0.002 seconds
> 3.6.11 + crude patch, populate via loop: 0.215 seconds
> 3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds
> 
> 3.7.1 vanilla, do not populate: 0.001 seconds
> 3.7.1 vanilla, populate via a loop: 0.216 seconds
> 3.7.1 vanilla, populate via MAP_POPULATE: 0.411 seconds
> 
> 3.7.1 + patch series, do not populate: 0.001 seconds
> 3.7.1 + patch series, populate via loop: 0.216 seconds
> 3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds
> 
> So, the patch series mentioned above does improve performance, but as
> far as I can read the benchmarking data, there's still some performance
> left on the table.

Interesting. I expect you are using anon memory, so it's likely that
mm_populate() holds the mmap_sem read side for the entire duration of
the 64MB populate.

Just curious, does the following help?

diff --git a/mm/memory.c b/mm/memory.c
index e4ab66b94bb8..f65a4b3b2141 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1627,6 +1627,12 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
 	       stack_guard_page_end(vma, addr+PAGE_SIZE);
 }
 
+/* not upstreamable as is, just for the sake of testing */
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	return (sem->count < 0);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -1854,6 +1860,11 @@ next_page:
 			i++;
 			start += PAGE_SIZE;
 			nr_pages--;
+			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
+				up_read(&mm->mmap_sem);
+				*nonblocking = 0;
+				return i;
+			}
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;

Linus didn't like rwsem_is_contended() when I implemented the mlock
side of this a couple years ago, but maybe we can change his mind now.

If this doesn't help, could you please send me your test case? I
think you described enough of it that I would be able to reproduce it
given some time, but it's just easier if you send me a short C file :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-04 11:57     ` Michel Lespinasse
@ 2013-01-05  6:40       ` Roman Dubtsov
  2013-01-05  7:43         ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread
From: Roman Dubtsov @ 2013-01-05  6:40 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 4223 bytes --]

On Fri, 2013-01-04 at 03:57 -0800, Michel Lespinasse wrote:
> On Fri, Jan 04, 2013 at 12:09:37AM +0700, Roman Dubtsov wrote:
> > On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > > > Is there any interest in fixing this, or are concurrent mmap() calls
> > > > from the same process too much of a corner case to worry about?
> > > 
> > > Funny this comes up again. I actually have a patch series that is
> > > supposed to do that:
> > > [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
> > > 
> > > However, the patches are still pending, didn't get much review
> > > (probably not enough for Andrew to take them at this point), and I
> > > think everyone forgot about them during the winter break.
> > > 
> > > Care to have a look at that thread and see if it works for you?
> > > 
> > > (caveat: you will possibly also need "[PATCH 10/9] mm: make
> > > do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> > > make the series actually work for you)
> > 
> > I applied the patches on top of 3.7.1. Here are the results for 4
> > threads concurrently mmap()-ing 10 64MB buffers in a loop, without
> > munmap()-s. The data is from a Nehalem i7-920 single-socket 4-core CPU.
> > I've also added the older data I have for 3.6.11 (patched and unpatched)
> > for reference.
> > 
> > 3.6.11 vanilla, do not populate: 0.001 seconds
> > 3.6.11 vanilla, populate via a loop: 0.216 seconds
> > 3.6.11 vanilla, populate via MAP_POPULATE: 0.358 seconds 
> > 
> > 3.6.11 + crude patch, do not populate: 0.002 seconds
> > 3.6.11 + crude patch, populate via loop: 0.215 seconds
> > 3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds
> > 
> > 3.7.1 vanilla, do not populate: 0.001 seconds
> > 3.7.1 vanilla, populate via a loop: 0.216 seconds
> > 3.7.1 vanilla, populate via MAP_POPULATE: 0.411 seconds
> > 
> > 3.7.1 + patch series, do not populate: 0.001 seconds
> > 3.7.1 + patch series, populate via loop: 0.216 seconds
> > 3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds
> > 
> > So, the patch series mentioned above does improve performance, but as
> > far as I can read the benchmarking data, there's still some performance
> > left on the table.
> 
> Interesting. I expect you are using anon memory, so it's likely that
> mm_populate() holds the mmap_sem read side for the entire duration of
> the 64MB populate.
> 
> Just curious, does the following help?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e4ab66b94bb8..f65a4b3b2141 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1627,6 +1627,12 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
>  	       stack_guard_page_end(vma, addr+PAGE_SIZE);
>  }
>  
> +/* not upstreamable as is, just for the sake of testing */
> +static inline int rwsem_is_contended(struct rw_semaphore *sem)
> +{
> +	return (sem->count < 0);
> +}
> +
>  /**
>   * __get_user_pages() - pin user pages in memory
>   * @tsk:	task_struct of target task
> @@ -1854,6 +1860,11 @@ next_page:
>  			i++;
>  			start += PAGE_SIZE;
>  			nr_pages--;
> +			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
> +				up_read(&mm->mmap_sem);
> +				*nonblocking = 0;
> +				return i;
> +			}
>  		} while (nr_pages && start < vma->vm_end);
>  	} while (nr_pages);
>  	return i;
> 
> Linus didn't like rwsem_is_contended() when I implemented the mlock
> side of this a couple years ago, but maybe we can change his mind now.
> 
> If this doesn't help, could you please send me your test case? I
> think you described enough of it that I would be able to reproduce it
> given some time, but it's just easier if you send me a short C file :)
> 

It does not; the results are more or less the same. I've attached my
test case. It does map anonymous memory. It also uses OpenMP for
threading because I'm lazy, so it requires passing -fopenmp to gcc, and
the number of threads it runs is defined via the OMP_NUM_THREADS
environment variable. There are also two macros that influence the
test's behavior:

- POPULATE_VIA_LOOP -- makes the test populate memory using a loop
- POPULATE_VIA_MMAP -- makes the test populate memory via MAP_POPULATE

If neither macro is defined, the test does not populate memory.


[-- Attachment #2: t.c --]
[-- Type: text/x-csrc, Size: 792 bytes --]

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#include <omp.h>

#ifndef BUF_SIZE
#define BUF_SIZE (64 * 1024 * 1024)
#endif

#ifndef PAGE_SIZE
#define PAGE_SIZE (4 * 1024)
#endif

#ifdef POPULATE_VIA_MMAP
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE)
#else
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE)
#endif

int main(void)
{
	/* Warm up the OpenMP thread pool so thread creation is not timed. */
#pragma omp parallel
	{
	}

	double t0 = omp_get_wtime();
#pragma omp parallel
	{
		int i;
		for (i = 0; i < 10; i++) {
			char *p = mmap(NULL, BUF_SIZE,
				PROT_READ | PROT_WRITE, MMAP_FLAGS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				continue;
			}
#ifdef POPULATE_VIA_LOOP
			/* Touch one byte per page to fault the pages in. */
			size_t j;
			for (j = 0; j < BUF_SIZE; j += PAGE_SIZE)
				p[j] = 0;
#endif
		}
	}
	double t1 = omp_get_wtime();
	printf("%d\t%f\n", omp_get_max_threads(), t1 - t0);

	return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-05  6:40       ` Roman Dubtsov
@ 2013-01-05  7:43         ` Michel Lespinasse
  2013-01-31  0:31           ` Michel Lespinasse
  0 siblings, 1 reply; 7+ messages in thread
From: Michel Lespinasse @ 2013-01-05  7:43 UTC (permalink / raw)
  To: Roman Dubtsov
  Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton,
	Hugh Dickins

On Fri, Jan 4, 2013 at 10:40 PM, Roman Dubtsov <dubtsov@gmail.com> wrote:
> On Fri, 2013-01-04 at 03:57 -0800, Michel Lespinasse wrote:
>> If this doesn't help, could you please send me your test case? I
>> think you described enough of it that I would be able to reproduce it
>> given some time, but it's just easier if you send me a short C file :)
>
> It does not; the results are more or less the same. I've attached my
> test case. It does map anonymous memory. It also uses OpenMP for
> threading because I'm lazy, so it requires passing -fopenmp to gcc, and
> the number of threads it runs is defined via the OMP_NUM_THREADS
> environment variable. There are also two macros that influence the
> test's behavior:
>
> - POPULATE_VIA_LOOP -- makes the test populate memory using a loop
> - POPULATE_VIA_MMAP -- makes the test populate memory via MAP_POPULATE
>
> If neither macro is defined, the test does not populate memory.

Heh, very interesting. As it turns out, the problem gets MUCH worse as
the number of threads increases.

We are populating the anon mapping with huge pages. In the
POPULATE_VIA_LOOP case, we are just taking a page fault every 2MB and
filling it up with a zeroed huge page - most of the runtime comes from
clearing the huge page.

In the POPULATE_VIA_MMAP case, follow_page() is called at 4KB
increments, and it takes the mm->page_table_lock 511 times out of 512
(that is, every time it falls within a huge page that has just been
populated). So all OMP_NUM_THREADS threads are constantly bouncing on
the mm->page_table_lock, and getting terrible performance as a result.

Thanks for the report. I don't have a patch just now, but this does
seem very solvable.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: mmap() scalability in the presence of the MAP_POPULATE flag
  2013-01-05  7:43         ` Michel Lespinasse
@ 2013-01-31  0:31           ` Michel Lespinasse
  0 siblings, 0 replies; 7+ messages in thread
From: Michel Lespinasse @ 2013-01-31  0:31 UTC (permalink / raw)
  To: Roman Dubtsov
  Cc: linux-kernel, Andy Lutomirski, Rik van Riel, Andrew Morton,
	Hugh Dickins

Hi Roman,

On Fri, Jan 4, 2013 at 11:43 PM, Michel Lespinasse <walken@google.com> wrote:
> On Fri, Jan 4, 2013 at 10:40 PM, Roman Dubtsov <dubtsov@gmail.com> wrote:
>> - POPULATE_VIA_LOOP -- makes the test populate memory using a loop
>> - POPULATE_VIA_MMAP -- makes the test populate memory via MAP_POPULATE
>>
> Heh, very interesting. As it turns out, the problem gets MUCH worse as
> the number of threads increases.
>
> In the POPULATE_VIA_MMAP case, follow_page() is called at 4KB
> increments, and it takes the mm->page_table_lock 511 times out of 512
> (that is, every time it falls within a huge page that has just been
> populated). So all OMP_NUM_THREADS threads are constantly bouncing on
> the mm->page_table_lock, and getting terrible performance as a result.

FYI, the patchset I just sent out ("fixes for large mm_populate() and
munlock() operations") takes care of this and brings POPULATE_VIA_MMAP
performance up where it should be.

Thanks a lot for the report, as I hadn't noticed this issue before :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-01-31  0:31 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-02 16:50 mmap() scalability in the presence of the MAP_POPULATE flag Roman Dubtsov
2013-01-03  0:09 ` Michel Lespinasse
2013-01-03 17:09   ` Roman Dubtsov
2013-01-04 11:57     ` Michel Lespinasse
2013-01-05  6:40       ` Roman Dubtsov
2013-01-05  7:43         ` Michel Lespinasse
2013-01-31  0:31           ` Michel Lespinasse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox