* Re[2]: your mail on mmap() to the kernel list
[not found] <3C082244.8587.80EF082@localhost>
@ 2001-12-01 8:31 ` Peter Zaitsev
2001-12-01 9:37 ` Andrew Morton
0 siblings, 1 reply; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-01 8:31 UTC (permalink / raw)
To: theowl; +Cc: theowl, linux-kernel
Hello theowl,
Saturday, December 01, 2001, 2:20:20 AM, you wrote:
Well, thank you. I'd already found this - in recent kernels it is
even tunable via procfs.
The only question is why anonymous mapping is rather fast (I get
1000 req/sec even with 300,000 blocks mapped), whereas with an
fd-backed mapping the performance drops to 20 req/sec at that number.
Any idea why this happens, or do any patches exist which increase the
speed?
tfch> in include/linux/sched.h:
tfch> /* Maximum number of active map areas.. This is a random (large) number */
tfch> #define MAX_MAP_COUNT (65536)
tfch> this should probably be (TASK_SIZE / PAGE_SIZE) on 32 bit architectures
tfch> and something 'reasonably big' on 64 bit (too big of a value would
tfch> allow for a nice DoS against the kernel).
--
Best regards,
Peter mailto:pz@spylog.ru
* Re: your mail on mmap() to the kernel list
2001-12-01 8:31 ` Re[2]: your mail on mmap() to the kernel list Peter Zaitsev
@ 2001-12-01 9:37 ` Andrew Morton
2001-12-02 23:07 ` Re[2]: " Peter Zaitsev
2001-12-04 14:15 ` Andrea Arcangeli
0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2001-12-01 9:37 UTC (permalink / raw)
To: Peter Zaitsev
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli
Peter Zaitsev wrote:
>
> Hello theowl,
>
> Saturday, December 01, 2001, 2:20:20 AM, you wrote:
>
> Well, thank you. I'd already found this - in recent kernels it is
> even tunable via procfs.
>
> The only question is why anonymous mapping is rather fast (I get
> 1000 req/sec even with 300,000 blocks mapped), whereas with an
> fd-backed mapping the performance drops to 20 req/sec at that number.
>
well a kernel profile is pretty unambiguous:
c0116af4 mm_init 1 0.0050
c0117394 do_fork 1 0.0005
c0124ccc clear_page_tables 1 0.0064
c0125af0 do_wp_page 1 0.0026
c01260a0 do_no_page 1 0.0033
c01265dc find_vma_prepare 1 0.0081
c0129138 file_read_actor 1 0.0093
c012d95c kmem_cache_alloc 1 0.0035
c0147890 d_lookup 1 0.0036
c01573dc write_profile 1 0.0061
c0169d44 journal_add_journal_head 1 0.0035
c0106e88 system_call 2 0.0357
c01264bc unlock_vma_mappings 2 0.0500
c0135bb4 fget 2 0.0227
c028982c __generic_copy_from_user 2 0.0192
c01267ec do_mmap_pgoff 4 0.0035
c0126d6c find_vma 7 0.0761
c0105000 _stext 16 0.1667
c0126c70 get_unmapped_area 4991 19.8056
c0105278 poll_idle 4997 124.9250
00000000 total 10034 0.0062
The `poll_idle' count is from the other CPU.
What appears to be happening is that the VMA tree has degenerated
into a monstrous singly linked list. All those little 4k mappings
are individual data structures, chained one after the other.
The reason you don't see it with an anonymous map is, I think, that
the kernel will merge contiguous anon mappings into a single one,
so there's basically nothing to be searched each time you request some
more memory.
Running the same test on the -ac kernel is 4-5% slower, so Andrea's
new rbtree implementation makes a better linked list than the old
AVL-tree's one :)
It strikes me that this is not a completely stupid usage pattern, and
that perhaps it's worth thinking about some tweaks to cope with it.
I really don't know, but I've Cc'ed some people who do.
Here's the test app:
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

int main(void)
{
	int i = 0;
	void *p;
	time_t t;
	int fd;

	fd = open("test.dat", O_RDWR);
	if (fd < 0) {
		puts("Unable to open file!");
		return 1;
	}
	t = time(NULL);
	while (1) {
		p = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		i++;
		if (i % 10000 == 0) {
			printf(" %d Time: %ld\n", i, (long)(time(NULL) - t));
			t = time(NULL);
		}
	}
}
* Re[2]: your mail on mmap() to the kernel list
2001-12-01 9:37 ` Andrew Morton
@ 2001-12-02 23:07 ` Peter Zaitsev
2001-12-02 23:34 ` Andrew Morton
2001-12-04 14:15 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-02 23:07 UTC (permalink / raw)
To: Andrew Morton
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli
Hello Andrew,
Saturday, December 01, 2001, 12:37:01 PM, you wrote:
>>
>> The only question is why anonymous mapping is rather fast (I get
>> 1000 req/sec even with 300,000 blocks mapped), whereas with an
>> fd-backed mapping the performance drops to 20 req/sec at that number.
>>
AM> well a kernel profile is pretty unambiguous:
AM> c0116af4 mm_init 1 0.0050
AM> c0117394 do_fork 1 0.0005
AM> c0124ccc clear_page_tables 1 0.0064
AM> c0125af0 do_wp_page 1 0.0026
AM> c01260a0 do_no_page 1 0.0033
AM> c01265dc find_vma_prepare 1 0.0081
AM> c0129138 file_read_actor 1 0.0093
AM> c012d95c kmem_cache_alloc 1 0.0035
AM> c0147890 d_lookup 1 0.0036
AM> c01573dc write_profile 1 0.0061
AM> c0169d44 journal_add_journal_head 1 0.0035
AM> c0106e88 system_call 2 0.0357
AM> c01264bc unlock_vma_mappings 2 0.0500
AM> c0135bb4 fget 2 0.0227
AM> c028982c __generic_copy_from_user 2 0.0192
AM> c01267ec do_mmap_pgoff 4 0.0035
AM> c0126d6c find_vma 7 0.0761
AM> c0105000 _stext 16 0.1667
AM> c0126c70 get_unmapped_area 4991 19.8056
AM> c0105278 poll_idle 4997 124.9250
AM> 00000000 total 10034 0.0062
AM> The `poll_idle' count is from the other CPU.
AM> What appears to be happening is that the VMA tree has degenerated
AM> into a monstrous singly linked list. All those little 4k mappings
AM> are individual data structures, chained one after the other.
Well, the thing is, I modified the application a bit so that it
randomly asks for 1 to 64 pages, and the execution profile still looks
the same. So this happens not only when mapping equal sizes but also
affects the initial mapping of many chunks.
The next test was also simple - I started to deallocate the
first-mmaped chunks in the same order, holding only the last 40,000
mapped chunks:
31000 Time: 5
32000 Time: 4
33000 Time: 5
34000 Time: 5
35000 Time: 5
36000 Time: 6
37000 Time: 5
38000 Time: 6
39000 Time: 5
40000 Time: 6
41000 Time: 0
42000 Time: 0
43000 Time: 1
44000 Time: 0
45000 Time: 0
46000 Time: 1
47000 Time: 1
48000 Time: 1
49000 Time: 1
50000 Time: 1
As you see, once I start to free pages, free slots can be found rather
fast.
Next I made it hold only 20,000 mappings, on loop iterations 20,000
to 40,000, and then stop freeing and continue allocating:
2000 Time: 0
4000 Time: 0
6000 Time: 1
8000 Time: 2
10000 Time: 2
12000 Time: 4
14000 Time: 4
16000 Time: 4
18000 Time: 5
20000 Time: 6
22000 Time: 0
24000 Time: 1
26000 Time: 1
28000 Time: 1
30000 Time: 3
32000 Time: 3
34000 Time: 4
36000 Time: 5
38000 Time: 5
40000 Time: 6
42000 Time: 6
44000 Time: 7
46000 Time: 8
Quite surprising: as you see, the speed increases on the whole but
degrades quite fast, even though the number of mapped pages stays the
same over the interval 20,000 - 40,000.
And now back to the previous test, this time with 20,000 mapped
pages. As you see, there is a cycle with a period of 20,000 - with
this period the pages with low addresses are freed, so it looks
exactly as if the address space is scanned from low addresses to high,
looking for the first place where the page fits:
5000 Time: 1
10000 Time: 4
15000 Time: 10
20000 Time: 13
25000 Time: 1
30000 Time: 5
35000 Time: 9
40000 Time: 14
45000 Time: 1
50000 Time: 5
55000 Time: 9
60000 Time: 13
65000 Time: 1
70000 Time: 5
75000 Time: 10
80000 Time: 13
85000 Time: 1
90000 Time: 5
95000 Time: 10
100000 Time: 13
105000 Time: 1
110000 Time: 5
115000 Time: 9
120000 Time: 14
AM> The reason you don't see it with an anonymous map is, I think, that
AM> the kernel will merge contiguous anon mappings into a single one,
AM> so there's basically nothing to be searched each time you request some
AM> more memory.
AM> Running the same test on the -ac kernel is 4-5% slower, so Andrea's
AM> new rbtree implementation makes a better linked list than the old
AM> AVL-tree's one :)
AM> It strikes me that this is not a completely stupid usage pattern, and
AM> that perhaps it's worth thinking about some tweaks to cope with it.
AM> I really don't know, but I've Cc'ed some people who do.
Hope so :)
Also, as you see, the other patterns show rapid performance
degradation as the number of mapped pages increases. I can also test
random allocation and freeing, but something tells me the result will
be the same.
Hope to hear some info from the gurus :)
Please write me if any patches become available - I will be happy to
test them.
The last test program (if you're interested):
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>

static void *arr[1000000];	/* static: too large for the stack */
static int sz[1000000];

int main(void)
{
	int i = 0;
	void *p;
	time_t t;
	int fd;
	int size;

	t = time(NULL);
	while (1) {
		fd = open("/spylog/1/test.dat", O_RDWR);
		if (fd < 0) {
			puts("Unable to open file!");
			return 1;
		}
		/* size = (((double)random() / RAND_MAX * 16) + 1) * 4096; */
		size = 4096;
		p = mmap((void *)0x60000000, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			printf("Failed %d\n", errno);
			return 1;
		}
		close(fd);	/* the mapping holds its own reference */
		arr[i] = p;
		sz[i] = size;
		if (i > 20000)
			munmap(arr[i - 20000], sz[i - 20000]);
		i++;
		if (i % 5000 == 0) {
			printf(" %d Time: %ld\n", i, (long)(time(NULL) - t));
			t = time(NULL);
		}
	}
}
--
Best regards,
Peter mailto:pz@spylog.ru
* Re: your mail on mmap() to the kernel list
2001-12-02 23:07 ` Re[2]: " Peter Zaitsev
@ 2001-12-02 23:34 ` Andrew Morton
2001-12-03 10:10 ` Re[2]: " Peter Zaitsev
2001-12-04 14:18 ` Andrea Arcangeli
0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2001-12-02 23:34 UTC (permalink / raw)
To: Peter Zaitsev
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli
Peter Zaitsev wrote:
>
> ...
It's very simple. The kernel has a linked list of vma's for your
process. It is kept in address order. Each time you request a
new mapping, the kernel walks down that list looking for an address
at which to place the new mapping. This data structure is set up
for efficient find-by-address, not for efficient find-a-gap.
Question is: do we need to optimise for this case?
If it's just a single file, then you'd be better off mapping the
whole thing. If you need to map lots and lots of files, then
you'll hit the maximum-open-files limit fairly early and the mmap()
performance will not be great, but maybe acceptable.
One scenario where this could be a problem is for a file
which is too large to be mapped in its entirety, but the
application needs access to lots of bits of it at the same
time. There doesn't seem to be much alternative to setting
up a distinct mapping for each access window in this case.
> Also As you see other patterns also show fast performance degradation
> over increasing number of pages. I can also test random allocation and
> freeing but something tells me the result will be the same.
Is this proving to be a problem in a real-world application?
What are you trying to do here?
-
* Re[2]: your mail on mmap() to the kernel list
2001-12-02 23:34 ` Andrew Morton
@ 2001-12-03 10:10 ` Peter Zaitsev
2001-12-04 14:18 ` Andrea Arcangeli
1 sibling, 0 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-03 10:10 UTC (permalink / raw)
To: Andrew Morton
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli
Hello Andrew,
Monday, December 03, 2001, 2:34:54 AM, you wrote:
AM> Peter Zaitsev wrote:
>>
>> ...
AM> It's very simple. The kernel has a linked list of vma's for your
AM> process. It is kept in address order. Each time you request a
AM> new mapping, the kernel walks down that list looking for an address
AM> at which to place the new mapping. This data structure is set up
AM> for efficient find-by-address, not for efficient find-a-gap.
Yes, I see. I've tried some other load patterns, like random
allocation/deallocation, but I still see the speed degrade as the
number of mapped blocks increases :(
The other thing is, I don't understand the anonymous-mapping case.
I've now tried a simple thing - much the same program, but every
second mapping was anonymous, to prevent merging of the calls - and I
still see the speed gain:
1) Non-anonymous
5000 Time: 1 Mapped: 4999 Unmapped: 1661
10000 Time: 2 Mapped: 9999 Unmapped: 3311
15000 Time: 6 Mapped: 14999 Unmapped: 4957
20000 Time: 9 Mapped: 19999 Unmapped: 6620
25000 Time: 12 Mapped: 24999 Unmapped: 8290
30000 Time: 15 Mapped: 29999 Unmapped: 9975
35000 Time: 17 Mapped: 34999 Unmapped: 11643
40000 Time: 20 Mapped: 39999 Unmapped: 13287
45000 Time: 23 Mapped: 44999 Unmapped: 14953
50000 Time: 26 Mapped: 49999 Unmapped: 16618
55000 Time: 28 Mapped: 54999 Unmapped: 18311
60000 Time: 31 Mapped: 59999 Unmapped: 20013
2) Every second is anonymous
5000 Time: 1 Mapped: 4999 Unmapped: 1661
10000 Time: 1 Mapped: 9999 Unmapped: 3311
15000 Time: 3 Mapped: 14999 Unmapped: 4957
20000 Time: 6 Mapped: 19999 Unmapped: 6620
25000 Time: 8 Mapped: 24999 Unmapped: 8290
30000 Time: 9 Mapped: 29999 Unmapped: 9975
35000 Time: 12 Mapped: 34999 Unmapped: 11643
40000 Time: 13 Mapped: 39999 Unmapped: 13287
45000 Time: 15 Mapped: 44999 Unmapped: 14953
50000 Time: 18 Mapped: 49999 Unmapped: 16618
55000 Time: 18 Mapped: 54999 Unmapped: 18311
60000 Time: 21 Mapped: 59999 Unmapped: 20013
My question is: why? Is anonymous mapping handled in a separate way so
that it is always able to merge? This is important for me, as I can
stand slowed-down mmaping of files but really do not want memory
allocation to slow down (mmap is used for memory allocation in a
multi-threaded program).
AM> Question is: do we need to optimise for this case?
I hope you will. Or at least that some patches for this will exist :)
AM> If it's just a single file, then you'd be better off just mapping the
AM> whole thing. If you need to map lots and lots of files then
AM> you'll hit the maximum file limit fairly early and the mmap()
AM> performance will be not great, but maybe acceptable.
Well, I need many small files. These files contain different clients'
data, so they should not be merged.
I will not hit the open-file limit - I recently tested 500,000 open
files, and it seems to work with no real speed penalty.
So this is really the strange case: I'm able to open 500,000+ files
efficiently but can't effectively map more than 20,000 of them :(
AM> One scenario where this could be a problem is for a file
AM> which is too large to be mapped in its entirety, but the
AM> application needs access to lots of bits of it at the same
AM> time. There doesn't seem to be much alternative to setting
AM> up a distinct mapping for each access window in this case.
Yes. I also found that migrating to a small number of big files will
not help much - it just produces some other problems :)
>> Also As you see other patterns also show fast performance degradation
>> over increasing number of pages. I can also test random allocation and
>> freeing but something tells me the result will be the same.
AM> Is this proving to be a problem in a real-world application?
AM> What are you trying to do here?
I had two problems which I tried to solve by mmaping a lot of files.
My application works with a lot of data - 2G+ - and not all of it is
accessed often, so some of it can be swapped out without any problem
for the application. But I ran into two problems:
1) User address space on Intel is 3.5G (with Andrea's patches). I
can't afford to migrate to a 64-bit environment, as there are a lot of
machines running the application. So I've set up a cache of mapped
pieces. As some of them are accessed rather rarely, this works only
with a relatively small number of mapped chunks.
2) The second problem is more complicated. My program can work
effectively with only 20% of the data present in memory, but I must
save things to disk periodically. With read/write this leads to a lot
of swapping, as a page must be swapped in before it can be written to
disk. What's more, only a small percentage of pages really change and
so need to be saved; this is hard to track by hand, whereas mmap
should theoretically handle it automatically.
--
Best regards,
Peter mailto:pz@spylog.ru
* Re: your mail on mmap() to the kernel list
2001-12-01 9:37 ` Andrew Morton
2001-12-02 23:07 ` Re[2]: " Peter Zaitsev
@ 2001-12-04 14:15 ` Andrea Arcangeli
2001-12-04 15:36 ` Re[2]: " Peter Zaitsev
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-04 14:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zaitsev, theowl, theowl, linux-kernel, Linus Torvalds
On Sat, Dec 01, 2001 at 01:37:01AM -0800, Andrew Morton wrote:
> Peter Zaitsev wrote:
> >
> > Hello theowl,
> >
> > Saturday, December 01, 2001, 2:20:20 AM, you wrote:
> >
> > Well, thank you. I'd already found this - in recent kernels it is
> > even tunable via procfs.
> >
> > The only question is why anonymous mapping is rather fast (I get
> > 1000 req/sec even with 300,000 blocks mapped), whereas with an
> > fd-backed mapping the performance drops to 20 req/sec at that number.
> >
>
> well a kernel profile is pretty unambiguous:
>
> c0116af4 mm_init 1 0.0050
> c0117394 do_fork 1 0.0005
> c0124ccc clear_page_tables 1 0.0064
> c0125af0 do_wp_page 1 0.0026
> c01260a0 do_no_page 1 0.0033
> c01265dc find_vma_prepare 1 0.0081
> c0129138 file_read_actor 1 0.0093
> c012d95c kmem_cache_alloc 1 0.0035
> c0147890 d_lookup 1 0.0036
> c01573dc write_profile 1 0.0061
> c0169d44 journal_add_journal_head 1 0.0035
> c0106e88 system_call 2 0.0357
> c01264bc unlock_vma_mappings 2 0.0500
> c0135bb4 fget 2 0.0227
> c028982c __generic_copy_from_user 2 0.0192
> c01267ec do_mmap_pgoff 4 0.0035
> c0126d6c find_vma 7 0.0761
> c0105000 _stext 16 0.1667
> c0126c70 get_unmapped_area 4991 19.8056
> c0105278 poll_idle 4997 124.9250
> 00000000 total 10034 0.0062
the vma lookup overhead is nothing compared to the walk of the linked
list (not of the tree) to find the first available slot above
TASK_UNMAPPED_BASE. In the vma lookup engine rewrite I only cared about
find_vma, not about optimizing the search for a free area above
TASK_UNMAPPED_BASE. Such a list-loop is really evil.
> What appears to be happening is that the VMA tree has degenerated
> into a monstrous singly linked list. All those little 4k mappings
actually it's not that the tree degenerated into a list. It's that we
need to walk all the vmas to check whether there is a large enough
hole to place the new dynamic mapping, and for that walk we use the
list, not the tree, because it needs less memory and is simpler and
faster.
You can fix the problem in userspace by passing a meaningful 'addr' as
a hint to mmap(2), or by using MAP_FIXED; then the kernel won't waste
time searching for the first available slot above TASK_UNMAPPED_BASE.
> The reason you don't see it with an anonymous map is, I think, that
> the kernel will merge contiguous anon mappings into a single one,
Yes. But actually, merging file-backed vmas as well would be a
goodness independently of the arch_get_unmapped_area loop. It's not
possible right now because the anonymous memory case is much simpler:
no i_shared_lock etc. But I'd like the file-backed vmas to become
mergeable too in the long run. A side effect of that goodness is that
the loop would also run faster, of course. But technically, to really
kill the evil complexity of the loop (for example, when every vma
belongs to a different file, so none of them can be merged anyway),
we'd need a tree of "holes" indexed by the size of each hole. That
seems a very dubious optimization, though... Complexity-wise it makes
sense, but in practice the common case should matter more, I guess,
and of course userspace can just fix this without any kernel
modification by passing a helpful "addr" to the mmap syscall, with
very, very little effort. Untested patch follows:
@@ -68,7 +9,7 @@
 int main(void)
 {
 	int i = 0;
-	void *p;
+	void *p = NULL;
 	time_t t;
 	int fd;
@@ -79,7 +20,7 @@
 	}
 	t = time(NULL);
 	while (1) {
-		p = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+		p = mmap(p, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
 		if (p == MAP_FAILED) {
 			perror("mmap");
 			return 1;
Andrea
* Re: your mail on mmap() to the kernel list
2001-12-02 23:34 ` Andrew Morton
2001-12-03 10:10 ` Re[2]: " Peter Zaitsev
@ 2001-12-04 14:18 ` Andrea Arcangeli
1 sibling, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-04 14:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zaitsev, theowl, theowl, linux-kernel, Linus Torvalds
On Sun, Dec 02, 2001 at 03:34:54PM -0800, Andrew Morton wrote:
> Peter Zaitsev wrote:
> >
> > ...
>
> It's very simple. The kernel has a linked list of vma's for your
> process. It is kept in address order. Each time you request a
> new mapping, the kernel walks down that list looking for an address
> at which to place the new mapping. This data structure is set up
> for efficient find-by-address, not for efficient find-a-gap.
exactly.
Andrea
* Re[2]: your mail on mmap() to the kernel list
2001-12-04 14:15 ` Andrea Arcangeli
@ 2001-12-04 15:36 ` Peter Zaitsev
2001-12-04 16:42 ` Rik van Riel
2001-12-04 16:48 ` Andrea Arcangeli
0 siblings, 2 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-04 15:36 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds
Hello Andrea,
Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
AA> the vma lookup overhead is nothing compared to the walk of the linked
AA> list (not of the tree) to find the first available slot above
AA> TASK_UNMAPPED_BASE. In the vma lookup engine rewrite I only cared about
AA> find_vma, not about optimizing the search for a free area above
AA> TASK_UNMAPPED_BASE. Such a list-loop is really evil.
Sure.
>> What appears to be happening is that the VMA tree has degenerated
>> into a monstrous singly linked list. All those little 4k mappings
AA> actually it's not that the tree degenerated into a list. It's that we
AA> need to walk all the vmas to check whether there is a large enough
AA> hole to place the new dynamic mapping, and for that walk we use the
AA> list, not the tree, because it needs less memory and is simpler and faster.
AA> You can fix the problem in userspace by passing a meaningful 'addr' as
AA> a hint to mmap(2), or by using MAP_FIXED; then the kernel won't waste
AA> time searching for the first available slot above TASK_UNMAPPED_BASE.
Well, you can't really do this, because you cannot really track all of
the mappings in a user program, as glibc and probably other libraries
use mmap for their own purposes. This may work only at program start,
when you can be almost sure that only low addresses are in use yet.
In my case I'm implementing a cache of mmaped chunks, so there are
always some mmaps/munmaps going on. Also, I can't use "random"
addresses with a high probability of success, as my application mmaps
about 50% of the user address space (meaning 3.5G with your patches).
>> The reason you don't see it with an anonymous map is, I think, that
>> the kernel will merge contiguous anon mappings into a single one,
AA> Yes. But actually, merging file-backed vmas as well would be a
AA> goodness independently of the arch_get_unmapped_area loop. It's not
AA> possible right now because the anonymous memory case is much simpler:
AA> no i_shared_lock etc. But I'd like the file-backed vmas to become
AA> mergeable too in the long run. A side effect of that goodness is that
AA> the loop would also run faster, of course.
Do you think there is a big chance that two adjacent mappings will
belong to different parts of one file? I think this should be true
only for some specific workloads.
AA> But technically, to really kill
AA> the evil complexity of the loop (for example, when every vma belongs
AA> to a different file, so none of them can be merged anyway), we'd need
AA> a tree of "holes" indexed by the size of each hole. That seems a very
AA> dubious optimization, though...
Are there many problems with this approach? Much memory usage or CPU
overhead somewhere?
AA> Complexity-wise it makes sense, but in practice
AA> the common case should matter more, I guess, and of course userspace
AA> can just fix this without any kernel modification by passing a helpful
AA> "addr" to the mmap syscall, with very, very little effort. Untested
AA> patch follows:
Could you please explain a bit how the hint works? Does it try only
the specified address, or does it also look for free space starting
from that point?
--
Best regards,
Peter mailto:pz@spylog.ru
* Re[2]: your mail on mmap() to the kernel list
2001-12-04 15:36 ` Re[2]: " Peter Zaitsev
@ 2001-12-04 16:42 ` Rik van Riel
2001-12-04 16:55 ` Andrea Arcangeli
2001-12-05 14:36 ` Re[3]: " Peter Zaitsev
2001-12-04 16:48 ` Andrea Arcangeli
1 sibling, 2 replies; 15+ messages in thread
From: Rik van Riel @ 2001-12-04 16:42 UTC (permalink / raw)
To: Peter Zaitsev
Cc: Andrea Arcangeli, Andrew Morton, theowl, theowl, linux-kernel,
Linus Torvalds
On Tue, 4 Dec 2001, Peter Zaitsev wrote:
> Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
> AA> You can fix the problem in userspace by passing a meaningful 'addr' as
> AA> a hint to mmap(2), or by using MAP_FIXED; then the kernel won't
> AA> waste time searching for the first available slot above
> AA> TASK_UNMAPPED_BASE.
> Well, you can't really do this, because you cannot really track all of
> the mappings in a user program, as glibc and probably other libraries
> use mmap for their own purposes.
There's no reason we couldn't do this hint in kernel space.
In arch_get_unmapped_area we can simply keep track of the
lowest address where we found free space, while on munmap()
we can adjust this hint if needed.
OTOH, I doubt it would help real-world workloads where the
application maps and unmaps areas of different sizes and
actually does something with the memory instead of just
mapping and unmapping it ;)))
kind regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
* Re: your mail on mmap() to the kernel list
2001-12-04 15:36 ` Re[2]: " Peter Zaitsev
2001-12-04 16:42 ` Rik van Riel
@ 2001-12-04 16:48 ` Andrea Arcangeli
2001-12-05 16:12 ` Re[2]: " Peter Zaitsev
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-04 16:48 UTC (permalink / raw)
To: Peter Zaitsev; +Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds
On Tue, Dec 04, 2001 at 06:36:24PM +0300, Peter Zaitsev wrote:
> Hello Andrea,
>
> Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
>
>
> AA> the vma lookup overhead is nothing compared to the walk of the linked
> AA> list (not of the tree) to find the first available slot above
> AA> TASK_UNMAPPED_BASE. In the vma lookup engine rewrite I only cared about
> AA> find_vma, not about optimizing the search for a free area above
> AA> TASK_UNMAPPED_BASE. Such a list-loop is really evil.
> Sure.
>
> >> What appears to be happening is that the VMA tree has degenerated
> >> into a monstrous singly linked list. All those little 4k mappings
>
> AA> actually it's not that the tree degenerated into a list. It's that we
> AA> need to walk all the vmas to check whether there is a large enough
> AA> hole to place the new dynamic mapping, and for that walk we use the
> AA> list, not the tree, because it needs less memory and is simpler and faster.
>
> AA> You can fix the problem in userspace by passing a meaningful 'addr' as
> AA> a hint to mmap(2), or by using MAP_FIXED; then the kernel won't waste
> AA> time searching for the first available slot above TASK_UNMAPPED_BASE.
> Well, you can't really do this, because you cannot really track all of
You can do this sometimes - I did it in the patch I posted, for
example. If you want transparency and genericity you can still do it
in userspace, with an LD_LIBRARY_PATH that loads a libc which builds
the hole-lookup data structure (a tree?) internally and then passes
the hint to the kernel whenever the program suggests 0.
OTOH I prefer that kind of algorithm (if needed) to be in the kernel,
so it's faster/cleaner; also, a lib would need some magic knowledge of
kernel-internal virtual addresses to do it right.
> >> The reason you don't see it with an anonymous map is, I think, that
> >> the kernel will merge contiguous anon mappings into a single one,
>
> AA> Yes. But actually, merging file-backed vmas as well would be a
> AA> goodness independently of the arch_get_unmapped_area loop. It's not
> AA> possible right now because the anonymous memory case is much simpler:
> AA> no i_shared_lock etc. But I'd like the file-backed vmas to become
> AA> mergeable too in the long run. A side effect of that goodness is that
> AA> the loop would also run faster, of course.
> Do you think there is a big chance that two adjacent mappings will
> belong to different parts of one file? I think this should be true
> only for some specific workloads.
> AA> But technically, to really kill
> AA> the evil complexity of the loop (for example, when every vma belongs
> AA> to a different file, so none of them can be merged anyway), we'd need
> AA> a tree of "holes" indexed by the size of each hole. That seems a very
> AA> dubious optimization, though...
> Are there many problems with this approach? Much memory usage or CPU
> overhead somewhere?
> AA> Complexity-wise it makes sense, but in practice
> AA> the common case should matter more, I guess, and of course userspace
> AA> can just fix this without any kernel modification by passing a helpful
> AA> "addr" to the mmap syscall, with very, very little effort. Untested
> AA> patch follows:
>
> Could you please explain a bit how the hint works? Does it try
man 2 mmap, then search for 'start'
> only the specified address, or does it also look for free space
> starting from that point?
in short, it checks whether there are 'len' bytes of free space at
address 'start'; if there are, it maps there, without having to search
for the first large enough hole in the heap.
Andrea
* Re: your mail on mmap() to the kernel list
2001-12-04 16:42 ` Rik van Riel
@ 2001-12-04 16:55 ` Andrea Arcangeli
2001-12-05 14:38 ` Re[2]: " Peter Zaitsev
2001-12-05 14:36 ` Re[3]: " Peter Zaitsev
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-04 16:55 UTC (permalink / raw)
To: Rik van Riel
Cc: Peter Zaitsev, Andrew Morton, theowl, theowl, linux-kernel,
Linus Torvalds
On Tue, Dec 04, 2001 at 02:42:28PM -0200, Rik van Riel wrote:
> On Tue, 4 Dec 2001, Peter Zaitsev wrote:
> > Tuesday, December 04, 2001, 5:15:49 PM, you wrote:
>
> > AA> You can fix the problem in userspace by passing a meaningful 'addr' as
> > AA> a hint to mmap(2), or by using MAP_FIXED; then the kernel won't
> > AA> waste time searching for the first available slot above
> > AA> TASK_UNMAPPED_BASE.
>
> > Well, you can't really do this, because you cannot really track all of
> > the mappings in a user program, as glibc and probably other libraries
> > use mmap for their own purposes.
>
> There's no reason we couldn't do this hint in kernel space.
>
> In arch_get_unmapped_area we can simply keep track of the
> lowest address where we found free space, while on munmap()
> we can adjust this hint if needed.
>
> OTOH, I doubt it would help real-world workloads where the
> application maps and unmaps areas of different sizes and
> actually does something with the memory instead of just
> mapping and unmapping it ;)))
Exactly; while that would be simple to implement and very lightweight at
runtime, it's not enough to mathematically reduce the complexity of the
get_unmapped_area algorithm. It would optimize only the case where
there's no fragmentation of the mapped virtual address space.
For finding the best fit in the heap with O(log(N)) complexity (rather
than the current O(N) complexity of the linked list), a tree indexed by
the size of each hole would be necessary.
Andrea
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re[3]: your mail on mmap() to the kernel list
2001-12-04 16:42 ` Rik van Riel
2001-12-04 16:55 ` Andrea Arcangeli
@ 2001-12-05 14:36 ` Peter Zaitsev
1 sibling, 0 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-05 14:36 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Andrew Morton, theowl, theowl, linux-kernel,
Linus Torvalds
Hello Rik,
Tuesday, December 04, 2001, 7:42:28 PM, you wrote:
>> Well, really you can't do this, because you cannot track all of
>> the mappings in a user program, as glibc and probably other libraries
>> use mmap for their own purposes.
RvR> There's no reason we couldn't do this hint in kernel space.
Yes. This cache would probably give a good hit rate. Of course it does
not reduce the mathematical complexity, but speeding things up by a couple
of times is good anyway :)
RvR> In arch_get_unmapped_area we can simply keep track of the
RvR> lowest address where we found free space, while on munmap()
RvR> we can adjust this hint if needed.
RvR> OTOH, I doubt it would help real-world workloads where the
RvR> application maps and unmaps areas of different sizes and
RvR> actually does something with the memory instead of just
RvR> mapping and unmapping it ;)))
Well, this is quite simple, I think. A database may use mmap() to access the
data in its files; since you can't map everything into a 32-bit address
space, you have to map just parts of the files, and since you can't have
too many mapped chunks, you end up with different chunk sizes when merging
contiguous mmapped areas. Another thing: some databases support different
physical page sizes, so this will be true even without merging.
One more thing: fread(), at least in some cases, mmaps the file, so if you
use it aggressively you will run into this problem.
Anyway, even when the mmapped chunks are of different sizes, most of them
should be around the same size in most cases.
--
Best regards,
Peter mailto:pz@spylog.ru
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re[2]: your mail on mmap() to the kernel list
2001-12-04 16:55 ` Andrea Arcangeli
@ 2001-12-05 14:38 ` Peter Zaitsev
0 siblings, 0 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-05 14:38 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Andrew Morton, theowl, theowl, linux-kernel,
Linus Torvalds
Hello Andrea,
>>
>> OTOH, I doubt it would help real-world workloads where the
>> application maps and unmaps areas of different sizes and
>> actually does something with the memory instead of just
>> mapping and unmapping it ;)))
AA> exactly, while that would be simple to implement and very lightweight at
AA> runtime, that's not enough to mathematically drop the complexity of the
AA> get_unmapped_area algorithm. It would optimize only the case where
AA> there's no fragmentation of the mapped virtual address space.
And it will also optimize all 4K mappings (which are at least 70%
in my case) :)
AA> For finding the best fit in the heap with O(log(N)) complexity (rather
AA> than the current O(N) complexity of the linked list) one tree indexed by
AA> the size of each hole would be necessary.
This of course would be the best way.
--
Best regards,
Peter mailto:pz@spylog.ru
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re[2]: your mail on mmap() to the kernel list
2001-12-04 16:48 ` Andrea Arcangeli
@ 2001-12-05 16:12 ` Peter Zaitsev
2001-12-05 16:23 ` Andrea Arcangeli
0 siblings, 1 reply; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-05 16:12 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds
Hello Andrea,
Tuesday, December 04, 2001, 7:48:24 PM, you wrote:
>> AA> You can fix the problem in userspace by using a meaningful 'addr' as
>> AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel
>> AA> won't waste time searching the first available mapping over
>> AA> TASK_UNMAPPED_BASE.
>> Well. Really you can't do this, because you can not really track all of
AA> I can do this sometimes; I did it in the patch I posted, for example.
Well, you can really do this easily only in such a dumb test example; once
you do something real between the mmaps, even memory allocation in a
multi-threaded program, you will not be able to track it.
AA> If
AA> you want transparency and genericity you can still do it in userspace
AA> with an LD_LIBRARY_PATH that loads a libc that builds the hole-lookup
AA> data structure (tree?) internally, and then passes the hint to the
AA> kernel if the program suggests 0.
Well, but it's not really an easy or right thing to do, as you will also
need to write your own memory allocator (the standard one uses mmap(), so
you can get into a recursive loop). It will also be really platform/kernel
dependent, as different kernels may use different addresses where the user
part of the address space available for mmapping starts.
AA> OTOH I prefer those kind of algorithms (if needed) to be in the kernel
AA> so they're faster/cleaner and also a lib would need some magic knowledge
AA> about kernel internals virtual addresses to do it right.
Yes, I quite agree with that. Also, if you have one tree implemented,
it should not be too hard to have a second tree indexed by another
field.
>> AA> the common case should matter more I guess, and of course userspace can
>> AA> just fix this without any kernel modification by passing an helpful
>> AA> "addr", to the mmap syscall with very very little effort. Untested
>> AA> patch follows:
>>
>> Could you please explain a bit how this hint works? Does it try
AA> man 2 mmap, then search for 'start'
I've read it of course :)
>> only the specified address, or does it also try to look up free space from
>> this point?
AA> in short it checks if there's 'len' free space at address 'start', and
AA> if there is, it maps it there, without having to search for the first
AA> large enough hole in the heap.
I understand that, but does it then start to search the standard way from
the low addresses, or does it look above the hint address before looking
at lower addresses?
--
Best regards,
Peter mailto:pz@spylog.ru
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: your mail on mmap() to the kernel list
2001-12-05 16:12 ` Re[2]: " Peter Zaitsev
@ 2001-12-05 16:23 ` Andrea Arcangeli
0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-05 16:23 UTC (permalink / raw)
To: Peter Zaitsev; +Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds
On Wed, Dec 05, 2001 at 07:12:03PM +0300, Peter Zaitsev wrote:
> I understand it but does it start so search standard way from the low
> addresses or it looks it above the hint address before looking at
> lower addresses ?
mmap(2) with a "valid" hint (i.e. with enough free space in the hole
between 'start' and 'start+len') runs with O(log(N)) complexity, because
it uses the tree internally to verify that it's a valid hole. It's only
the search for a hole that has O(N) complexity; the check of one hole has
just O(log(N)) complexity.
Andrea
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2001-12-05 16:24 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <3C082244.8587.80EF082@localhost>
2001-12-01 8:31 ` Re[2]: your mail on mmap() to the kernel list Peter Zaitsev
2001-12-01 9:37 ` Andrew Morton
2001-12-02 23:07 ` Re[2]: " Peter Zaitsev
2001-12-02 23:34 ` Andrew Morton
2001-12-03 10:10 ` Re[2]: " Peter Zaitsev
2001-12-04 14:18 ` Andrea Arcangeli
2001-12-04 14:15 ` Andrea Arcangeli
2001-12-04 15:36 ` Re[2]: " Peter Zaitsev
2001-12-04 16:42 ` Rik van Riel
2001-12-04 16:55 ` Andrea Arcangeli
2001-12-05 14:38 ` Re[2]: " Peter Zaitsev
2001-12-05 14:36 ` Re[3]: " Peter Zaitsev
2001-12-04 16:48 ` Andrea Arcangeli
2001-12-05 16:12 ` Re[2]: " Peter Zaitsev
2001-12-05 16:23 ` Andrea Arcangeli