* Re[2]: your mail on mmap() to the kernel list
  [not found] <3C082244.8587.80EF082@localhost>
@ 2001-12-01  8:31 ` Peter Zaitsev
  2001-12-01  9:37   ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-01 8:31 UTC (permalink / raw)
To: theowl; +Cc: theowl, linux-kernel

Hello theowl,

Saturday, December 01, 2001, 2:20:20 AM, you wrote:

Well, thank you.  I had already found this - in recent kernels it is
even tunable via procfs.

The only remaining question is why anonymous mapping is rather fast
(I get 1000 req/sec even with 300,000 blocks mapped), whereas when
mapping an fd the performance drops to 20 req/sec at that number.
Any idea why this happens, or does a patch exist that improves the
speed?

tfch> in include/linux/sched.h:
tfch>
tfch> /* Maximum number of active map areas.. This is a random (large) number */
tfch> #define MAX_MAP_COUNT	(65536)
tfch>
tfch> this should probably be (TASK_SIZE / PAGE_SIZE) on 32 bit architectures
tfch> and something 'reasonably big' on 64 bit (too big of a value would
tfch> allow for a nice DoS against the kernel).

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: your mail on mmap() to the kernel list
  2001-12-01  8:31 ` Re[2]: your mail on mmap() to the kernel list Peter Zaitsev
@ 2001-12-01  9:37   ` Andrew Morton
  2001-12-02 23:07     ` Re[2]: " Peter Zaitsev
  2001-12-04 14:15     ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2001-12-01 9:37 UTC (permalink / raw)
To: Peter Zaitsev
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli

Peter Zaitsev wrote:
>
> Hello theowl,
>
> Saturday, December 01, 2001, 2:20:20 AM, you wrote:
>
> Well, thank you.  I had already found this - in recent kernels it is
> even tunable via procfs.
>
> The only remaining question is why anonymous mapping is rather fast
> (I get 1000 req/sec even with 300,000 blocks mapped), whereas when
> mapping an fd the performance drops to 20 req/sec at that number.

well, a kernel profile is pretty unambiguous:

c0116af4 mm_init                         1   0.0050
c0117394 do_fork                         1   0.0005
c0124ccc clear_page_tables               1   0.0064
c0125af0 do_wp_page                      1   0.0026
c01260a0 do_no_page                      1   0.0033
c01265dc find_vma_prepare                1   0.0081
c0129138 file_read_actor                 1   0.0093
c012d95c kmem_cache_alloc                1   0.0035
c0147890 d_lookup                        1   0.0036
c01573dc write_profile                   1   0.0061
c0169d44 journal_add_journal_head        1   0.0035
c0106e88 system_call                     2   0.0357
c01264bc unlock_vma_mappings             2   0.0500
c0135bb4 fget                            2   0.0227
c028982c __generic_copy_from_user        2   0.0192
c01267ec do_mmap_pgoff                   4   0.0035
c0126d6c find_vma                        7   0.0761
c0105000 _stext                         16   0.1667
c0126c70 get_unmapped_area            4991  19.8056
c0105278 poll_idle                    4997 124.9250
00000000 total                       10034   0.0062

The `poll_idle' count is from the other CPU.

What appears to be happening is that the VMA tree has degenerated
into a monstrous singly linked list.  All those little 4k mappings
are individual data structures, chained one after the other.

The reason you don't see it with an anonymous map is, I think, that
the kernel will merge contiguous anon mappings into a single one, so
there's basically nothing to be searched each time you request some
more memory.
Running the same test on the -ac kernel is 4-5% slower, so Andrea's
new rbtree implementation makes a better linked list than the old
AVL tree's one :)

It strikes me that this is not a completely stupid usage pattern, and
that perhaps it's worth thinking about some tweaks to cope with it.
I really don't know, but I've Cc'ed some people who do.

Here's the test app:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

int main(void)
{
	int i = 0;
	void *p;
	int t;
	int fd;

	fd = open("test.dat", O_RDWR);
	if (fd < 0) {
		puts("Unable to open file !");
		return 1;
	}
	t = time(NULL);
	while (1) {
		p = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		i++;
		if (i % 10000 == 0) {
			printf(" %d  Time: %d\n", i, (int)(time(NULL) - t));
			t = time(NULL);
		}
	}
}
* Re[2]: your mail on mmap() to the kernel list
  2001-12-01  9:37 ` Andrew Morton
@ 2001-12-02 23:07   ` Peter Zaitsev
  2001-12-02 23:34     ` Andrew Morton
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-02 23:07 UTC (permalink / raw)
To: Andrew Morton
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli

Hello Andrew,

Saturday, December 01, 2001, 12:37:01 PM, you wrote:

>> The only remaining question is why anonymous mapping is rather fast
>> (I get 1000 req/sec even with 300,000 blocks mapped), whereas when
>> mapping an fd the performance drops to 20 req/sec at that number.

AM> well, a kernel profile is pretty unambiguous:
AM>
AM> c0116af4 mm_init                         1   0.0050
AM> c0117394 do_fork                         1   0.0005
AM> c0124ccc clear_page_tables               1   0.0064
AM> c0125af0 do_wp_page                      1   0.0026
AM> c01260a0 do_no_page                      1   0.0033
AM> c01265dc find_vma_prepare                1   0.0081
AM> c0129138 file_read_actor                 1   0.0093
AM> c012d95c kmem_cache_alloc                1   0.0035
AM> c0147890 d_lookup                        1   0.0036
AM> c01573dc write_profile                   1   0.0061
AM> c0169d44 journal_add_journal_head        1   0.0035
AM> c0106e88 system_call                     2   0.0357
AM> c01264bc unlock_vma_mappings             2   0.0500
AM> c0135bb4 fget                            2   0.0227
AM> c028982c __generic_copy_from_user        2   0.0192
AM> c01267ec do_mmap_pgoff                   4   0.0035
AM> c0126d6c find_vma                        7   0.0761
AM> c0105000 _stext                         16   0.1667
AM> c0126c70 get_unmapped_area            4991  19.8056
AM> c0105278 poll_idle                    4997 124.9250
AM> 00000000 total                       10034   0.0062
AM>
AM> The `poll_idle' count is from the other CPU.
AM>
AM> What appears to be happening is that the VMA tree has degenerated
AM> into a monstrous singly linked list.  All those little 4k mappings
AM> are individual data structures, chained one after the other.

Well, the thing is, I modified the application a bit so that it
randomly asks for 1 to 64 pages, and the execution profile still looks
the same.  So this happens not only when mapping equal sizes; it also
affects the initial mapping of many differently sized chunks.
So the next test was also simple - I started to deallocate, in the
same order, the first mmapped chunks, keeping only the last 40,000
mapped chunks:

 31000  Time: 5
 32000  Time: 4
 33000  Time: 5
 34000  Time: 5
 35000  Time: 5
 36000  Time: 6
 37000  Time: 5
 38000  Time: 6
 39000  Time: 5
 40000  Time: 6
 41000  Time: 0
 42000  Time: 0
 43000  Time: 1
 44000  Time: 0
 45000  Time: 0
 46000  Time: 1
 47000  Time: 1
 48000  Time: 1
 49000  Time: 1
 50000  Time: 1

As you see, once I start to free chunks, free slots can be found
rather fast.

Then I changed it to hold only 20,000 mappings on loop iterations
20,000 to 40,000, and then to stop freeing and continue allocating:

  2000  Time: 0
  4000  Time: 0
  6000  Time: 1
  8000  Time: 2
 10000  Time: 2
 12000  Time: 4
 14000  Time: 4
 16000  Time: 4
 18000  Time: 5
 20000  Time: 6
 22000  Time: 0
 24000  Time: 1
 26000  Time: 1
 28000  Time: 1
 30000  Time: 3
 32000  Time: 3
 34000  Time: 4
 36000  Time: 5
 38000  Time: 5
 40000  Time: 6
 42000  Time: 6
 44000  Time: 7
 46000  Time: 8

Quite surprising - as you see, the speed recovers at first but then
degrades quite fast, even though the number of mapped pages stays the
same over the interval 20,000 - 40,000.

And now back to the previous test, this time with a steady 20,000
mapped pages.  As you see, there is a cycle with a period of 20,000 -
within each period the pages with low addresses are freed, so it
looks exactly as if the address space is scanned from low addresses
to high, looking for the first place the page fits:

   5000  Time: 1
  10000  Time: 4
  15000  Time: 10
  20000  Time: 13
  25000  Time: 1
  30000  Time: 5
  35000  Time: 9
  40000  Time: 14
  45000  Time: 1
  50000  Time: 5
  55000  Time: 9
  60000  Time: 13
  65000  Time: 1
  70000  Time: 5
  75000  Time: 10
  80000  Time: 13
  85000  Time: 1
  90000  Time: 5
  95000  Time: 10
 100000  Time: 13
 105000  Time: 1
 110000  Time: 5
 115000  Time: 9
 120000  Time: 14

AM> The reason you don't see it with an anonymous map is, I think, that
AM> the kernel will merge contiguous anon mappings into a single one,
AM> so there's basically nothing to be searched each time you request
AM> some more memory.
AM> Running the same test on the -ac kernel is 4-5% slower, so Andrea's
AM> new rbtree implementation makes a better linked list than the old
AM> AVL tree's one :)
AM>
AM> It strikes me that this is not a completely stupid usage pattern,
AM> and that perhaps it's worth thinking about some tweaks to cope with
AM> it.  I really don't know, but I've Cc'ed some people who do.

Hope so :)

Also, as you see, the other patterns show the same fast performance
degradation as the number of mappings grows.  I can also test random
allocation and freeing, but something tells me the result will be the
same.  Hope to hear some info from the gurus :)  Please write me if
some patches become available - I will be happy to test them.

The last test program (if interested):

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>

/* kept off the stack: a million slots each would overflow it */
static void *arr[1000000];
static int sz[1000000];

int main(void)
{
	int i = 0;
	void *p;
	int t;
	int fd;
	int size;

	t = time(NULL);
	while (1) {
		fd = open("/spylog/1/test.dat", O_RDWR);
		if (fd < 0) {
			puts("Unable to open file !");
			return 1;
		}
//		size = (((double)random() / RAND_MAX * 16) + 1) * 4096;
		size = 4096;
		p = mmap((void *)0x60000000, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			printf("Failed %d\n", errno);
			return 1;
		}
		arr[i] = p;
		sz[i] = size;
		if (i > 20000)
			munmap(arr[i - 20000], sz[i - 20000]);
		i++;
		if (i % 5000 == 0) {
			printf(" %d  Time: %d\n", i, (int)(time(NULL) - t));
			t = time(NULL);
		}
	}
}

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru
* Re: your mail on mmap() to the kernel list
  2001-12-02 23:07 ` Re[2]: " Peter Zaitsev
@ 2001-12-02 23:34   ` Andrew Morton
  2001-12-03 10:10     ` Re[2]: " Peter Zaitsev
  2001-12-04 14:18     ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2001-12-02 23:34 UTC (permalink / raw)
To: Peter Zaitsev
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli

Peter Zaitsev wrote:
>
> ...

It's very simple.  The kernel has a linked list of vma's for your
process.  It is kept in address order.  Each time you request a new
mapping, the kernel walks down that list looking for an address at
which to place the new mapping.  This data structure is set up for
efficient find-by-address, not for efficient find-a-gap.

Question is: do we need to optimise for this case?

If it's just a single file, then you'd be better off just mapping the
whole thing.  If you need to map lots and lots of files then you'll
hit the maximum file limit fairly early, and the mmap() performance
won't be great, but maybe acceptable.

One scenario where this could be a problem is a file which is too
large to be mapped in its entirety, but where the application needs
access to lots of bits of it at the same time.  There doesn't seem to
be much alternative to setting up a distinct mapping for each access
window in this case.

> Also, as you see, the other patterns show the same fast performance
> degradation as the number of mappings grows.  I can also test random
> allocation and freeing, but something tells me the result will be
> the same.

Is this proving to be a problem in a real-world application?
What are you trying to do here?

-
* Re[2]: your mail on mmap() to the kernel list
  2001-12-02 23:34 ` Andrew Morton
@ 2001-12-03 10:10   ` Peter Zaitsev
  2001-12-04 14:18   ` Andrea Arcangeli
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-03 10:10 UTC (permalink / raw)
To: Andrew Morton
Cc: theowl, theowl, linux-kernel, Linus Torvalds, Andrea Arcangeli

Hello Andrew,

Monday, December 03, 2001, 2:34:54 AM, you wrote:

AM> It's very simple.  The kernel has a linked list of vma's for your
AM> process.  It is kept in address order.  Each time you request a
AM> new mapping, the kernel walks down that list looking for an address
AM> at which to place the new mapping.  This data structure is set up
AM> for efficient find-by-address, not for efficient find-a-gap.

Yes, I see.  I've tried some other load patterns, like random
allocation/deallocation, but I still see the speed degrade as the
number of mapped blocks increases :(

The other thing I do not get is the anonymous mapping case.  I've now
tried a simple variation - quite the same program, but every second
mapping was anonymous, to prevent merging of the calls - and I still
see a speed gain:

1) Non-anonymous

  5000  Time: 1   Mapped: 4999   Unmapped: 1661
 10000  Time: 2   Mapped: 9999   Unmapped: 3311
 15000  Time: 6   Mapped: 14999  Unmapped: 4957
 20000  Time: 9   Mapped: 19999  Unmapped: 6620
 25000  Time: 12  Mapped: 24999  Unmapped: 8290
 30000  Time: 15  Mapped: 29999  Unmapped: 9975
 35000  Time: 17  Mapped: 34999  Unmapped: 11643
 40000  Time: 20  Mapped: 39999  Unmapped: 13287
 45000  Time: 23  Mapped: 44999  Unmapped: 14953
 50000  Time: 26  Mapped: 49999  Unmapped: 16618
 55000  Time: 28  Mapped: 54999  Unmapped: 18311
 60000  Time: 31  Mapped: 59999  Unmapped: 20013

2) Every second mapping anonymous

  5000  Time: 1   Mapped: 4999   Unmapped: 1661
 10000  Time: 1   Mapped: 9999   Unmapped: 3311
 15000  Time: 3   Mapped: 14999  Unmapped: 4957
 20000  Time: 6   Mapped: 19999  Unmapped: 6620
 25000  Time: 8   Mapped: 24999  Unmapped: 8290
 30000  Time: 9   Mapped: 29999  Unmapped: 9975
 35000  Time: 12  Mapped: 34999  Unmapped: 11643
 40000  Time: 13  Mapped: 39999  Unmapped: 13287
 45000  Time: 15  Mapped: 44999  Unmapped: 14953
 50000  Time: 18  Mapped: 49999  Unmapped: 16618
 55000  Time: 18  Mapped: 54999  Unmapped: 18311
 60000  Time: 21  Mapped: 59999  Unmapped: 20013

The question is: why?  Is anonymous mapping handled in a separate way
that keeps it always able to merge?  This is important for me, as I
can stand slowed-down mmapping of files but really do not want memory
allocation to slow down (mmap is used by a multi-threaded program).

AM> Question is: do we need to optimise for this case?

I hope you would.  Or at least that some patches will exist for
this :)

AM> If it's just a single file, then you'd be better off just mapping
AM> the whole thing.  If you need to map lots and lots of files then
AM> you'll hit the maximum file limit fairly early, and the mmap()
AM> performance won't be great, but maybe acceptable.

Well, I need many small files.  These files contain different clients'
data, so they cannot be merged.  I will not hit the open file limit -
I've recently tested with 500,000 open files, and it seems to work
with no real speed penalty.  So this is really the strange case: I'm
able to open 500,000+ files in an efficient way but can't effectively
map more than 20,000 of them :(

AM> One scenario where this could be a problem is a file which is too
AM> large to be mapped in its entirety, but where the application needs
AM> access to lots of bits of it at the same time.  There doesn't seem
AM> to be much alternative to setting up a distinct mapping for each
AM> access window in this case.

Yes.  I also found that migrating to a small number of big files will
not help much - it would produce some other problems :)

>> Also, as you see, the other patterns show the same fast performance
>> degradation as the number of mappings grows.  I can also test random
>> allocation and freeing, but something tells me the result will be
>> the same.
AM> Is this proving to be a problem in a real-world application?
AM> What are you trying to do here?

I had two problems which I tried to solve by mmapping a lot of files.
My application works with a lot of data - 2G+ - where not all of it is
accessed often, so some of it can be swapped out without any problem
for the application.  But I got two problems:

1) User address space on Intel is 3.5G (with Andrea's patches).  I
can't afford to migrate to a 64-bit environment, as there are a lot of
machines running the application.  So I've set up a cache of mapped
pieces; as some of them are accessed rather rarely, this works only
with a relatively small number of mapped chunks.

2) The second problem is more complicated.  My program may work
efficiently with only 20% of the data present in memory, but I must
save things to disk periodically.  With read/write this leads to a lot
of swapping, as a page must be swapped in before it can be written to
disk.  Even more: only a small percentage of the pages really change
and so need to be saved.  That is hard to track by hand, whereas mmap
should theoretically handle it automatically.

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru
* Re: your mail on mmap() to the kernel list 2001-12-02 23:34 ` Andrew Morton 2001-12-03 10:10 ` Re[2]: " Peter Zaitsev @ 2001-12-04 14:18 ` Andrea Arcangeli 1 sibling, 0 replies; 15+ messages in thread From: Andrea Arcangeli @ 2001-12-04 14:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Peter Zaitsev, theowl, theowl, linux-kernel, Linus Torvalds On Sun, Dec 02, 2001 at 03:34:54PM -0800, Andrew Morton wrote: > Peter Zaitsev wrote: > > > > ... > > It's very simple. The kernel has a linked list of vma's for your > process. It is kept in address order. Each time you request a > new mapping, the kernel walks down that list looking for an address > at which to place the new mapping. This data structure is set up > for efficient find-by-address, not for efficient find-a-gap. exactly. Andrea ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: your mail on mmap() to the kernel list
  2001-12-01  9:37 ` Andrew Morton
  2001-12-02 23:07   ` Re[2]: " Peter Zaitsev
@ 2001-12-04 14:15   ` Andrea Arcangeli
  2001-12-04 15:36     ` Re[2]: " Peter Zaitsev
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2001-12-04 14:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zaitsev, theowl, theowl, linux-kernel, Linus Torvalds

On Sat, Dec 01, 2001 at 01:37:01AM -0800, Andrew Morton wrote:
> Peter Zaitsev wrote:
> >
> > Hello theowl,
> >
> > Saturday, December 01, 2001, 2:20:20 AM, you wrote:
> >
> > Well, thank you.  I had already found this - in recent kernels it
> > is even tunable via procfs.
> >
> > The only remaining question is why anonymous mapping is rather fast
> > (I get 1000 req/sec even with 300,000 blocks mapped), whereas when
> > mapping an fd the performance drops to 20 req/sec at that number.
>
> well, a kernel profile is pretty unambiguous:
>
> c0116af4 mm_init                         1   0.0050
> c0117394 do_fork                         1   0.0005
> c0124ccc clear_page_tables               1   0.0064
> c0125af0 do_wp_page                      1   0.0026
> c01260a0 do_no_page                      1   0.0033
> c01265dc find_vma_prepare                1   0.0081
> c0129138 file_read_actor                 1   0.0093
> c012d95c kmem_cache_alloc                1   0.0035
> c0147890 d_lookup                        1   0.0036
> c01573dc write_profile                   1   0.0061
> c0169d44 journal_add_journal_head        1   0.0035
> c0106e88 system_call                     2   0.0357
> c01264bc unlock_vma_mappings             2   0.0500
> c0135bb4 fget                            2   0.0227
> c028982c __generic_copy_from_user        2   0.0192
> c01267ec do_mmap_pgoff                   4   0.0035
> c0126d6c find_vma                        7   0.0761
> c0105000 _stext                         16   0.1667
> c0126c70 get_unmapped_area            4991  19.8056
> c0105278 poll_idle                    4997 124.9250
> 00000000 total                       10034   0.0062

the vma lookup overhead is nothing compared to the walking of the
linked list (not of the tree) to find the first available slot above
TASK_UNMAPPED_BASE.  In the vma lookup engine rewrite I only cared
about find_vma, not about optimizing the search of a free vma over
TASK_UNMAPPED_BASE.  Such a list-loop is really evil.
> What appears to be happening is that the VMA tree has degenerated
> into a monstrous singly linked list.  All those little 4k mappings

actually it's not that the tree degenerated into a list.  It's that we
need to walk all the vmas to check if there is a large enough hole to
place the new dynamic mapping, and for that walk we use the list, not
the tree, because it needs less memory and it's simpler and faster.

You can fix the problem in userspace by using a meaningful 'addr' as a
hint to mmap(2), or by using MAP_FIXED; then the kernel won't waste
time searching for the first available hole above TASK_UNMAPPED_BASE.

> The reason you don't see it with an anonymous map is, I think, that
> the kernel will merge contiguous anon mappings into a single one,

Yes.  But actually merging file-backed vmas too would be a goodness,
independently of the arch_get_unmapped_area loop.  It's not possible
right now because the anonymous memory case is much simpler: no
i_shared_locks etc... but I'd like the file-backed vmas to become
mergeable too in the long run.  A side effect of that goodness is that
the loop would also run faster, of course.

But technically, to really kill the evil complexity of the loop (for
example if every vma belongs to a different file, so nothing can be
merged anyway), we'd need a tree of "holes" indexed as a function of
the size of the hole.  But it seems a very dubious optimization...
Complexity-wise it makes sense, but in practice the common case should
matter more, I guess, and of course userspace can just fix this
without any kernel modification by passing a helpful "addr" to the
mmap syscall, with very very little effort.
Untested patch follows:

@@ -68,7 +9,7 @@
 int main()
 {
 	int i = 0;
-	void *p;
+	void *p = NULL;
 	int t;
 	int fd;
@@ -79,7 +20,7 @@
 	}
 	t = time(NULL);
 	while (1) {
-		p = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+		p = mmap(p, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
 		if ((int) p == -1) {
 			perror("mmap");
 			return;

Andrea
* Re[2]: your mail on mmap() to the kernel list
  2001-12-04 14:15 ` Andrea Arcangeli
@ 2001-12-04 15:36   ` Peter Zaitsev
  2001-12-04 16:42     ` Rik van Riel
  2001-12-04 16:48     ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-04 15:36 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds

Hello Andrea,

Tuesday, December 04, 2001, 5:15:49 PM, you wrote:

AA> the vma lookup overhead is nothing compared to the walking of the
AA> linked list (not of the tree) to find the first available slot
AA> above TASK_UNMAPPED_BASE.  In the vma lookup engine rewrite I only
AA> cared about find_vma, not about optimizing the search of a free vma
AA> over TASK_UNMAPPED_BASE.  Such a list-loop is really evil.

Sure.

>> What appears to be happening is that the VMA tree has degenerated
>> into a monstrous singly linked list.  All those little 4k mappings

AA> actually it's not that the tree degenerated into a list.  It's that
AA> we need to walk all the vmas to check if there is a large enough
AA> hole to place the new dynamic mapping, and for that walk we use the
AA> list, not the tree, because it needs less memory and it's simpler
AA> and faster.

AA> You can fix the problem in userspace by using a meaningful 'addr'
AA> as a hint to mmap(2), or by using MAP_FIXED; then the kernel won't
AA> waste time searching for the first available hole above
AA> TASK_UNMAPPED_BASE.

Well, really you can't do this, because you cannot track all of the
mappings in the user program: glibc, and probably other libraries, use
mmap for their own purposes.  This may work only at program start,
when you can be almost sure that only low addresses are in use.  In my
case I'm implementing a cache of mmapped chunks, so there are always
some mmaps/munmaps going on.  Also, I can't just pick "random"
addresses with a high probability of success, as my application mmaps
about 50% of the user address space (meaning 3.5G with your patches).
>> The reason you don't see it with an anonymous map is, I think, that
>> the kernel will merge contiguous anon mappings into a single one,

AA> Yes.  But actually merging file-backed vmas too would be a
AA> goodness, independently of the arch_get_unmapped_area loop.  It's
AA> not possible right now because the anonymous memory case is much
AA> simpler: no i_shared_locks etc... but I'd like the file-backed vmas
AA> to become mergeable too in the long run.  A side effect of that
AA> goodness is that the loop would also run faster, of course.

Do you think there is a big chance that two adjacent mappings will
belong to different parts of the same file?  I think this should be
true only for some specific workloads.

AA> But technically, to really kill the evil complexity of the loop
AA> (for example if every vma belongs to a different file, so nothing
AA> can be merged anyway), we'd need a tree of "holes" indexed as a
AA> function of the size of the hole.  But it seems a very dubious
AA> optimization...

Are there many problems with this approach?  Much memory usage or CPU
overhead somewhere?

AA> Complexity-wise it makes sense, but in practice the common case
AA> should matter more, I guess, and of course userspace can just fix
AA> this without any kernel modification by passing a helpful "addr"
AA> to the mmap syscall, with very very little effort.

Could you please explain a bit how this hint works?  Does it try only
the specified address, or does it also look for free space starting
from that point?

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru
* Re[2]: your mail on mmap() to the kernel list 2001-12-04 15:36 ` Re[2]: " Peter Zaitsev @ 2001-12-04 16:42 ` Rik van Riel 2001-12-04 16:55 ` Andrea Arcangeli 2001-12-05 14:36 ` Re[3]: " Peter Zaitsev 2001-12-04 16:48 ` Andrea Arcangeli 1 sibling, 2 replies; 15+ messages in thread From: Rik van Riel @ 2001-12-04 16:42 UTC (permalink / raw) To: Peter Zaitsev Cc: Andrea Arcangeli, Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds On Tue, 4 Dec 2001, Peter Zaitsev wrote: > Tuesday, December 04, 2001, 5:15:49 PM, you wrote: > AA> You can fix the problem in userspace by using a meaningful 'addr' as > AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel > AA> won't waste time searching the first available mapping over > AA> TASK_UNMAPPED_BASE. > Well. Really you can't do this, because you can not really track all of > the mappings in user program as glibc and probably other libraries > use mmap for their purposes. There's no reason we couldn't do this hint in kernel space. In arch_get_unmapped_area we can simply keep track of the lowest address where we found free space, while on munmap() we can adjust this hint if needed. OTOH, I doubt it would help real-world workloads where the application maps and unmaps areas of different sizes and actually does something with the memory instead of just mapping and unmapping it ;))) kind regards, Rik -- Shortwave goes a long way: irc.starchat.net #swl http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: your mail on mmap() to the kernel list 2001-12-04 16:42 ` Rik van Riel @ 2001-12-04 16:55 ` Andrea Arcangeli 2001-12-05 14:38 ` Re[2]: " Peter Zaitsev 2001-12-05 14:36 ` Re[3]: " Peter Zaitsev 1 sibling, 1 reply; 15+ messages in thread From: Andrea Arcangeli @ 2001-12-04 16:55 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zaitsev, Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds On Tue, Dec 04, 2001 at 02:42:28PM -0200, Rik van Riel wrote: > On Tue, 4 Dec 2001, Peter Zaitsev wrote: > > Tuesday, December 04, 2001, 5:15:49 PM, you wrote: > > > AA> You can fix the problem in userspace by using a meaningful 'addr' as > > AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel > > AA> won't waste time searching the first available mapping over > > AA> TASK_UNMAPPED_BASE. > > > Well. Really you can't do this, because you can not really track all of > > the mappings in user program as glibc and probably other libraries > > use mmap for their purposes. > > There's no reason we couldn't do this hint in kernel space. > > In arch_get_unmapped_area we can simply keep track of the > lowest address where we found free space, while on munmap() > we can adjust this hint if needed. > > OTOH, I doubt it would help real-world workloads where the > application maps and unmaps areas of different sizes and > actually does something with the memory instead of just > mapping and unmapping it ;))) exactly, while that would be simple to implement and very lightweight at runtime, that's not enough to mathematically drop the complexity of the get_unmapped_area algorithm. It would optimize only the case where there's no fragmentation of the mapped virtual address space. For finding the best fit in the heap with O(log(N)) complexity (rather than the current O(N) complexity of the linked list) one tree indexed by the size of each hole would be necessary. Andrea ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re[2]: your mail on mmap() to the kernel list
  2001-12-04 16:55 ` Andrea Arcangeli
@ 2001-12-05 14:38   ` Peter Zaitsev
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-05 14:38 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds

Hello Andrea,

>> OTOH, I doubt it would help real-world workloads where the
>> application maps and unmaps areas of different sizes and
>> actually does something with the memory instead of just
>> mapping and unmapping it ;)))

AA> exactly, while that would be simple to implement and very
AA> lightweight at runtime, that's not enough to mathematically drop
AA> the complexity of the get_unmapped_area algorithm.  It would
AA> optimize only the case where there's no fragmentation of the
AA> mapped virtual address space.

It would also optimize all 4K mappings (which are at least 70% in my
case) :)

AA> For finding the best fit in the heap with O(log(N)) complexity
AA> (rather than the current O(N) complexity of the linked list), one
AA> tree indexed by the size of each hole would be necessary.

This of course would be the best way.

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru
* Re[3]: your mail on mmap() to the kernel list
  2001-12-04 16:42 ` Rik van Riel
  2001-12-04 16:55   ` Andrea Arcangeli
@ 2001-12-05 14:36   ` Peter Zaitsev
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Zaitsev @ 2001-12-05 14:36 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds

Hello Rik,

Tuesday, December 04, 2001, 7:42:28 PM, you wrote:

>> Well, really you can't do this, because you cannot track all of the
>> mappings in the user program: glibc, and probably other libraries,
>> use mmap for their own purposes.

RvR> There's no reason we couldn't do this hint in kernel space.

Yes, this cache will probably give a good hit rate.  It of course does
not decrease the mathematical complexity, but speeding things up by a
couple of times is good anyway :)

RvR> In arch_get_unmapped_area we can simply keep track of the
RvR> lowest address where we found free space, while on munmap()
RvR> we can adjust this hint if needed.

RvR> OTOH, I doubt it would help real-world workloads where the
RvR> application maps and unmaps areas of different sizes and
RvR> actually does something with the memory instead of just
RvR> mapping and unmapping it ;)))

Well, such a workload is quite easy to come by, I think.  A database
may use mmap to access the data in its files; as you can't map
everything into a 32-bit address space, you have to map just parts of
the files, and since you can't have too many mapped chunks at once,
you end up with differently sized chunks from merging contiguous
mmapped areas.  Also, some databases support different physical page
sizes, so this holds even without merging.  One more thing: fread, at
least in some cases, mmaps the file - so if you use it aggressively
you will hit the problem.  Anyway, in most cases, even when the
mmapped chunks have different sizes, most of them should be around the
same size.

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru
* Re: your mail on mmap() to the kernel list 2001-12-04 15:36 ` Re[2]: " Peter Zaitsev 2001-12-04 16:42 ` Rik van Riel @ 2001-12-04 16:48 ` Andrea Arcangeli 2001-12-05 16:12 ` Re[2]: " Peter Zaitsev 1 sibling, 1 reply; 15+ messages in thread From: Andrea Arcangeli @ 2001-12-04 16:48 UTC (permalink / raw) To: Peter Zaitsev; +Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds On Tue, Dec 04, 2001 at 06:36:24PM +0300, Peter Zaitsev wrote: > Hello Andrea, > > Tuesday, December 04, 2001, 5:15:49 PM, you wrote: > > > AA> the vma lookup overhead is nothing compared to the walking of the linked > AA> list (not of the tree) to find the first available slot above > AA> TASK_UNMAPPED_BASE. In the vma lookup engine rewrite I only cared about > AA> find_vma, not about optimizing the search of a free vma over > AA> TASK_UNMAPPED_BASE. Such list-loop is really evil. > Sure. > > >> What appears to be happening is that the VMA tree has degenerated > >> into a monstrous singly linked list. All those little 4k mappings > > AA> actually it's not that the tree degenerated into a list. It's that we > AA> need to walk all the vma to check if there is a large enough hole to > AA> place the new dynamic mapping and so walk we use the list, not the tree, > AA> because it needs less mem resources and it's simpler and faster. > > AA> You can fix the problem in userspace by using a meaningful 'addr' as > AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel > AA> won't waste time searching the first available mapping over > AA> TASK_UNMAPPED_BASE. > Well. Really you can't do this, because you can not really track all of I can do this sometime, I did it in the patch I posted for example. If you want transparency and genericity you can still do it in userspace with an LD_LIBRARY_PATH that loads a libc that builds the hole-lookup data structure (tree?) internally, and then passes the hint to the kernel if the program suggests 0. 
OTOH I prefer those kinds of algorithms (if needed) to be in the kernel, so they're faster/cleaner, and also a lib would need some magic knowledge about kernel-internal virtual addresses to do it right. > >> The reason you don't see it with an anonymous map is, I think, that > >> the kernel will merge contiguous anon mappings into a single one, > > AA> Yes. But actually merging file-backed vmas as well would be a goodness > AA> independently of the arch_get_unmapped_area loop. It's not possible > AA> right now because the anonymous memory case is much simpler: no > AA> i_shared_locks etc... but I'd like it if in the long run the file-backed > AA> vmas become mergeable too. The side effect of the goodness is that > AA> the loop would of course also run faster. > Do you think there is a big chance that two adjacent mappings will belong to > different parts of one file? I think this should be true only for some > specific workloads. > AA> But technically to really kill > AA> the evil complexity of the loop (for example if every vma belongs to a > AA> different file so they cannot be merged anyway) we'd need a tree of > AA> "holes" indexed by the size of the hole. But it seems a very > AA> dubious optimization... > Are there many problems with this approach? Much memory usage or CPU > overhead somewhere? > AA> Complexity-wise it makes sense, but in practice > AA> the common case should matter more, I guess, and of course userspace can > AA> just fix this without any kernel modification by passing a helpful > AA> "addr" to the mmap syscall with very, very little effort. Untested > AA> patch follows: > > Could you please explain a bit how this hint works? Does it try AA> man 2 mmap, then search for 'start' > only the specified address, or does it also try to look up free space from > this point? in short, it checks whether there is 'len' free space at address 'start', and if there is, it maps it there, without having to search for the first large enough hole in the heap. 
Andrea ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re[2]: your mail on mmap() to the kernel list 2001-12-04 16:48 ` Andrea Arcangeli @ 2001-12-05 16:12 ` Peter Zaitsev 2001-12-05 16:23 ` Andrea Arcangeli 0 siblings, 1 reply; 15+ messages in thread From: Peter Zaitsev @ 2001-12-05 16:12 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds Hello Andrea, Tuesday, December 04, 2001, 7:48:24 PM, you wrote: >> AA> You can fix the problem in userspace by using a meaningful 'addr' as >> AA> hint to mmap(2), or by using MAP_FIXED from userspace, then the kernel >> AA> won't waste time searching the first available mapping over >> AA> TASK_UNMAPPED_BASE. >> Well. Really you can't do this, because you can not really track all of AA> I can do this sometime, I did it in the patch I posted for example. Well, you can easily do this only in a dumb test example like that; once you do something real between the mmaps, even just memory allocation in a multi-threaded program, you will no longer be able to track it. AA> If AA> you want transparency and genericity you can still do it in userspace AA> with an LD_LIBRARY_PATH that loads a libc that builds the hole-lookup AA> data structure (tree?) internally, and then passes the hint to the AA> kernel if the program suggests 0. Well, that is neither easy nor really the right thing to do, as you would also need to write your own memory allocator (the standard one uses mmap itself, so you can end up in a recursive loop). It would also be quite platform/kernel dependent, as different kernels may use different addresses for where the user part of the address space available for mmaping starts. AA> OTOH I prefer those kind of algorithms (if needed) to be in the kernel AA> so they're faster/cleaner and also a lib would need some magic knowledge AA> about kernel internals virtual addresses to do it right. Yes, I quite agree with that. Also, once you have one tree implemented, it should not be too hard to have a second tree indexed by another field. 
>> AA> the common case should matter more I guess, and of course userspace can >> AA> just fix this without any kernel modification by passing an helpful >> AA> "addr", to the mmap syscall with very very little effort. Untested >> AA> patch follows: >> >> Could you please explain a bit how this hint works? Does it try AA> man 2 mmap, then search for 'start' I have read it, of course :) >> only the specified address, or does it also try to look up free space from >> this point? AA> in short it checks if there's 'len' free space at address 'start', and AA> if there is, it maps it there, without having to search for the first AA> large enough hole in the heap. I understand that, but does it then start the search the standard way from the low addresses, or does it look above the hint address before looking at lower addresses? -- Best regards, Peter mailto:pz@spylog.ru ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: your mail on mmap() to the kernel list 2001-12-05 16:12 ` Re[2]: " Peter Zaitsev @ 2001-12-05 16:23 ` Andrea Arcangeli 0 siblings, 0 replies; 15+ messages in thread From: Andrea Arcangeli @ 2001-12-05 16:23 UTC (permalink / raw) To: Peter Zaitsev; +Cc: Andrew Morton, theowl, theowl, linux-kernel, Linus Torvalds On Wed, Dec 05, 2001 at 07:12:03PM +0300, Peter Zaitsev wrote: > I understand that, but does it then start the search the standard way from > the low addresses, or does it look above the hint address before looking > at lower addresses? mmap(2) with a "valid" hint (so with enough space in the hole between 'start' and 'start+len') runs with O(log(N)) complexity because it uses the tree internally to verify that it is a valid hole. It is only the search for a hole that has O(N) complexity; the "check" of one hole has just O(log(N)) complexity. Andrea ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2001-12-05 16:24 UTC | newest]
Thread overview: 15+ messages
-- links below jump to the message on this page --
[not found] <3C082244.8587.80EF082@localhost>
2001-12-01 8:31 ` Re[2]: your mail on mmap() to the kernel list Peter Zaitsev
2001-12-01 9:37 ` Andrew Morton
2001-12-02 23:07 ` Re[2]: " Peter Zaitsev
2001-12-02 23:34 ` Andrew Morton
2001-12-03 10:10 ` Re[2]: " Peter Zaitsev
2001-12-04 14:18 ` Andrea Arcangeli
2001-12-04 14:15 ` Andrea Arcangeli
2001-12-04 15:36 ` Re[2]: " Peter Zaitsev
2001-12-04 16:42 ` Rik van Riel
2001-12-04 16:55 ` Andrea Arcangeli
2001-12-05 14:38 ` Re[2]: " Peter Zaitsev
2001-12-05 14:36 ` Re[3]: " Peter Zaitsev
2001-12-04 16:48 ` Andrea Arcangeli
2001-12-05 16:12 ` Re[2]: " Peter Zaitsev
2001-12-05 16:23 ` Andrea Arcangeli