* Hugepages_Rsvd goes huge in 2.6.20-rc7
From: Nishanth Aravamudan @ 2007-02-06  0:19 UTC
To: linux-mm; +Cc: libhugetlbfs-devel, david, hugh

Hi all,

So, here's the current state of the hugepages portion of my /proc/meminfo
(x86_64, 2.6.20-rc7; I will test with 2.6.20 shortly, but AFAICS there
haven't been many changes to the hugepage code between the two):

HugePages_Total:   100
HugePages_Free:    100
HugePages_Rsvd:    18446744073709551615
Hugepagesize:     2048 kB

That's not good :)

Context: I'm currently working on some patches for libhugetlbfs which
should ultimately help us reduce our hugepage usage when remapping
segments so they are backed by hugepages. The current algorithm maps in
a hugepage file as MAP_SHARED, copies over the segment data, then unmaps
the file. It then unmaps the program's segments and maps the same
hugepage file back in MAP_PRIVATE, so that we take COW faults. Now, the
problem is that for writable segments (data) the COW fault instantiates
a new hugepage, but the original MAP_SHARED hugepage stays resident in
the page cache. So a program that could survive (after the initial
remapping) with only 2 hugepages ends up using 3 hugepages instead.

To work around this, I've modified the algorithm to prefault the
writable segment in the remapping code (via a one-byte read and write).
Then I issue a posix_fadvise(segment_fd, 0, 0, POSIX_FADV_DONTNEED) to
try and drop the shared hugepage from the page cache. With a small dummy
relinked app (that just sleeps), this does reduce our run-time hugepage
cost from 3 to 2. But I'm noticing that libhugetlbfs' `make func`
target, which tests libhugetlbfs' functionality only, every so often
leads to a lot of "VM killing process ...". This only appears to happen
with a particular testcase (xBDT.linkshare, which remaps the BSS, data
and text segments and tries to share the text segments between 2
processes), but when it does, it persists for a while (that is, if I try
to run that particular test manually, it keeps getting killed) and
/proc/meminfo reports a garbage value for HugePages_Rsvd like the one
listed above. If I rerun `make func`, the problem sometimes goes away
(and Rsvd returns to a sane value as well...).

I've added Hugh & David to the Cc, because they discussed a similar
problem a few months back. Maybe there is still a race somewhere?

I'm willing to test any possible fixes, and I'll work on making this
more easily reproducible (although it seems to happen pretty regularly
here) with a simpler test.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
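For readers following along, here is a minimal sketch of the remapping
approach described above -- not libhugetlbfs' actual code. The names
remap_writable_segment(), fd, seg_addr, seg_len and hpage_size are
illustrative only (fd is assumed to be an open file on a hugetlbfs
mount, seg_addr/seg_len the segment being remapped), and the unmapping
of the original program segments plus most error handling are omitted:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

static int remap_writable_segment(int fd, void *seg_addr, size_t seg_len,
				  long hpage_size)
{
	void *shared, *priv;
	volatile char *p;
	char c;

	/* Stage the segment contents into the hugetlbfs file. */
	shared = mmap(NULL, seg_len, PROT_READ | PROT_WRITE, MAP_SHARED,
		      fd, 0);
	if (shared == MAP_FAILED)
		return -1;
	memcpy(shared, seg_addr, seg_len);
	munmap(shared, seg_len);

	/* Map the same file back over the segment privately, so that
	 * writes take COW faults onto fresh hugepages. */
	priv = mmap(seg_addr, seg_len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_FIXED, fd, 0);
	if (priv == MAP_FAILED)
		return -1;

	/* Prefault each hugepage with a one-byte read and write so the
	 * private copies are instantiated now rather than lazily. */
	for (p = priv; p < (char *)priv + seg_len; p += hpage_size) {
		c = *p;
		*p = c;
	}

	/* Try to drop the now-redundant shared hugepages from the page
	 * cache; if this fails we merely keep using extra hugepages. */
	fsync(fd);
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	return 0;
}

The point of the sketch is only the sequence: MAP_SHARED staging copy,
MAP_PRIVATE remap, one-byte prefault of each hugepage, then the
fsync()/posix_fadvise() attempt to drop the shared pages.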
* Re: Hugepages_Rsvd goes huge in 2.6.20-rc7
From: Nishanth Aravamudan @ 2007-02-06  0:25 UTC
To: linux-mm; +Cc: libhugetlbfs-devel, david, hugh

Sorry, I botched Hugh's e-mail address; please make sure to reply to the
correct one.

Thanks,
Nish

On 05.02.2007 [16:19:04 -0800], Nishanth Aravamudan wrote:
> Hi all,
>
> So, here's the current state of the hugepages portion of my
> /proc/meminfo (x86_64, 2.6.20-rc7; I will test with 2.6.20 shortly,
> but AFAICS there haven't been many changes to the hugepage code
> between the two):
>
> HugePages_Total:   100
> HugePages_Free:    100
> HugePages_Rsvd:    18446744073709551615
> Hugepagesize:     2048 kB
[...]

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: Hugepages_Rsvd goes huge in 2.6.20-rc7
From: Nishanth Aravamudan @ 2007-02-06  0:55 UTC
To: linux-mm; +Cc: libhugetlbfs-devel, david, hugh

On 05.02.2007 [16:25:34 -0800], Nishanth Aravamudan wrote:
> Sorry, I botched Hugh's e-mail address; please make sure to reply to
> the correct one.
>
> On 05.02.2007 [16:19:04 -0800], Nishanth Aravamudan wrote:
> > Hi all,
> >
> > So, here's the current state of the hugepages portion of my
> > /proc/meminfo (x86_64, 2.6.20-rc7; I will test with 2.6.20 shortly,
> > but AFAICS there haven't been many changes to the hugepage code
> > between the two):

Reproduced on 2.6.20, and I think I've got a means to make it more
easily reproducible (at least on x86_64).

Please note: when HugePages_Rsvd goes very large, I can make it return
to 0 by running `make func` and killing it before it gets to the
sharing tests.

So, here's my means of reproducing it (as root, from the libhugetlbfs
root directory [1]):

# make sure everything is clean, hugepages-wise
root@arkanoid# rm -rf /mnt/hugetlbfs/*
# if /proc/meminfo is already screwed up, run `make func` and kill it
# around when you see the mprotect testcase run; that seems to always
# work -- I'll try to be more scientific about this in a bit, to see
# which test causes the value to return to sanity

# run the linkshare testcase once; it will probably die right away
root@arkanoid# HUGETLB_VERBOSE=99 HUGETLB_ELFMAP=y HUGETLB_SHARE=1 LD_LIBRARY_PATH=./obj64 ./tests/obj64/xBDT.linkshare
# you should see the testcase be killed, something like
# "FAIL Child 1 killed by signal: Killed"
root@arkanoid# cat /proc/meminfo
# and a large value in meminfo now

Seems to happen every time I do this :) Note, part of this
reproducibility stems from a small modification to the details I gave
before: before doing the posix_fadvise() call, I now do an fsync() on
the file descriptor. Without the fsync(), it may take one or two
invocations before the test fails, but it still fails eventually in my
experience so far.

Also note that I'm not trying to defend the way I'm approaching this
problem in libhugetlbfs (I'm very open to alternatives) -- but
regardless of what I do there, I don't think Rsvd should be
18446744073709551615 ...

Thanks,
Nish

[1] You'll need the latest development snapshot of libhugetlbfs
(http://libhugetlbfs.ozlabs.org/snapshots/libhugetlbfs-dev-20070129.tar.gz)
as well as the following patch applied on top of it:

diff --git a/elflink.c b/elflink.c
index 5a57358..18926fa 100644
--- a/elflink.c
+++ b/elflink.c
@@ -186,8 +186,8 @@ static char share_path[PATH_MAX+1];
 #define MAX_HTLB_SEGS	2
 
 struct seg_info {
-	void *vaddr;
-	unsigned long filesz, memsz;
+	void *vaddr, *extra_vaddr;
+	unsigned long filesz, memsz, extrasz;
 	int prot;
 	int fd;
 	int phdr;
@@ -497,8 +497,7 @@ static inline int keep_symbol(Elf_Sym *s, void *start, void *end)
  * include these initialized variables in our copy.
  */
-static void get_extracopy(struct seg_info *seg, void **extra_start,
-			  void **extra_end)
+static void get_extracopy(struct seg_info *seg)
 {
 	Elf_Dyn *dyntab;	/* dynamic segment table */
 	Elf_Sym *symtab = NULL;	/* dynamic symbol table */
@@ -511,7 +510,7 @@ static void get_extracopy(struct seg_info *seg, void **extra_start,
 	end_orig = seg->vaddr + seg->memsz;
 	start_orig = seg->vaddr + seg->filesz;
 	if (seg->filesz == seg->memsz)
-		goto bail2;
+		return;
 
 	if (!minimal_copy)
 		goto bail2;
@@ -557,23 +556,20 @@ static void get_extracopy(struct seg_info *seg, void **extra_start,
 
 	if (found_sym) {
 		/* Return the copy window */
-		*extra_start = start;
-		*extra_end = end;
-		return;
-	} else {
-		/* No need to copy anything */
-		*extra_start = start_orig;
-		*extra_end = start_orig;
-		goto bail3;
+		seg->extra_vaddr = start;
+		seg->extrasz = end - start;
 	}
+	/*
+	 * else no need to copy anything, so leave seg->extra_vaddr as
+	 * NULL
+	 */
+	return;
 
 bail:
 	DEBUG("Unable to perform minimal copy\n");
 bail2:
-	*extra_start = start_orig;
-	*extra_end = end_orig;
-bail3:
-	return;
+	seg->extra_vaddr = start_orig;
+	seg->extrasz = end_orig - start_orig;
 }
 
 /*
@@ -584,7 +580,7 @@ bail3:
 static int prepare_segment(struct seg_info *seg)
 {
 	int hpage_size = gethugepagesize();
-	void *p, *extra_start, *extra_end;
+	void *p;
 	unsigned long gap;
 	unsigned long size;
 
@@ -592,9 +588,14 @@ static int prepare_segment(struct seg_info *seg)
 	 * Calculate the BSS size that we must copy in order to minimize
 	 * the size of the shared mapping.
 	 */
-	get_extracopy(seg, &extra_start, &extra_end);
-	size = ALIGN((unsigned long)extra_end - (unsigned long)seg->vaddr,
+	get_extracopy(seg);
+	if (seg->extra_vaddr) {
+		size = ALIGN((unsigned long)seg->extra_vaddr +
+				seg->extrasz - (unsigned long)seg->vaddr,
 				hpage_size);
+	} else {
+		size = ALIGN(seg->filesz, hpage_size);
+	}
 
 	/* Prepare the hugetlbfs file */
@@ -617,11 +618,12 @@ static int prepare_segment(struct seg_info *seg)
 	memcpy(p, seg->vaddr, seg->filesz);
 	DEBUG_CONT("done\n");
 
-	if (extra_end > extra_start) {
+	if (seg->extra_vaddr) {
 		DEBUG("Copying extra %#0lx bytes from %p...",
-		      (unsigned long)(extra_end - extra_start), extra_start);
-		gap = extra_start - (seg->vaddr + seg->filesz);
-		memcpy((p + seg->filesz + gap), extra_start, (extra_end - extra_start));
+		      seg->extrasz, seg->extra_vaddr);
+		gap = seg->extra_vaddr - (seg->vaddr + seg->filesz);
+		memcpy((p + seg->filesz + gap), seg->extra_vaddr,
+		       seg->extrasz);
 		DEBUG_CONT("done\n");
 	}
 
@@ -791,6 +793,7 @@ static void remap_segments(struct seg_info *seg, int num)
 	long hpage_size = gethugepagesize();
 	int i;
 	void *p;
+	char c;
 
 	/*
 	 * XXX: The bogus call to mmap below forces ld.so to resolve the
@@ -829,6 +832,46 @@ static void remap_segments(struct seg_info *seg, int num)
 	/* The segments are all back at this point.
 	 * and it should be safe to reference static data
 	 */
+
+	/*
+	 * This pagecache dropping code should not be used for shared
+	 * segments. But we currently only share read-only segments, so
+	 * the below check for PROT_WRITE is implicitly sufficient.
+	 */
+	for (i = 0; i < num; i++) {
+		if (seg[i].prot & PROT_WRITE) {
+			/*
+			 * take a COW fault on each hugepage in the
+			 * segment's file data ...
+			 */
+			for (p = seg[i].vaddr;
+			     p <= seg[i].vaddr + seg[i].filesz;
+			     p += hpage_size) {
+				memcpy(&c, p, 1);
+				memcpy(p, &c, 1);
+			}
+			/*
+			 * ... as well as each huge page in the
+			 * extracopy area
+			 */
+			if (seg[i].extra_vaddr) {
+				for (p = seg[i].extra_vaddr;
+				     p <= seg[i].extra_vaddr +
+					  seg[i].extrasz;
+				     p += hpage_size) {
+					memcpy(&c, p, 1);
+					memcpy(p, &c, 1);
+				}
+			}
+			/*
+			 * Note: fadvise() failing is not actually an
+			 * error, as we'll just use an extra set of
+			 * hugepages (in the pagecache).
+			 */
+			fsync(seg[i].fd);
+			posix_fadvise(seg[i].fd, 0, 0, POSIX_FADV_DONTNEED);
+		}
+	}
 }
 
 static int check_env(void)

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: Hugepages_Rsvd goes huge in 2.6.20-rc7
From: David Gibson @ 2007-02-06  1:24 UTC
To: Nishanth Aravamudan; +Cc: linux-mm, libhugetlbfs-devel, hugh

On Mon, Feb 05, 2007 at 04:55:47PM -0800, Nishanth Aravamudan wrote:
> Reproduced on 2.6.20, and I think I've got a means to make it more
> easily reproducible (at least on x86_64).
>
> Please note: when HugePages_Rsvd goes very large, I can make it return
> to 0 by running `make func` and killing it before it gets to the
> sharing tests.

Um.. yeah, that may just be because it's reserving some pages, which
rolls the rsvd count back to 0.

> So, here's my means of reproducing it (as root, from the libhugetlbfs
> root directory [1]):
[...]
> Seems to happen every time I do this :) Note, part of this
> reproducibility stems from a small modification to the details I gave
> before: before doing the posix_fadvise() call, I now do an fsync() on
> the file descriptor. Without the fsync(), it may take one or two
> invocations before the test fails, but it still fails eventually in my
> experience so far.
>
> Also note that I'm not trying to defend the way I'm approaching this
> problem in libhugetlbfs (I'm very open to alternatives) -- but
> regardless of what I do there, I don't think Rsvd should be
> 18446744073709551615 ...

Oh, certainly not. Clearly we're managing to decrement it more times
than we're incrementing it somehow. I'd check the codepath for the
madvise() thing, we may not be handling that properly.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
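As an aside, the garbage value itself is consistent with that
explanation: the counter reported as HugePages_Rsvd is an unsigned long,
so one unbalanced decrement past zero reads back as 2^64 - 1 on a 64-bit
box. A trivial userspace illustration (plain C, not kernel code):

#include <stdio.h>

int main(void)
{
	/* One more decrement than increment wraps an unsigned counter
	 * past zero, just like the reserved-hugepage count here. */
	unsigned long rsvd = 0;

	rsvd--;
	printf("%lu\n", rsvd);	/* prints 18446744073709551615 on 64-bit */
	return 0;
}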
* Re: [Libhugetlbfs-devel] Hugepages_Rsvd goes huge in 2.6.20-rc7
From: Nishanth Aravamudan @ 2007-02-07  1:05 UTC
To: linux-mm, libhugetlbfs-devel, hugh

On 06.02.2007 [12:24:42 +1100], David Gibson wrote:
> On Mon, Feb 05, 2007 at 04:55:47PM -0800, Nishanth Aravamudan wrote:
> > Reproduced on 2.6.20, and I think I've got a means to make it more
> > easily reproducible (at least on x86_64).
<snip>
> > Also note that I'm not trying to defend the way I'm approaching this
> > problem in libhugetlbfs (I'm very open to alternatives) -- but
> > regardless of what I do there, I don't think Rsvd should be
> > 18446744073709551615 ...
>
> Oh, certainly not. Clearly we're managing to decrement it more times
> than we're incrementing it somehow. I'd check the codepath for the
> madvise() thing, we may not be handling that properly.

FYI, Ken Chen's patch fixes the problem for me:

http://marc2.theaimsgroup.com/?l=linux-mm&m=117079608820399&w=2

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center