linux-mm.kvack.org archive mirror
* Hugepages_Rsvd goes huge in 2.6.20-rc7
@ 2007-02-06  0:19 Nishanth Aravamudan
  2007-02-06  0:25 ` Nishanth Aravamudan
  0 siblings, 1 reply; 5+ messages in thread
From: Nishanth Aravamudan @ 2007-02-06  0:19 UTC (permalink / raw)
  To: linux-mm; +Cc: libhugetlbfs-devel, david, hugh

Hi all,

So, here's the current state of the hugepages portion of my
/proc/meminfo (x86_64, 2.6.20-rc7, will test with 2.6.20 shortly, but
AFAICS, there haven't been many changes to hugepage code between the
two):

HugePages_Total:   100
HugePages_Free:    100
HugePages_Rsvd:  18446744073709551615
Hugepagesize:     2048 kB

That's not good :)
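
For reference, 18446744073709551615 is (unsigned long)-1, i.e. 2^64 - 1,
which suggests the reserve counter has been decremented below zero and
wrapped around. A trivial userspace illustration of the wraparound
(nothing to do with the actual kernel code):

#include <stdio.h>

int main(void)
{
        unsigned long rsvd = 0;

        rsvd--;                                 /* one decrement too many */
        printf("HugePages_Rsvd:  %lu\n", rsvd); /* 18446744073709551615 on x86_64 */
        return 0;
}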

Context: I'm currently working on some patches for libhugetlbfs which
should ultimately help us reduce our hugepage usage when remapping
segments so they are backed by hugepages. The current algorithm maps in
the hugepage file MAP_SHARED, copies over the segment data, then unmaps
the file. It then unmaps the program's segments and maps in the same
hugepage file MAP_PRIVATE, so that we take COW faults. Now, the problem
is that for writable segments (data) the COW fault instantiates a new
hugepage, but the original MAP_SHARED hugepage stays resident in the
page cache. So a program that could survive (after the initial
remapping) with only 2 hugepages in use ends up using 3 hugepages
instead.
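
In heavily simplified C (illustrative only -- not the actual
libhugetlbfs code, with all error handling and seg_info plumbing
omitted), the current scheme looks roughly like:

#include <string.h>
#include <sys/mman.h>

/* Sketch: stage a program segment into a hugetlbfs file, then remap it
 * over the original segment as MAP_PRIVATE so writes take COW faults. */
static void remap_segment_sketch(int hfd, void *vaddr, size_t filesz,
                                 size_t mapsize)
{
        /* Copy the segment contents into the hugepage file through a
         * temporary MAP_SHARED mapping. */
        void *p = mmap(NULL, mapsize, PROT_READ | PROT_WRITE,
                       MAP_SHARED, hfd, 0);
        memcpy(p, vaddr, filesz);
        munmap(p, mapsize);

        /* Drop the original segment and map the same hugepage file over
         * it MAP_PRIVATE; the first write to each hugepage now COWs. */
        munmap(vaddr, mapsize);
        mmap(vaddr, mapsize, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_FIXED, hfd, 0);
}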

To work around this, I've modified the algorithm to prefault in the
writable segment in the remapping code (via a one-byte read and write
per hugepage). Then I issue a posix_fadvise(segment_fd, 0, 0,
POSIX_FADV_DONTNEED) to try and drop the shared hugepage from the page
cache. With a small dummy relinked app (that just sleeps), this does
reduce our run-time hugepage cost from 3 to 2. But I'm noticing that
libhugetlbfs' `make func` target, which runs only the libhugetlbfs
functional tests, every so often leads to a lot of "VM killing
process ..." messages. This only appears to happen with one particular
testcase (xBDT.linkshare, which remaps the BSS, data and text segments
and tries to share the text segment between 2 processes), but when it
does, it happens for a while (that is, if I try to run that particular
test manually, it keeps getting killed) and /proc/meminfo reports a
garbage value for HugePages_Rsvd like the one listed above. If I rerun
`make func`, sometimes the problem goes away (and Rsvd returns to a
sane value as well...).
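
In sketch form (illustrative only -- the real code walks the seg_info
array and only touches writable segments), the workaround is:

#include <fcntl.h>
#include <stddef.h>

/* Sketch: touch one byte per hugepage of a writable segment so the
 * MAP_PRIVATE mapping takes its COW faults up front, then ask the
 * kernel to drop the now-unneeded shared pagecache copy. */
static void prefault_and_drop_sketch(int fd, char *vaddr, size_t filesz,
                                     long hpage_size)
{
        char c, *p;

        for (p = vaddr; p < vaddr + filesz; p += hpage_size) {
                c = *p;         /* read ... */
                *p = c;         /* ... and write back -> COW fault */
        }

        /* Best effort: if this fails, we merely keep the extra
         * hugepages in the page cache. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}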

I've added Hugh & David to the Cc, because they discussed a similar
problem a few months back. Maybe there is still a race somewhere?

I'm willing to test any possible fixes, and I'll work on making this
more easily reproducible (although it seems to happen pretty regularly
here) with a simpler test.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Hugepages_Rsvd goes huge in 2.6.20-rc7
  2007-02-06  0:19 Hugepages_Rsvd goes huge in 2.6.20-rc7 Nishanth Aravamudan
@ 2007-02-06  0:25 ` Nishanth Aravamudan
  2007-02-06  0:55   ` Nishanth Aravamudan
  0 siblings, 1 reply; 5+ messages in thread
From: Nishanth Aravamudan @ 2007-02-06  0:25 UTC (permalink / raw)
  To: linux-mm; +Cc: libhugetlbfs-devel, david, hugh

Sorry, I botched Hugh's e-mail address, please make sure to reply to the
correct one.

Thanks,
Nish

On 05.02.2007 [16:19:04 -0800], Nishanth Aravamudan wrote:
> Hi all,
> 
> So, here's the current state of the hugepages portion of my
> /proc/meminfo (x86_64, 2.6.20-rc7, will test with 2.6.20 shortly, but
> AFAICS, there haven't been many changes to hugepage code between the
> two):
> 
> HugePages_Total:   100
> HugePages_Free:    100
> HugePages_Rsvd:  18446744073709551615
> Hugepagesize:     2048 kB
> 
> That's not good :)
> 
> Context: I'm currently working on some patches for libhugetlbfs which
> should ultimately help us reduce our hugepage usage when remapping
> segments so they are backed by hugepages. The current algorithm maps in
> the hugepage file MAP_SHARED, copies over the segment data, then unmaps
> the file. It then unmaps the program's segments and maps in the same
> hugepage file MAP_PRIVATE, so that we take COW faults. Now, the problem
> is that for writable segments (data) the COW fault instantiates a new
> hugepage, but the original MAP_SHARED hugepage stays resident in the
> page cache. So a program that could survive (after the initial
> remapping) with only 2 hugepages in use ends up using 3 hugepages
> instead.
> 
> To work around this, I've modified the algorithm to prefault in the
> writable segment in the remapping code (via a one-byte read and write
> per hugepage). Then I issue a posix_fadvise(segment_fd, 0, 0,
> POSIX_FADV_DONTNEED) to try and drop the shared hugepage from the page
> cache. With a small dummy relinked app (that just sleeps), this does
> reduce our run-time hugepage cost from 3 to 2. But I'm noticing that
> libhugetlbfs' `make func` target, which runs only the libhugetlbfs
> functional tests, every so often leads to a lot of "VM killing
> process ..." messages. This only appears to happen with one particular
> testcase (xBDT.linkshare, which remaps the BSS, data and text segments
> and tries to share the text segment between 2 processes), but when it
> does, it happens for a while (that is, if I try to run that particular
> test manually, it keeps getting killed) and /proc/meminfo reports a
> garbage value for HugePages_Rsvd like the one listed above. If I rerun
> `make func`, sometimes the problem goes away (and Rsvd returns to a
> sane value as well...).
> 
> I've added Hugh & David to the Cc, because they discussed a similar
> problem a few months back. Maybe there is still a race somewhere?
> 
> I'm willing to test any possible fixes, and I'll work on making this
> more easily reproducible (although it seems to happen pretty regularly
> here) with a simpler test.
> 
> Thanks,
> Nish
> 
> -- 
> Nishanth Aravamudan <nacc@us.ibm.com>
> IBM Linux Technology Center

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Hugepages_Rsvd goes huge in 2.6.20-rc7
  2007-02-06  0:25 ` Nishanth Aravamudan
@ 2007-02-06  0:55   ` Nishanth Aravamudan
  2007-02-06  1:24     ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Nishanth Aravamudan @ 2007-02-06  0:55 UTC (permalink / raw)
  To: linux-mm; +Cc: libhugetlbfs-devel, david, hugh

On 05.02.2007 [16:25:34 -0800], Nishanth Aravamudan wrote:
> Sorry, I botched Hugh's e-mail address, please make sure to reply to the
> correct one.
> 
> Thanks,
> Nish
> 
> On 05.02.2007 [16:19:04 -0800], Nishanth Aravamudan wrote:
> > Hi all,
> > 
> > So, here's the current state of the hugepages portion of my
> > /proc/meminfo (x86_64, 2.6.20-rc7, will test with 2.6.20 shortly,
> > but AFAICS, there haven't been many changes to hugepage code between
> > the two):

Reproduced on 2.6.20, and I think I've got a means to make it more
easily reproducible (at least on x86_64).

Please note: I found that when HugePages_Rsvd goes very large, I can
make it return to 0 by running `make func` but killing it before it
gets to the sharing tests.

So, here's my means of reproducing it (as root, from the libhugetlbfs
root directory [1]):

# make sure everything is clean, hugepages wise
root@arkanoid# rm -rf /mnt/hugetlbfs/*
# if /proc/meminfo is already screwed up, run `make func` and kill it
# around when you see the mprotect testcase run, that seems to always
# work -- I'll try to be more scientific on this in a bit, to see which
# test causes the value to return to sanity

# run the linkshare testcase once, probably will die right away
root@arkanoid# HUGETLB_VERBOSE=99 HUGETLB_ELFMAP=y HUGETLB_SHARE=1 LD_LIBRARY_PATH=./obj64 ./tests/obj64/xBDT.linkshare
# you should see the testcase be killed, something like
# "FAIL    Child 1 killed by signal: Killed"
root@arkanoid# cat /proc/meminfo
# and a large value in meminfo now

Seems to happen every time I do this :) Note that part of this
reproducibility stems from a small modification to the details I gave
before: before doing the posix_fadvise() call, I now do an fsync() on
the file descriptor. Without the fsync(), it may take one or two
invocations before the test fails, but in my experience so far it
still fails eventually.
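
In other words, for each writable segment the tail of the remapping
path now ends with (exactly as in the patch below):

        fsync(seg[i].fd);
        posix_fadvise(seg[i].fd, 0, 0, POSIX_FADV_DONTNEED);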

Also note that I'm not trying to defend the way I'm approaching this
problem in libhugetlbfs (I'm very open to alternatives) -- but
regardless of what I do there, I don't think Rsvd should be
18446744073709551615 ...

Thanks,
Nish

[1]

You'll need the latest development snapshot of libhugetlbfs
(http://libhugetlbfs.ozlabs.org/snapshots/libhugetlbfs-dev-20070129.tar.gz)
as well as the following patch applied on top of it:

diff --git a/elflink.c b/elflink.c
index 5a57358..18926fa 100644
--- a/elflink.c
+++ b/elflink.c
@@ -186,8 +186,8 @@ static char share_path[PATH_MAX+1];
 #define MAX_HTLB_SEGS	2
 
 struct seg_info {
-	void *vaddr;
-	unsigned long filesz, memsz;
+	void *vaddr, *extra_vaddr;
+	unsigned long filesz, memsz, extrasz;
 	int prot;
 	int fd;
 	int phdr;
@@ -497,8 +497,7 @@ static inline int keep_symbol(Elf_Sym *s, void *start, void *end)
  * include these initialized variables in our copy.
  */
 
-static void get_extracopy(struct seg_info *seg, void **extra_start,
-							void **extra_end)
+static void get_extracopy(struct seg_info *seg)
 {
 	Elf_Dyn *dyntab;        /* dynamic segment table */
 	Elf_Sym *symtab = NULL; /* dynamic symbol table */
@@ -511,7 +510,7 @@ static void get_extracopy(struct seg_info *seg, void **extra_start,
 	end_orig = seg->vaddr + seg->memsz;
 	start_orig = seg->vaddr + seg->filesz;
 	if (seg->filesz == seg->memsz)
-		goto bail2;
+		return;
 	if (!minimal_copy)
 		goto bail2;
 
@@ -557,23 +556,20 @@ static void get_extracopy(struct seg_info *seg, void **extra_start,
 
 	if (found_sym) {
 		/* Return the copy window */
-		*extra_start = start;
-		*extra_end = end;
-		return;
-	} else {
-		/* No need to copy anything */
-		*extra_start = start_orig;
-		*extra_end = start_orig;
-		goto bail3;
+		seg->extra_vaddr = start;
+		seg->extrasz = end - start;
 	}
+	/*
+	 * else no need to copy anything, so leave seg->extra_vaddr as
+	 * NULL
+	 */
+	return;
 
 bail:
 	DEBUG("Unable to perform minimal copy\n");
 bail2:
-	*extra_start = start_orig;
-	*extra_end = end_orig;
-bail3:
-	return;
+	seg->extra_vaddr = start_orig;
+	seg->extrasz = end_orig - start_orig;
 }
 
 /*
@@ -584,7 +580,7 @@ bail3:
 static int prepare_segment(struct seg_info *seg)
 {
 	int hpage_size = gethugepagesize();
-	void *p, *extra_start, *extra_end;
+	void *p;
 	unsigned long gap;
 	unsigned long size;
 
@@ -592,9 +588,14 @@ static int prepare_segment(struct seg_info *seg)
 	 * Calculate the BSS size that we must copy in order to minimize
 	 * the size of the shared mapping.
 	 */
-	get_extracopy(seg, &extra_start, &extra_end);
-	size = ALIGN((unsigned long)extra_end - (unsigned long)seg->vaddr,
+	get_extracopy(seg);
+	if (seg->extra_vaddr) {
+		size = ALIGN((unsigned long)seg->extra_vaddr +
+				seg->extrasz - (unsigned long)seg->vaddr,
 				hpage_size);
+	} else {
+		size = ALIGN(seg->filesz, hpage_size);
+	}
 
 	/* Prepare the hugetlbfs file */
 
@@ -617,11 +618,12 @@ static int prepare_segment(struct seg_info *seg)
 	memcpy(p, seg->vaddr, seg->filesz);
 	DEBUG_CONT("done\n");
 
-	if (extra_end > extra_start) {
+	if (seg->extra_vaddr) {
 		DEBUG("Copying extra %#0lx bytes from %p...",
-			(unsigned long)(extra_end - extra_start), extra_start);
-		gap = extra_start - (seg->vaddr + seg->filesz);
-		memcpy((p + seg->filesz + gap), extra_start, (extra_end - extra_start));
+					seg->extrasz, seg->extra_vaddr);
+		gap = seg->extra_vaddr - (seg->vaddr + seg->filesz);
+		memcpy((p + seg->filesz + gap), seg->extra_vaddr,
+							seg->extrasz);
 		DEBUG_CONT("done\n");
 	}
 
@@ -791,6 +793,7 @@ static void remap_segments(struct seg_info *seg, int num)
 	long hpage_size = gethugepagesize();
 	int i;
 	void *p;
+	char c;
 
 	/*
 	 * XXX: The bogus call to mmap below forces ld.so to resolve the
@@ -829,6 +832,46 @@ static void remap_segments(struct seg_info *seg, int num)
 	/* The segments are all back at this point.
 	 * and it should be safe to reference static data
 	 */
+
+	/*
+	 * This pagecache dropping code should not be used for shared
+	 * segments.  But we currently only share read-only segments, so
+	 * the below check for PROT_WRITE is implicitly sufficient.
+	 */
+	for (i = 0; i < num; i++) {
+		if (seg[i].prot & PROT_WRITE) {
+			/*
+			 * take a COW fault on each hugepage in the
+			 * segment's file data ...
+			 */
+			for (p = seg[i].vaddr;
+			     p <= seg[i].vaddr + seg[i].filesz;
+			     p += hpage_size) {
+				memcpy(&c, p, 1);
+				memcpy(p, &c, 1);
+			}
+			/*
+			 * ... as well as each huge page in the
+			 * extracopy area
+			 */
+			if (seg[i].extra_vaddr) {
+				for (p = seg[i].extra_vaddr;
+				     p <= seg[i].extra_vaddr +
+							seg[i].extrasz;
+				     p += hpage_size) {
+					memcpy(&c, p, 1);
+					memcpy(p, &c, 1);
+				}
+			}
+			/*
+			 * Note: fadvise() failing is not actually an
+			 * error, as we'll just use an extra set of
+			 * hugepages (in the pagecache).
+			 */
+			fsync(seg[i].fd);
+			posix_fadvise(seg[i].fd, 0, 0, POSIX_FADV_DONTNEED);
+		}
+	}
 }
 
 static int check_env(void)

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: Hugepages_Rsvd goes huge in 2.6.20-rc7
  2007-02-06  0:55   ` Nishanth Aravamudan
@ 2007-02-06  1:24     ` David Gibson
  2007-02-07  1:05       ` [Libhugetlbfs-devel] " Nishanth Aravamudan
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2007-02-06  1:24 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, libhugetlbfs-devel, hugh

On Mon, Feb 05, 2007 at 04:55:47PM -0800, Nishanth Aravamudan wrote:
> On 05.02.2007 [16:25:34 -0800], Nishanth Aravamudan wrote:
> > Sorry, I botched Hugh's e-mail address, please make sure to reply to the
> > correct one.
> > 
> > Thanks,
> > Nish
> > 
> > On 05.02.2007 [16:19:04 -0800], Nishanth Aravamudan wrote:
> > > Hi all,
> > > 
> > > So, here's the current state of the hugepages portion of my
> > > /proc/meminfo (x86_64, 2.6.20-rc7, will test with 2.6.20 shortly,
> > > but AFAICS, there haven't been many changes to hugepage code between
> > > the two):
> 
> Reproduced on 2.6.20, and I think I've got a means to make it more
> easily reproducible (at least on x86_64).
> 
> Please note: I found that when HugePages_Rsvd goes very large, I can
> make it return to 0 by running `make func` but killing it before it
> gets to the sharing tests.

Um.. yeah, that may just be because it's reserving some pages, which
rolls the rsvd count back to 0.

> So, here's my means of reproducing it (as root, from the libhugetlbfs
> root directory [1]):
> 
> # make sure everything is clean, hugepages wise
> root@arkanoid# rm -rf /mnt/hugetlbfs/*
> # if /proc/meminfo is already screwed up, run `make func` and kill it
> # around when you see the mprotect testcase run, that seems to always
> # work -- I'll try to be more scientific on this in a bit, to see which
> # test causes the value to return to sanity
> 
> # run the linkshare testcase once, probably will die right away
> root@arkanoid# HUGETLB_VERBOSE=99 HUGETLB_ELFMAP=y HUGETLB_SHARE=1 LD_LIBRARY_PATH=./obj64 ./tests/obj64/xBDT.linkshare
> # you should see the testcase be killed, something like
> # "FAIL    Child 1 killed by signal: Killed"
> root@arkanoid# cat /proc/meminfo
> # and a large value in meminfo now
> 
> Seems to happen every time I do this :) Note that part of this
> reproducibility stems from a small modification to the details I gave
> before: before doing the posix_fadvise() call, I now do an fsync() on
> the file descriptor. Without the fsync(), it may take one or two
> invocations before the test fails, but in my experience so far it
> still fails eventually.
> 
> Also note that I'm not trying to defend the way I'm approaching this
> problem in libhugetlbfs (I'm very open to alternatives) -- but
> regardless of what I do there, I don't think Rsvd should be
> 18446744073709551615 ...

Oh, certainly not.  Clearly we're managing to decrement it more times
than we're incrementing it somehow.  I'd check the codepath for the
madvise() thing, we may not be handling that properly.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Libhugetlbfs-devel] Hugepages_Rsvd goes huge in 2.6.20-rc7
  2007-02-06  1:24     ` David Gibson
@ 2007-02-07  1:05       ` Nishanth Aravamudan
  0 siblings, 0 replies; 5+ messages in thread
From: Nishanth Aravamudan @ 2007-02-07  1:05 UTC (permalink / raw)
  To: linux-mm, libhugetlbfs-devel, hugh

On 06.02.2007 [12:24:42 +1100], David Gibson wrote:
> On Mon, Feb 05, 2007 at 04:55:47PM -0800, Nishanth Aravamudan wrote:
> > On 05.02.2007 [16:25:34 -0800], Nishanth Aravamudan wrote:
> > > Sorry, I botched Hugh's e-mail address, please make sure to reply to the
> > > correct one.
> > > 
> > > Thanks,
> > > Nish
> > > 
> > > On 05.02.2007 [16:19:04 -0800], Nishanth Aravamudan wrote:
> > > > Hi all,
> > > > 
> > > > So, here's the current state of the hugepages portion of my
> > > > /proc/meminfo (x86_64, 2.6.20-rc7, will test with 2.6.20
> > > > shortly, but AFAICS, there haven't been many changes to hugepage
> > > > code between the two):
> > 
> > Reproduced on 2.6.20, and I think I've got a means to make it more
> > easily reproducible (at least on x86_64).
<snip>
> > Also note that I'm not trying to defend the way I'm approaching
> > this problem in libhugetlbfs (I'm very open to alternatives) -- but
> > regardless of what I do there, I don't think Rsvd should be
> > 18446744073709551615 ...
> 
> Oh, certainly not.  Clearly we're managing to decrement it more times
> than we're incrementing it somehow.  I'd check the codepath for the
> madvise() thing, we may not be handling that properly.

FYI, Ken Chen's patch fixes the problem for me:

http://marc2.theaimsgroup.com/?l=linux-mm&m=117079608820399&w=2

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-02-07  1:04 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-06  0:19 Hugepages_Rsvd goes huge in 2.6.20-rc7 Nishanth Aravamudan
2007-02-06  0:25 ` Nishanth Aravamudan
2007-02-06  0:55   ` Nishanth Aravamudan
2007-02-06  1:24     ` David Gibson
2007-02-07  1:05       ` [Libhugetlbfs-devel] " Nishanth Aravamudan
