* [RFC] libibverbs: ibv_fork_init() and libhugetlbfs
@ 2010-05-06 7:39 Alexander Schmidt
2010-05-06 20:55 ` Roland Dreier
0 siblings, 1 reply; 5+ messages in thread
From: Alexander Schmidt @ 2010-05-06 7:39 UTC (permalink / raw)
To: Roland Dreier, of-ewg, Linux RDMA
Cc: Hoang-Nam Nguyen, Stefan Roscher, Joachim Fenkes,
Christoph Raisch
Hi all,
we are trying to make use of libhugetlbfs in an application that relies on
ibv_fork_init() to enable fork() support. The problem we are running into is
that calls to the madvise system call fail when registering a memory region
for memory that is provided by libhugetlbfs. We have written a preliminary
fix (see below) for this and are looking for comments / feedback to get an
acceptable solution.
When fork support is enabled in libibverbs, madvise() is called for every
memory page that is registered as a memory region. Memory ranges that
are passed to madvise() must be page aligned and the size must be a
multiple of the page size. libibverbs uses sysconf(_SC_PAGESIZE) to find
out the system page size and rounds all ranges passed to reg_mr() according
to this page size. When memory from libhugetlbfs is passed to reg_mr(), this
does not work as the page size for this memory range might be different
(e.g. 16Mb). So libibverbs would have to use the huge page size to
calculate a page aligned range for madvise.
As huge pages are provided to the application "under the hood" when
preloading libhugetlbfs, the application does not have any knowledge about
when it registers a huge page or a usual page.
The patch below demonstrates a possible solution for this. It parses the
/proc/PID/maps file when registering a memory region and decides if the
memory that is to be registered is part of a libhugetlbfs range or not. If so,
a page size of 16Mb is used to align the memory range passed to madvise().
We see two problems with this: it is not a very elegant solution to parse the
procfs file and the 16Mb are hardcoded currently. The latter point could be
solved by calling gethugepagesize() from libhugetlbfs, which would add a new
dependency to libibverbs.
We are highly interested in reviews, comments, suggestions to get this solved
soon. Thanks!
Signed-off-by: Alexander Schmidt <alexs-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
src/memory.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)
--- libibverbs-1.1.2.orig/src/memory.c
+++ libibverbs-1.1.2/src/memory.c
@@ -40,6 +40,8 @@
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
#include "ibverbs.h"
@@ -54,6 +56,8 @@
#define MADV_DOFORK 11
#endif
+#define HUGE_PAGE_SIZE (16 * 1024 * 1024)
+
struct ibv_mem_node {
enum {
IBV_RED,
@@ -446,6 +450,48 @@ static struct ibv_mem_node *__mm_find_st
return node;
}
+static void get_range_address(uintptr_t *start, uintptr_t *end, void *base, size_t size)
+{
+ pid_t pid;
+ FILE *file;
+ char buf[1024], lib[128];
+ int range_page_size = page_size;
+
+ pid = getpid();
+ snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+
+ file = fopen(buf, "r");
+ if (!file)
+ goto out;
+
+ while (fgets(buf, sizeof(buf), file) != NULL) {
+ int n;
+ char *substr;
+ uintptr_t range_start, range_end;
+
+ n = sscanf(buf, "%lx-%lx %*s %*x %*s %*u %127s",
+ &range_start, &range_end, &lib);
+
+ if (n < 3)
+ continue;
+
+ substr = strstr(lib, "libhugetlbfs");
+ if (substr) {
+ if ((uintptr_t) base >= range_start &&
+ (uintptr_t) base < range_end) {
+ range_page_size = HUGE_PAGE_SIZE;
+ break;
+ }
+ }
+ }
+ fclose(file);
+
+out:
+ *start = (uintptr_t) base & ~(range_page_size - 1);
+ *end = ((uintptr_t) (base + size + range_page_size - 1) &
+ ~(range_page_size - 1)) - 1;
+}
+
static int ibv_madvise_range(void *base, size_t size, int advice)
{
uintptr_t start, end;
@@ -458,9 +504,7 @@ static int ibv_madvise_range(void *base,
inc = advice == MADV_DONTFORK ? 1 : -1;
- start = (uintptr_t) base & ~(page_size - 1);
- end = ((uintptr_t) (base + size + page_size - 1) &
- ~(page_size - 1)) - 1;
+ get_range_address(&start, &end, base, size);
pthread_mutex_lock(&mm_mutex);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [RFC] libibverbs: ibv_fork_init() and libhugetlbfs 2010-05-06 7:39 [RFC] libibverbs: ibv_fork_init() and libhugetlbfs Alexander Schmidt @ 2010-05-06 20:55 ` Roland Dreier [not found] ` <adavdb092d8.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> 0 siblings, 1 reply; 5+ messages in thread From: Roland Dreier @ 2010-05-06 20:55 UTC (permalink / raw) To: Alexander Schmidt Cc: of-ewg, Linux RDMA, Hoang-Nam Nguyen, Stefan Roscher, Joachim Fenkes, Christoph Raisch, Alex Vainman > When fork support is enabled in libibverbs, madvise() is called for every > memory page that is registered as a memory region. Memory ranges that > are passed to madvise() must be page aligned and the size must be a > multiple of the page size. libibverbs uses sysconf(_SC_PAGESIZE) to find > out the system page size and rounds all ranges passed to reg_mr() according > to this page size. When memory from libhugetlbfs is passed to reg_mr(), this > does not work as the page size for this memory range might be different > (e.g. 16Mb). So libibverbs would have to use the huge page size to > calculate a page aligned range for madvise. Yes, Alex Vainman reaised this same issue a while ago. > The patch below demonstrates a possible solution for this. It parses the > /proc/PID/maps file when registering a memory region and decides if the > memory that is to be registered is part of a libhugetlbfs range or not. If so, > a page size of 16Mb is used to align the memory range passed to madvise(). > > We see two problems with this: it is not a very elegant solution to parse the > procfs file and the 16Mb are hardcoded currently. The latter point could be > solved by calling gethugepagesize() from libhugetlbfs, which would add a new > dependency to libibverbs. I think that we cannot assume huge pages only come from libhugetlbfs -- we should support an application directly enabling huge pages (possibly via another library too, so we can't assume that an application knows the page size for a memory range it is about to register). And also the 16 MB page size constant is of course not feasible -- with all due respect, the x86 page size of 2 MB is much more likely in practice :) (Although perhaps the much slower PowerPC TLB refill makes users more likely to try and use hugetlb pages ;) Alex suggested parsing files in the same way as libhugetlbfs does to get the page size, and that seems to be the best solution, since I don't think the libhugetlbfs license is compatible with the BSD license for libibverbs. But your trick of using /proc/*/maps looks nice. Does that only work for libhugetlbfs or can we recognize direct mmap of hugetlb pages? - R. -- Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <adavdb092d8.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]
* Re: [RFC] libibverbs: ibv_fork_init() and libhugetlbfs [not found] ` <adavdb092d8.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> @ 2010-05-07 10:19 ` Alexander Schmidt 2010-05-12 16:40 ` Roland Dreier 0 siblings, 1 reply; 5+ messages in thread From: Alexander Schmidt @ 2010-05-07 10:19 UTC (permalink / raw) To: Roland Dreier Cc: of-ewg, Linux RDMA, Hoang-Nam Nguyen, Stefan Roscher, Joachim Fenkes, Christoph Raisch, Alex Vainman On Thu, 06 May 2010 13:55:31 -0700 Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote: > I think that we cannot assume huge pages only come from libhugetlbfs -- > we should support an application directly enabling huge pages (possibly > via another library too, so we can't assume that an application knows > the page size for a memory range it is about to register). > > And also the 16 MB page size constant is of course not feasible -- with > all due respect, the x86 page size of 2 MB is much more likely in > practice :) (Although perhaps the much slower PowerPC TLB refill makes > users more likely to try and use hugetlb pages ;) > > Alex suggested parsing files in the same way as libhugetlbfs does to get > the page size, and that seems to be the best solution, since I don't > think the libhugetlbfs license is compatible with the BSD license for > libibverbs. > > But your trick of using /proc/*/maps looks nice. Does that only work > for libhugetlbfs or can we recognize direct mmap of hugetlb pages? Hi Roland, thanks for your comments! I've reworked my patch: * added get_huge_page_size() to read the huge page size from /proc/meminfo. This is done at ibv_fork_init() time. * I noticed that some applications like ibv_rc_pingpong already get memory from libhugetlbfs when running ibv_fork_init(). So I changed the code for testing madvise() to allocate a huge page if the huge page size is set in the system. I have not tested this code with different libraries providing huge pages / mmaped pages yet, but I hope this can be added later on when we have agreed on an approach to handle huge pages. Signed-off-by: Alexander Schmidt <alexs-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> --- src/memory.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 95 insertions(+), 8 deletions(-) --- libibverbs-1.1.2.orig/src/memory.c +++ libibverbs-1.1.2/src/memory.c @@ -40,6 +40,8 @@ #include <unistd.h> #include <stdlib.h> #include <stdint.h> +#include <stdio.h> +#include <string.h> #include "ibverbs.h" @@ -68,12 +70,45 @@ struct ibv_mem_node { static struct ibv_mem_node *mm_root; static pthread_mutex_t mm_mutex = PTHREAD_MUTEX_INITIALIZER; static int page_size; +static int huge_page_size; static int too_late; +static int get_huge_page_size(void) +{ + int ret = -1; + FILE *file; + char *path = "/proc/meminfo"; + char buf[1024], type[128]; + + file = fopen(path, "r"); + if (!file) + goto out; + + while (fgets(buf, sizeof(buf), file) != NULL) { + int n; + unsigned long size; + + n = sscanf(buf, "%127s %lu %*s", &type, &size); + + if (n < 2) + continue; + + if (!strcmp(type, "Hugepagesize:")) { + /* huge page size is printed in Kb */ + ret = size * 1024; + break; + } + } + fclose(file); + +out: + return ret; +} + int ibv_fork_init(void) { void *tmp; - int ret; + int ret, size; if (mm_root) return 0; @@ -85,11 +120,18 @@ int ibv_fork_init(void) if (page_size < 0) return errno; - if (posix_memalign(&tmp, page_size, page_size)) + huge_page_size = get_huge_page_size(); + + if (huge_page_size > page_size) + size = huge_page_size; + else + size = page_size; + + if (posix_memalign(&tmp, size, size)) return ENOMEM; - ret = madvise(tmp, page_size, MADV_DONTFORK) || - madvise(tmp, page_size, MADV_DOFORK); + ret = madvise(tmp, size, MADV_DONTFORK) || + madvise(tmp, size, MADV_DOFORK); free(tmp); @@ -446,11 +488,51 @@ static struct ibv_mem_node *__mm_find_st return node; } +static int is_huge_page(void *base) +{ + int ret = 0; + pid_t pid; + FILE *file; + char buf[1024], lib[128]; + + pid = getpid(); + snprintf(buf, sizeof(buf), "/proc/%d/maps", pid); + + file = fopen(buf, "r"); + if (!file) + goto out; + + while (fgets(buf, sizeof(buf), file) != NULL) { + int n; + char *substr; + uintptr_t range_start, range_end; + + n = sscanf(buf, "%lx-%lx %*s %*x %*s %*u %127s", + &range_start, &range_end, &lib); + + if (n < 3) + continue; + + substr = strstr(lib, "libhugetlbfs"); + if (substr) { + if ((uintptr_t) base >= range_start && + (uintptr_t) base < range_end) { + ret = 1; + break; + } + } + } + fclose(file); + +out: + return ret; +} + static int ibv_madvise_range(void *base, size_t size, int advice) { uintptr_t start, end; struct ibv_mem_node *node, *tmp; - int inc; + int inc, range_page_size; int ret = 0; if (!size) @@ -458,9 +540,14 @@ static int ibv_madvise_range(void *base, inc = advice == MADV_DONTFORK ? 1 : -1; - start = (uintptr_t) base & ~(page_size - 1); - end = ((uintptr_t) (base + size + page_size - 1) & - ~(page_size - 1)) - 1; + if (huge_page_size > page_size && is_huge_page(base)) + range_page_size = huge_page_size; + else + range_page_size = page_size; + + start = (uintptr_t) base & ~(range_page_size - 1); + end = ((uintptr_t) (base + size + range_page_size - 1) & + ~(range_page_size - 1)) - 1; pthread_mutex_lock(&mm_mutex); -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] libibverbs: ibv_fork_init() and libhugetlbfs 2010-05-07 10:19 ` Alexander Schmidt @ 2010-05-12 16:40 ` Roland Dreier [not found] ` <adafx1xul8v.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> 0 siblings, 1 reply; 5+ messages in thread From: Roland Dreier @ 2010-05-12 16:40 UTC (permalink / raw) To: Alexander Schmidt Cc: of-ewg, Linux RDMA, Hoang-Nam Nguyen, Stefan Roscher, Joachim Fenkes, Christoph Raisch, Alex Vainman > * added get_huge_page_size() to read the huge page size from > /proc/meminfo. This is done at ibv_fork_init() time. That doesn't work on systems that have multiple huge page sizes (eg powerpc). You can compare the code to get the size in libhugetlbfs. Also I think the munging through /proc/pid/maps doesn't really work. First of all, essentially grepping for libhugetlbfs is not robust as I mentioned -- this will break in the same way for apps using huge pages directly. And while it is nice to be able to tell if a range came from libhugetlbfs, I think there may be some bad performance impact from reading /proc/pid/maps line-by-line. (And by the way, as a trivial optimization, it would make sense to me to check the address of each map before doing the strstr). - R. -- Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <adafx1xul8v.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]
* Re: [ewg] [RFC] libibverbs: ibv_fork_init() and libhugetlbfs [not found] ` <adafx1xul8v.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> @ 2010-05-18 11:04 ` Stefan Roscher 0 siblings, 0 replies; 5+ messages in thread From: Stefan Roscher @ 2010-05-18 11:04 UTC (permalink / raw) To: ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5 Cc: Roland Dreier, Alexander Schmidt, Linux RDMA, Joachim Fenkes, Alex Vainman, Christoph Raisch, Hoang-Nam Nguyen, Stefan Roscher, Mel Gorman Hi Roland, On Wednesday 12 May 2010 06:40:16 pm Roland Dreier wrote: > > * added get_huge_page_size() to read the huge page size from > > /proc/meminfo. This is done at ibv_fork_init() time. > > That doesn't work on systems that have multiple huge page sizes (eg > powerpc). You can compare the code to get the size in libhugetlbfs. > > Also I think the munging through /proc/pid/maps doesn't really work. > First of all, essentially grepping for libhugetlbfs is not robust as I > mentioned -- this will break in the same way for apps using huge pages > directly. I agree that this two points (multiple huge page sizes and performance) are two critical issues which we have to attack. One suggestion from our side would be to use /proc/pid/smaps which would workaround the multiple huge page size problem. It would not solve the perfomance issue, because reading the file would assume long time even if we buffer the ranges and pagesize during fork_init. Do you have an oppinion on that? We appreciate your help! regards Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2010-05-18 11:04 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-06 7:39 [RFC] libibverbs: ibv_fork_init() and libhugetlbfs Alexander Schmidt
2010-05-06 20:55 ` Roland Dreier
[not found] ` <adavdb092d8.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-05-07 10:19 ` Alexander Schmidt
2010-05-12 16:40 ` Roland Dreier
[not found] ` <adafx1xul8v.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-05-18 11:04 ` [ewg] " Stefan Roscher
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox