* Re: [PATCH] libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()
2014-05-21 13:27 ` Ian Campbell
@ 2014-05-21 16:05 ` Luis R. Rodriguez
2014-12-19 17:22 ` Olaf Hering
1 sibling, 0 replies; 4+ messages in thread
From: Luis R. Rodriguez @ 2014-05-21 16:05 UTC (permalink / raw)
To: Ian Campbell, Mel Gorman; +Cc: xen-devel
On Wed, May 21, 2014 at 6:27 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Tue, 2014-05-20 at 05:37 -0700, Luis R. Rodriguez wrote:
>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>
>> On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4
>> and today's latest xen tip from the git tree strace -f reveals
>> we end up on a never ending wait shortly after
>>
>> write(20, "backend/console/5\0", 18 <unfinished ...>
>>
>> This is right before we just wait on the qemu process which we
>> had mmap'd for. Without this you'll end up getting stuck on a
>> loop if mmap() worked but madvise() did not. While at it I noticed
>> even the mmap() error fail was not being checked, fix that too.
>>
>> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
>
> Acked-by: Ian Campbell <ian.campbell@citrix.com> and applied.
>
> OOI why was madvise failing? (should be quite unusual I think?)
For the sake of Mel the context of the patch was:
> tools/libxc/xc_linux_osdep.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/tools/libxc/xc_linux_osdep.c b/tools/libxc/xc_linux_osdep.c
> index 73860a2..86bff3e 100644
> --- a/tools/libxc/xc_linux_osdep.c
> +++ b/tools/libxc/xc_linux_osdep.c
> @@ -92,14 +92,32 @@ static void *linux_privcmd_alloc_hypercall_buffer(xc_interface *xch, xc_osdep_ha
> {
> size_t size = npages * XC_PAGE_SIZE;
> void *p;
> + int rc, saved_errno;
>
> /* Address returned by mmap is page aligned. */
> p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0);
> + if ( p == MAP_FAILED )
> + {
> + PERROR("xc_alloc_hypercall_buffer: mmap failed");
> + return NULL;
> + }
>
> /* Do not copy the VMA to child process on fork. Avoid the page being COW
> on hypercall. */
> - madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK);
> + rc = madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK);
> + if ( rc < 0 )
> + {
> + PERROR("xc_alloc_hypercall_buffer: madvise failed");
> + goto out;
> + }
> +
> return p;
> +
> +out:
> + saved_errno = errno;
> + (void)munmap(p, size);
> + errno = saved_errno;
> + return NULL;
> }
I tried to dive deep into all the possible worst case scenarios of the
above but moved on after a bit, but I'll give my brain dump so far as
the question of whether or not madvise() failed is just one of the
questions that should be considered here.
Although MADV_HUGEPAGE is not *explicitly* used here strace shows its
also used. For example my traces showed the madvise() was called for
qemu, this is part of strace -f -o stuff.txt xl create
/etc/xen/debian.hvm
1735 execve("/usr/bin/qemu-system-x86_64",
["/usr/bin/qemu-system-x86_64", "-xen-domid", "1", "-chardev",
"socket,id=libxl-cmd,path=/var/ru"..., "-mon",
"chardev=libxl-cmd,mode=control", "-nodefaults", "-name",
"debian.hwm", "-vnc", "127.0.0.1:0,to=99", "-serial", "pty",
"-device", "cirrus-vga", ...], [/* 59 vars */] <unfinished ...>
Further down I see:
1735 mmap(NULL, 1576960, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f50116000
1735 madvise(0, 1069547520, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory)
1735 madvise(0, 1069547520, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory)
I believe that MADV_HUGEPAGE is used although we didn't ask for it
explicitly because of the size mmap()'d, in this case it had turned
out to be the memory that we'd want to use for our qemu process. I am
not certain but I believe this can happen automatically on kernels
that have huge page support and when the mmap()'d page is huge page
aligned. Now, the madvise() call for MADV_HUGEPAGE should not be
expected to succeed though, for example I suspect this can fail when a
system with huge page tables don't have it set up properly yet [0].
The important piece here however for us is the use of MADV_DONTFORK
given it was passed explicitly and that *can* also fail but its
important for us that is does not fail given the context under which
its being used. The consequences of MADV_DONTFORK failing and not
checking for the error vary, let's explore this particular case a bit
more. In this case its for qemu -- its unclear to me what would happen
if MADV_DONTFORK was used and it failed and we still continued but I
do have some information as to what can happen if MADV_DONTFORK is
*not* used and what gave birth to the flag -- if not used and the
parent process forks its possible that "the driver retains a reference
to the page which now belongs exclusively to the child process(es).
The parent process and the driver will no longer be able to
communicate with each other." [1]. This does not necessarily mean that
failure to check for madvise() with MADV_DONTFORK implicates the same
dangers -- I am not certain of this.
What's curious however is that since MADV_HUGEPAGE was pegged
automatically for us and since a system could not have huge page table
set up properly this madvise() can fail, but -- calls to madvise()
with MADV_HUGEPAGE should not be considered fatal, however since we
were using MADV_DONTFORK and we don't want that to fail the addition
of MADV_HUGEPAGE is perhaps not welcomed here unless we know that was
set up. Its unclear if in such cases using MADV_DONTFORK with
MADV_NOHUGEPAGE is advisable -- but my strace shows these are split up
-- we still however never checked its return value and consequences of
this are still a bit unclear to me.
In the now working situation I see now that madvise() with
MADV_HUGEPAGE and then MADV_DONTFORK fails and then see a mremap() on
the first returned address followed by the same madvise() failing
calls, after that I see a new mmap() with different mmap() flags
passed.
31318 execve("/usr/local/lib/xen/bin/qemu-system-i386",
["/usr/local/lib/xen/bin/qemu-syst"..., "-xen-domid", "2", "-chardev",
"socket,id=libxl-cmd,path=/var/ru"..., "-mon",
"chardev=libxl-cmd,mode=control", "-nodefaults", "-name",
"debian.hwm", "-vnc", "127.0.0.1:0,to=99", "-device",
"cirrus-vga,vgamem_mb=8", "-boot", "order=cda", ...], [/* 60 vars */]
<unfinished ...>
And then:
31318 mmap(NULL, 262144, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fceade29000
31318 madvise(0, 1069547520, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory)
31318 madvise(0, 1069547520, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory)
Followed by:
31318 mremap(0x7fceade29000, 262144, 266240, MREMAP_MAYMOVE) = 0x7fcea2a8f000
31318 madvise(0, 8388608, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory)
31318 madvise(0, 8388608, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory)
31318 mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) =
0x7fcea228f000
Note that my patch also covers a failure to check mmap() too --
however I believe if that failed write access to the mmap()'d area
would trigger a quick enough failure. My hope is that this is the same
for the failure to check for madvise() with MADV_DONTFORK. Given my
uncertainty I did not include all this information in the commit log.
[0] http://lwn.net/Articles/376606/
[1] http://lwn.net/Articles/171941/
Luis
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()
2014-05-21 13:27 ` Ian Campbell
2014-05-21 16:05 ` Luis R. Rodriguez
@ 2014-12-19 17:22 ` Olaf Hering
1 sibling, 0 replies; 4+ messages in thread
From: Olaf Hering @ 2014-12-19 17:22 UTC (permalink / raw)
To: Ian Campbell, Jan Beulich, Ian Jackson; +Cc: xen-devel, Luis R. Rodriguez
Please backport this patch to 4.4.
Other branches may need the mmap() check as well. The callers expect
either NULL or a valid pointer.
It is upstream commit e86539a388314cd3dca88f5e69d7873343197cd8
Thanks,
Olaf
On Wed, May 21, Ian Campbell wrote:
> On Tue, 2014-05-20 at 05:37 -0700, Luis R. Rodriguez wrote:
> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> >
> > On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4
> > and today's latest xen tip from the git tree strace -f reveals
> > we end up on a never ending wait shortly after
> >
> > write(20, "backend/console/5\0", 18 <unfinished ...>
> >
> > This is right before we just wait on the qemu process which we
> > had mmap'd for. Without this you'll end up getting stuck on a
> > loop if mmap() worked but madvise() did not. While at it I noticed
> > even the mmap() error fail was not being checked, fix that too.
> >
> > Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
>
> Acked-by: Ian Campbell <ian.campbell@citrix.com> and applied.
>
> OOI why was madvise failing? (should be quite unusual I think?)
>
> > ---
> > tools/libxc/xc_linux_osdep.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/libxc/xc_linux_osdep.c b/tools/libxc/xc_linux_osdep.c
> > index 73860a2..86bff3e 100644
> > --- a/tools/libxc/xc_linux_osdep.c
> > +++ b/tools/libxc/xc_linux_osdep.c
> > @@ -92,14 +92,32 @@ static void *linux_privcmd_alloc_hypercall_buffer(xc_interface *xch, xc_osdep_ha
> > {
> > size_t size = npages * XC_PAGE_SIZE;
> > void *p;
> > + int rc, saved_errno;
> >
> > /* Address returned by mmap is page aligned. */
> > p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0);
> > + if ( p == MAP_FAILED )
> > + {
> > + PERROR("xc_alloc_hypercall_buffer: mmap failed");
> > + return NULL;
> > + }
> >
> > /* Do not copy the VMA to child process on fork. Avoid the page being COW
> > on hypercall. */
> > - madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK);
> > + rc = madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK);
> > + if ( rc < 0 )
> > + {
> > + PERROR("xc_alloc_hypercall_buffer: madvise failed");
> > + goto out;
> > + }
> > +
> > return p;
> > +
> > +out:
> > + saved_errno = errno;
> > + (void)munmap(p, size);
> > + errno = saved_errno;
> > + return NULL;
> > }
> >
> > static void linux_privcmd_free_hypercall_buffer(xc_interface *xch, xc_osdep_handle h, void *ptr, int npages)
^ permalink raw reply [flat|nested] 4+ messages in thread