Re: Understanding osdep_xenforeignmemory_map mmap behaviour

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Alex Bennée" <alex.bennee@linaro.org>
To: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	xen-devel@lists.xen.org, Stratos-dev@op-lists.linaro.org,
	viresh.kumar@linaro.org,
	Mathieu Poirier <mathieu.poirier@linaro.org>,
	christopher.w.clark@gmail.com, boris.ostrovsky@oracle.com,
	Arnd Bergmann <arnd.bergmann@linaro.org>
Subject: Re: Understanding osdep_xenforeignmemory_map mmap behaviour
Date: Fri, 25 Mar 2022 16:07:51 +0000	[thread overview]
Message-ID: <87mthe0zea.fsf@linaro.org> (raw)
In-Reply-To: <57549560-879d-f705-8693-9bfdc73e3f7f@suse.com>


(add Arnd to CC)

Juergen Gross <jgross@suse.com> writes:

> [[PGP Signed Part:Undecided]]
> On 24.03.22 02:42, Stefano Stabellini wrote:
>> I am pretty sure the reasons have to do with old x86 PV guests, so I am
>> CCing Juergen and Boris.
>> 
>>> Hi,
>>>
>>> While we've been working on the rust-vmm virtio backends on Xen we
>>> obviously have to map guest memory info the userspace of the daemon.
>>> However following the logic of what is going on is a little confusing.
>>> For example in the Linux backend we have this:
>>>
>>>    void *osdep_xenforeignmemory_map(xenforeignmemory_handle *fmem,
>>>                                     uint32_t dom, void *addr,
>>>                                     int prot, int flags, size_t num,
>>>                                     const xen_pfn_t arr[/*num*/], int err[/*num*/])
>>>    {
>>>        int fd = fmem->fd;
>>>        privcmd_mmapbatch_v2_t ioctlx;
>>>        size_t i;
>>>        int rc;
>>>
>>>        addr = mmap(addr, num << XC_PAGE_SHIFT, prot, flags | MAP_SHARED,
>>>                    fd, 0);
>>>        if ( addr == MAP_FAILED )
>>>            return NULL;
>>>
>>>        ioctlx.num = num;
>>>        ioctlx.dom = dom;
>>>        ioctlx.addr = (unsigned long)addr;
>>>        ioctlx.arr = arr;
>>>        ioctlx.err = err;
>>>
>>>        rc = ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &ioctlx);
>>>
>>> Where the fd passed down is associated with the /dev/xen/privcmd device
>>> for issuing hypercalls on userspaces behalf. What is confusing is why
>>> the function does it's own mmap - one would assume the passed addr would
>>> be associated with a anonymous or file backed mmap region already that
>>> the calling code has setup. Applying a mmap to a special device seems a
>>> little odd.
>>>
>>> Looking at the implementation on the kernel side it seems the mmap
>>> handler only sets a few flags:
>>>
>>>    static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
>>>    {
>>>            /* DONTCOPY is essential for Xen because copy_page_range doesn't know
>>>             * how to recreate these mappings */
>>>            vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY |
>>>                             VM_DONTEXPAND | VM_DONTDUMP;
>>>            vma->vm_ops = &privcmd_vm_ops;
>>>            vma->vm_private_data = NULL;
>>>
>>>            return 0;
>>>    }
>>>
>>> So can I confirm that the mmap of /dev/xen/privcmd is being called for
>>> side effects? Is it so when the actual ioctl is called the correct flags
>>> are set of the pages associated with the user space virtual address
>>> range?
>>>
>>> Can I confirm there shouldn't be any limitation on where and how the
>>> userspace virtual address space is setup for the mapping in the guest
>>> memory?
>>>
>>> Is there a reason why this isn't done in the ioctl path itself?
>
> For a rather long time we were using "normal" user pages for this purpose,
> which were just locked into memory for doing the hypercall.

Was this using the normal mlock() semantics to stop pages being swapped
out of RAM?

> Unfortunately there have been very rare problems with that approach, as
> the Linux kernel can set a user page related PTE to invalid for short
> periods of time, which led to EFAULT in the hypervisor when trying to
> access the hypercall data.

I must admit I'm not super familiar with the internals of page table
handling with Linux+Xen. Doesn't the kernel need to delegate the
tweaking of page tables to the hypervisor or is it allowed to manipulate
the page tables itself?

> In Linux this can avoided only by using kernel memory, which is the
> reason why the hypercall buffers are allocated and mmap()-ed through the
> privcmd driver.
>
>>>
>>> I'm trying to understand the differences between Xen and KVM in the API
>>> choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION
>>> ioctl for KVM which brings a section of the guest physical address space
>>> into the userspaces vaddr range.
>
> The main difference is just that the consumer of the hypercall buffer is
> NOT the kernel, but the hypervisor. In the KVM case both are the same, so
> a brief period of an invalid PTE can be handled just fine in KVM, while
> the Xen hypervisor has no idea that this situation will be over very
> soon.

I still don't follow the details of why we have the separate mmap. Is it
purely because the VM flags of the special file can be changed in a way
that can't be done with a traditional file-backed mmap?

I can see various other devices have their own setting of vm flags but
VM_DONTCOPY for example can be set with the appropriate madvise call:

       MADV_DONTFORK (since Linux 2.6.16)
              Do not make the pages in this range available to the child after
              a fork(2).  This is useful to  prevent  copy-on-write  semantics
              from  changing  the  physical  location  of a page if the parent
              writes to it after a  fork(2).   (Such  page  relocations  cause
              problems for hardware that DMAs into the page.)

For the vhost-user work we need to be able to share the guest memory
between the xen-vhost-master (which is doing the ioctls to talk to Xen)
and the vhost-user daemon (which doesn't know about hypervisors but just
deals in memory and events).

Would it be enough to loosen the API and just have xen_remap_pfn()
verify the kernels VM flags are appropriately set before requesting Xen
updates the page tables?

-- 
Alex Bennée

next prev parent reply	other threads:[~2022-03-25 16:29 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-24  1:42 Understanding osdep_xenforeignmemory_map mmap behaviour Stefano Stabellini
2022-03-24  5:12 ` Juergen Gross
2022-03-25 16:07   ` Alex Bennée [this message]
2022-03-28  6:26     ` Juergen Gross
2022-08-24  9:19   ` Viresh Kumar
2022-08-24  9:44     ` Andrew Cooper
2022-08-24 11:22       ` Alex Bennée
2022-08-24 13:07         ` Juergen Gross
2022-08-24 15:58           ` Alex Bennée
2022-08-25  6:02             ` Juergen Gross
2022-08-31 16:02               ` Alex Bennée
2022-09-01  6:28                 ` Juergen Gross
2022-08-24 10:09     ` Juergen Gross
2022-08-24 11:47       ` Alex Bennée
2022-08-24 13:14         ` Juergen Gross

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mthe0zea.fsf@linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=Stratos-dev@op-lists.linaro.org \
    --cc=arnd.bergmann@linaro.org \
    --cc=boris.ostrovsky@oracle.com \
    --cc=christopher.w.clark@gmail.com \
    --cc=jgross@suse.com \
    --cc=mathieu.poirier@linaro.org \
    --cc=sstabellini@kernel.org \
    --cc=viresh.kumar@linaro.org \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.