Linux grant map/unmap improvement proposal (Draft B)

* Linux grant map/unmap improvement proposal (Draft B)
@ 2014-10-13 13:41 David Vrabel
  2014-10-13 16:43 ` Stefano Stabellini
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: David Vrabel @ 2014-10-13 13:41 UTC (permalink / raw)
  To: Xen-devel@lists.xen.org

Grant mapping in the Linux kernel has a number of problems:

* Grant mapping from userspace is broken for many real world use
  cases.

* Netback does not handle sending packets to network storage provided
  by a VM on the same host.

* Using blkback with network-based storage is unsafe.

* Performance is poor, particularly with userspace grants.

A PDF version of this document is available from:

  http://xenbits.xen.org/people/dvrabel/grant-improvements-B.pdf

Userspace grant maps
--------------------

Certain types of system calls using foreign mappings require
translating the virtual address to a page using `get_user_pages()` or
`get_user_pages_fast()`.  These system calls include direct I/O and
asynchronous I/O (AIO).

In the native case this translation is done by walking the userspace
page tables and looking up the PFN in the L1 entry.  PFN to page is
then trivial.

For a PV guest this L1 entry contains an MFN and this first needs to
be translated into a PFN.  For a normal frame this is a simple lookup
in the M2P.  For foreign pages, the gntdev driver maintains an
additional hash of foreign MFNs to local PFNs called the m2p_override.

The m2p_override table has a fundamental design flaw.

A domain may grant a frame multiple times, using a different grant
reference each time.  The backend maps each grant reference to a
separate page.  The 1-to-many MFN-to-page mapping cannot be
represented in the 1-to-1 m2p_override table and I/O to or from these
mappings cannot get the correct page.
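
The flaw can be illustrated with a small userspace model.  This is
only a sketch: `override_table`, `override_add()` and
`override_lookup()` are hypothetical names for a simplified 1-to-1
table keyed by MFN, not the kernel's actual m2p_override code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OVERRIDE_SLOTS 64   /* arbitrary size for the sketch */

/* One slot per foreign MFN: the table is 1-to-1 by construction. */
struct override_entry {
    uint64_t mfn;   /* foreign machine frame number (key) */
    uint64_t pfn;   /* local frame the grant was mapped onto */
    int used;
};

struct override_table {
    struct override_entry slot[OVERRIDE_SLOTS];
};

/* Linear probing: return the slot holding mfn, or the first free one. */
static struct override_entry *override_find(struct override_table *t,
                                            uint64_t mfn)
{
    for (size_t n = 0; n < OVERRIDE_SLOTS; n++) {
        struct override_entry *e = &t->slot[(mfn + n) % OVERRIDE_SLOTS];

        if (!e->used || e->mfn == mfn)
            return e;
    }
    return NULL;    /* table full */
}

/* Record mfn -> pfn.  A second mapping of the same MFN overwrites. */
static void override_add(struct override_table *t, uint64_t mfn,
                         uint64_t pfn)
{
    struct override_entry *e = override_find(t, mfn);

    assert(e != NULL);
    e->mfn = mfn;
    e->pfn = pfn;
    e->used = 1;
}

/* Returns 0 and fills *pfn on a hit, -1 if the MFN is not present. */
static int override_lookup(struct override_table *t, uint64_t mfn,
                           uint64_t *pfn)
{
    struct override_entry *e = override_find(t, mfn);

    if (e == NULL || !e->used)
        return -1;
    *pfn = e->pfn;
    return 0;
}
```

If the same MFN is granted twice and mapped onto local PFNs 100 and
200, a lookup in this model can only ever return 200, so I/O aimed at
the first mapping would be directed at the wrong page.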

Transmitting foreign pages to guests
------------------------------------

When sending pages to the guest, netback uses a grant copy operation
to copy the data into the frames granted by the guest.  This grant copy
requires either a local GFN _or_ a grant reference; it is not possible
to grant copy to/from a foreign mapping.

In order to support VM to VM traffic, netback stores the grant
reference for the sender VM in the socket buffer structure which may
then be used by the receiving netback for the grant copy.

Packets with foreign pages from other sources cannot be successfully
copied, since netback does not know the grant reference.  One such
configuration is a VM providing an iSCSI or other network-based
storage that presents a block device in the backend that is then used
by another VM on the same host.

Blkback and network storage
---------------------------

Blkback unmaps the foreign pages in an I/O request when the request is
completed.  If networked storage is used it is possible for requests
to be completed while the skbs referring to those pages are still
queued for transmit (e.g., because a retransmission was queued while
the response to the original packet was in flight).

When the network driver attempts to send the packet with the unmapped
page it may:

- Fault while trying to access the unmapped page.

- Transmit from a frame that is no longer granted (potentially
  transmitting sensitive guest or Xen data).

The fault does not occur with userspace storage backends since gntdev
replaces the foreign mapping with one to a local scratch page.  It
uses GNTTABOP_unmap_and_replace which atomically replaces the foreign
mapping with another (source) mapping.  However, this cannot be used
with batched operations since it clears the source mapping, and it
does not protect against transmitting from a non-granted frame.

Design
======

Map onto ballooned pages only
-----------------------------

Grant maps will only be permitted with ballooned pages.

The original p2m entry for these pages will always be INVALID_MFN and
thus the original MFN does not need to be saved on map and restored on
unmap.

Grant map/unmap will no longer need to use or clobber `page->index`.
This allows a workaround in netback to clear `page->pfmemalloc` to be
removed (`index` and `pfmemalloc` are part of the same union).

Safe grant unmap
----------------

Grant references will only be unmapped when they are no longer in use,
i.e., when the page reference count is one.

    int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
        struct gnttab_unmap_grant_ref *kunmap_ops,
        struct page **pages, unsigned int count,
        void (*done)(void *data), void *data);

The `gnttab_unmap_refs_async()` function will unmap the grant
references using the supplied unmap operations and call `done(data)`.
The grant unmap will only be done once all pages are no longer in use.

It shall run synchronously on the first attempt (this is expected to
be the most common case).  If any page is in use, it shall queue the
unmap request to be tried at a later time.

Only the blkback and gntdev devices need to use asynchronous unmaps.
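
The intended behaviour can be modelled in userspace as follows.  This
is a minimal sketch and not the proposed kernel implementation:
`toy_page`, `unmap_request`, `try_unmap()` and `count_done()` are
hypothetical names, the unmap hypercall is replaced by clearing a
flag, and the retry queue is reduced to a `pending` marker that the
caller is assumed to act on later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: only the fields this sketch needs. */
struct toy_page {
    int refcount;   /* 1 means only the grant mapping holds the page */
    bool mapped;    /* true while the grant is still mapped */
};

struct unmap_request {
    struct toy_page **pages;
    size_t count;
    void (*done)(void *data);   /* completion callback */
    void *data;
    bool pending;               /* true if the unmap must be retried */
};

/* Example completion callback: counts how often it was called. */
static void count_done(void *data)
{
    (*(int *)data)++;
}

/* A page is idle when nothing but the mapping itself references it. */
static bool all_pages_idle(const struct unmap_request *req)
{
    for (size_t i = 0; i < req->count; i++)
        if (req->pages[i]->refcount != 1)
            return false;
    return true;
}

/*
 * Try the unmap once.  Returns true if every page was unmapped and
 * done() was called; returns false and marks the request pending if
 * any page is still in use.
 */
static bool try_unmap(struct unmap_request *req)
{
    assert(req->done != NULL);

    if (!all_pages_idle(req)) {
        req->pending = true;
        return false;
    }
    for (size_t i = 0; i < req->count; i++)
        req->pages[i]->mapped = false;  /* stands in for the hypercall */
    req->pending = false;
    req->done(req->data);
    return true;
}
```

In this model the first call to `try_unmap()` corresponds to the
synchronous attempt; a later call on a pending request corresponds to
the deferred retry.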

Userspace address to page translation
-------------------------------------

The m2p_override table shall be removed.

Each VMA (`struct vm_area_struct`) shall contain an additional pointer
to an optional array of pages.  This array shall be sized to cover the
full extent of the VMA.

The gntdev driver populates this array with the relevant pages for the
foreign mappings as they are mapped.  It shall also clear them when
unmapping.  The gntdev driver must ensure it properly splits the page
array when the VMA itself is split.

Since the m2p lookup will not return a local PFN, the native
`get_user_pages_fast()` call will fail.  Prior to attempting to fault
in the pages, `get_user_pages()` can simply look up the pages in the
VMA's page array.
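
A userspace sketch of the lookup and of the split handling is given
below.  All names (`toy_vma`, `vma_lookup_page()`, `vma_split()`) and
the 4 KiB page size are assumptions made for illustration; the real
VMA structure and the gntdev split hook will look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assume 4 KiB pages */

struct toy_page {
    int id;
};

struct toy_vma {
    uintptr_t start;            /* first address, inclusive */
    uintptr_t end;              /* last address + 1 */
    struct toy_page **pages;    /* optional; NULL if no foreign maps */
};

/* Translate a user address to a page using the per-VMA array. */
static struct toy_page *vma_lookup_page(const struct toy_vma *vma,
                                        uintptr_t addr)
{
    if (vma->pages == NULL || addr < vma->start || addr >= vma->end)
        return NULL;
    return vma->pages[(addr - vma->start) >> TOY_PAGE_SHIFT];
}

/*
 * Split a VMA at a page-aligned address.  The upper half points into
 * the tail of the same array, so each half indexes from its own start.
 */
static void vma_split(struct toy_vma *lower, struct toy_vma *upper,
                      uintptr_t addr)
{
    assert(addr > lower->start && addr < lower->end);
    assert((addr & (((uintptr_t)1 << TOY_PAGE_SHIFT) - 1)) == 0);

    upper->start = addr;
    upper->end = lower->end;
    upper->pages = lower->pages
        ? lower->pages + ((addr - lower->start) >> TOY_PAGE_SHIFT)
        : NULL;
    lower->end = addr;
}
```

An entry that is NULL in the array models a slot with no foreign page
mapped, in which case the caller would fall back to the normal fault
path.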

`page->private` will no longer need to be set to the MFN.

This is similar to the approach used in the classic kernel.

Identifying foreign pages
-------------------------

A new page flag is introduced: PG_foreign.  This will alias PG_pinned
so it does not require an additional bit.

If PG_foreign is set then `page->private` contains the grant reference
and domid for this foreign page.  This information can only be packed
into an unsigned long on 64-bit platforms.  32-bit platforms will have
to allocate an additional structure to store the domid and gref.
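
The size constraint follows from the field widths: a domid is 16 bits
and a grant reference is 32 bits, so the pair needs 48 bits.  A
possible packing is sketched below.  The bit layout (domid above the
gref) and the names `foreign_pack()` and `foreign_unpack()` are
assumptions, not part of this proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Same widths as Xen's domid_t (16 bits) and grant_ref_t (32 bits). */
typedef uint16_t toy_domid_t;
typedef uint32_t toy_grant_ref_t;

#define FOREIGN_GREF_BITS 32

/* Pack domid into bits 32..47 and gref into bits 0..31. */
static uint64_t foreign_pack(toy_domid_t domid, toy_grant_ref_t gref)
{
    return ((uint64_t)domid << FOREIGN_GREF_BITS) | gref;
}

/* Reverse of foreign_pack(); the top 16 bits must be unused. */
static void foreign_unpack(uint64_t priv, toy_domid_t *domid,
                           toy_grant_ref_t *gref)
{
    assert((priv >> 48) == 0);
    *domid = (toy_domid_t)(priv >> FOREIGN_GREF_BITS);
    *gref = (toy_grant_ref_t)(priv & 0xffffffffu);
}
```

The packed value can exceed 32 bits, which is why a 32-bit
`unsigned long` cannot hold it.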

The aliasing of PG_foreign and PG_pinned is safe because:

- Page table pages will never be foreign.
- Foreign pages shall have `p2m[P] & FOREIGN_FRAME_BIT`.

The use of the private field is safe because:

- The page is allocated by the balloon driver and thus it owns the
  private field.

- The other fields in the union (ptl, slab_cache, and first_page) will
  not be used because the page is not used in a page table, slab or
  compound page.

Netback can thus:

1. Test PG_foreign.
2. Verify that the page is foreign via the p2m.
3. Extract the domid and gref from page->private.

The PG_foreign test is not strictly necessary as the p2m lookup is
sufficient, but it should be quicker for non-foreign pages.
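
The three steps above might look roughly like this.  It is a
userspace sketch only: `toy_page`, `page_grant_info()`, the flag and
p2m bit values, and the layout of the private field are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PG_FOREIGN        (1ul << 0)    /* hypothetical flag bit */
#define TOY_FOREIGN_FRAME_BIT (1ull << 62)  /* hypothetical p2m marker */

struct toy_page {
    unsigned long flags;
    uint64_t priv;      /* assumed layout: domid << 32 | gref */
    uint64_t p2m_entry; /* stands in for p2m[P] of this page */
};

/*
 * Returns true and fills in domid/gref if the page is foreign,
 * false otherwise.
 */
static bool page_grant_info(const struct toy_page *pg,
                            uint16_t *domid, uint32_t *gref)
{
    assert(pg != NULL && domid != NULL && gref != NULL);

    /* 1. Cheap test that rejects most non-foreign pages. */
    if (!(pg->flags & TOY_PG_FOREIGN))
        return false;

    /* 2. The flag is aliased, so confirm via the p2m entry. */
    if (!(pg->p2m_entry & TOY_FOREIGN_FRAME_BIT))
        return false;

    /* 3. Extract the domid and gref from the private field. */
    *domid = (uint16_t)(pg->priv >> 32);
    *gref = (uint32_t)(pg->priv & 0xffffffffu);
    return true;
}
```

A pinned page table page has the aliased flag set but no foreign
marker in its p2m entry, so it is rejected at step 2.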

Userspace grant performance
---------------------------

Since the m2p_override table will be removed, the gntdev device may
easily batch the grant map and unmap hypercalls that update the kernel
mappings.
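
The saving from batching can be sketched with a simulated hypercall
that counts how often it is invoked.  Everything here is hypothetical
(`toy_op`, `toy_hypercall()`, `map_batched()` and the batch limit of
4); it only shows the shape of the loop, not the gntdev code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_MAX 4     /* assumed per-call limit for the sketch */

struct toy_op {
    unsigned int gref;
    int status;             /* 0 on success */
};

static unsigned int toy_hypercalls;     /* number of simulated calls */

/* Stand-in for one grant table hypercall handling `count` ops. */
static void toy_hypercall(struct toy_op *ops, size_t count)
{
    toy_hypercalls++;
    for (size_t i = 0; i < count; i++)
        ops[i].status = 0;
}

/* Issue the ops in batches of at most TOY_BATCH_MAX. */
static void map_batched(struct toy_op *ops, size_t count)
{
    assert(ops != NULL || count == 0);

    while (count > 0) {
        size_t n = count < TOY_BATCH_MAX ? count : TOY_BATCH_MAX;

        toy_hypercall(ops, n);
        ops += n;
        count -= n;
    }
}
```

With these numbers, ten operations need three simulated hypercalls
instead of ten.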

The use of the scratch pages on unmap will be unnecessary and can be
removed.

Other improvements that may be considered are:

- Batch the userspace and kernel map and unmap.

- Lazily map grants into userspace on faults.  For applications that
  do not access the foreign frames by the userspace mappings (such as
  block backends using direct I/O) this would avoid a set of maps and
  unmaps. This lazy mode would have to be requested by the userspace
  program (since faulting many pages would be much more expensive than
  a single batched map).

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-13 13:41 Linux grant map/unmap improvement proposal (Draft B) David Vrabel
@ 2014-10-13 16:43 ` Stefano Stabellini
  2014-10-13 17:22   ` David Vrabel
  2014-10-14 10:27 ` Ian Campbell
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Stefano Stabellini @ 2014-10-13 16:43 UTC (permalink / raw)
  To: David Vrabel; +Cc: Xen-devel@lists.xen.org

On Mon, 13 Oct 2014, David Vrabel wrote:
> Grant mapping in the Linux kernel has a number of problems:
> 
> * Grant mapping from userspace is broken for many real world use
>   cases.
> 
> * Netback does not handle sending packets to network storage provided
>   by a VM on the same host.
> 
> * Using blkback with network-based storage is unsafe.
> 
> * Performance is poor, particularly with userspace grants.
> 
> A PDF version of this document as available from:
> 
>   http://xenbits.xen.org/people/dvrabel/grant-improvements-B.pdf
> 
> 
> Userspace grant maps
> --------------------
> 
> Certain types of system calls using foreign mappings require
> translating the virtual address to a page using `get_user_pages()` or
> `get_user_pages_fast()`.  These system calls include direct I/O and
> asynchronous I/O (AIO).
> 
> In the native case this translation is done by walking the userspace
> page tables and looking up the PFN in the L1 entry.  PFN to page is
> then trivial.
> 
> For a PV guest this L1 entry contains an MFN and this first needs to
> be translated into a PFN.  For a normal frame this is a simple lookup
> in the M2P.  For foreign pages, the gntdev driver maintains an
> additional hash of foreign MFNs to local PFNs called the m2p_override.
> 
> The m2p_override table has a fundamental design flaw.
> 
> A domain may grant a frame multiple times, using a different grant
> reference each time.  The backend maps each grant reference to a
> separate page.  The 1-to-many MFN-to-page mapping cannot be
> represented in the 1-to-1 m2p_override table and I/O to or from these
> mappings cannot get the correct page.
> 
> Transmitting foreign pages to guests
> ------------------------------------
> 
> Netback when sending pages to the guest uses a grant copy operation to
> copy the data into the frames granted by the guest.  This grant copy
> requires either a local GFN _or_ a grant reference; it is not possible
> to grant copy to/from a foreign mapping.
> 
> In order to support VM to VM traffic, netback stores the grant
> reference for the sender VM in the socket buffer structure which may
> then be used by the receiving netback for the grant copy.
> 
> Packets with foreign pages from other sources cannot be successfully
> copied, since netback does not know the grant reference.  Once such
> configuration is a VM providing an iSCSI or other network-based
> storage that presents a block device in the backend that is then used
> by another VM on the same host.
> 
> Blkback and network storage
> ---------------------------
> 
> Blkback unmaps the foreign pages in a I/O request when the request is
> completed.  If networked storage is used it is possible for requests
> to be completed while the skbs referring to those pages are still
> queued for transmit (e.g., because a retransmission was queued while
> the responds to the original packet was in flight).
> 
> When the network driver attempts to send the packet with the unmapped
> page it may:
> 
> - Fault while trying to access the unmapped page.
> 
> - Transmit from a frame that is no longer granted (potentially
>   transmitting sensitive guest or Xen data).
> 
> The fault does not occur with userspace storage backends since gntdev
> replaces the foreign mapping with one to a local scratch page.  It
> uses GNTOP_unmap_and_replace which atomically replaces the foreign
> mapping with another (source) mapping.  However, this cannot be used
> with batched operations since it clears the source mapping and it does
> not prevent against transmitting from a non-granted frame.
 
This is a very good summary of the issues we are currently having with
Xen support in Linux. As such, I think I should add one that is missing
from the list but is good to keep in mind. I should point out that I am
not asking you to do anything about it at the moment.


dma_ops.unmap_page and dma_ops.unmap_sg only pass dma addresses as arguments
----------------------------------------------------------------------------

The Linux dma_map_ops API consists of a number of functions that only
provide the dma address of the dma request as argument, not the struct
page or the physical address. For example unmap_page and unmap_sg.

For Xen PV guests the dma address is a machine address. If the machine
address corresponds to a foreign page (granted to the current domain),
there is no easy way for us to retrieve the corresponding struct page or
guest physical address (other than the m2p_override with all its
problems).

This is a serious limitation, in particular if we need to do any
operations on the memory region at the time one of these functions is
called:
- on x86 fortunately we don't need to do anything;
- on ARM, if the device is not dma coherent, we might have to issue cache
maintenance operations.

 
> Design
> ======
> 
> Map onto ballooned pages only
> -----------------------------
> 
> Grant maps will only be permitted with ballooned pages.
> 
> The original p2m entry for these pages will always be INVALID_MFN and
> thus the original MFN does not need to saved on map and restored on
> unmap.
> 
> Grant map/unmap will no longer need to use or clobber `page->index`.
> This allows a workaround in netback to clear `page->pfmemalloc` to be
> removed (`index` and `pfmemalloc` are part of the same union).
> 
> 
> Safe grant unmap
> ----------------
> 
> Grant references will only be unmapped when they are no longer in use.
> i.e., the page reference count is one.
> 
>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>         struct gnttab_unmap_grant_ref *kunmap_ops,
>         struct page **pages, unsigned int count,
>         void (*done)(void *data), void *data);
> 
> The `gnttab_unmap_refs_async()` function will unmap the grant
> references using the supplied unmap operations and call `done(data)`.
> The grant unmap will only be done once all pages are no longer in use.
> 
> It shall run synchronously on the first attempt (this is expected to
> be the most common case).  If any page is in use, it shall queue the
> unmap request to be tried at a later time.
> 
> Only the blkback and gntdev devices need to use asynchronouse unmaps.
> 
> 
> Userspace address to page translation
> -------------------------------------
> 
> The m2p_override table shall be removed.
> 
> Each VMA (struct vm_struct) shall contain an additional pointer to an
> optional array of pages.  This array shall be sized to cover the full
> extent of the VMA.
> 
> The gntdev driver populates this array with the relevant pages for the
> foreign mappings as they are mapped.  It shall also clear them when
> unmapping.  The gntdev driver must ensure it properly splits the page
> array when the VMA itself is split.
> 
> Since the m2p lookup will not return a local PFN, the native
> get_user_pages_fast() call will fail.  Prior to attempting to fault in
> the pages, get_user_pages() can simply look up the pages in the VMA's
> page array.
> 
> `page->private` will no longer need to be set to the MFN.
> 
> This is similar to the approach used in the classic kernel.

It is worth pointing out that if/when non-DMA-coherent devices start
appearing in x86-land, this solution won't suffice.


> Identifying foreign pages
> -------------------------
> 
> A new page flag is introduced: PG_foreign.  This will alias PG_pinned
> so it does not require an additional bit.
> 
> If PG_foreign is set then `page->private` contains the grant reference
> and domid for this foreign page.  This information can only be packed
> into an unsigned long on 64-bit platforms.  32-bit platforms will have
> to allocate an additional structure to store the domid and gref.
> 
> The aliasing of PG_foreign and PG_pinned is safe because:
> 
> - Page table pages will never be foreign.
> - Foreign pages shall have `p2m[P] & FOREIGN_FRAME_BIT`.
> 
> The use of the private field is safe because:
> 
> - The page is allocated by the balloon driver and thus it owns the
>   private field.
> 
> - The other fields in the union (ptl, slab_cache, and first_page) will
>   not be used because the page is not used in a page table, slab or
>   compound page.
> 
> Netback can thus:
> 
> 1. Test PG_foreign.
> 2. Verify that the page is foreign via the p2m.
> 3. Extract the domid and gref from page->private.
> 
> The PG_foreign test is not strictly necessary as the p2m lookup is
> sufficient, but it should be quicker for non-foreign pages.
> 
> 
> Userspace grant performance
> ---------------------------
> 
> Since the m2p_override table will be removed, the gntdev device may
> easy batch the grant map and unmap hypercalls that update the kernel
> mappings.
> 
> The use of the scratch pages on unmap will be unnecessary and can be
> removed.
> 
> Other improvements that may be considered are:
> 
> - Batch the userspace and kernel map and unmap.
> 
> - Lazily map grants into userspace on faults.  For applications that
>   do not access the foreign frames by the userspace mappings (such as
>   block backends using direct I/O) this would avoid a set of maps and
>   unmaps. This lazy mode would have to be requested by the userspace
>   program (since faulting many pages would be much more expensive than
>   a single batched map).
> 

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-13 16:43 ` Stefano Stabellini
@ 2014-10-13 17:22   ` David Vrabel
  0 siblings, 0 replies; 11+ messages in thread
From: David Vrabel @ 2014-10-13 17:22 UTC (permalink / raw)
  To: Stefano Stabellini, David Vrabel; +Cc: Xen-devel@lists.xen.org

On 13/10/14 17:43, Stefano Stabellini wrote:
> On Mon, 13 Oct 2014, David Vrabel wrote:
>> Grant mapping in the Linux kernel has a number of problems:
>>
[...]
> This is a very good summary of the issues we are currently having with
> Xen support in Linux. As such, I think I should add one that is missing
> from the list, but good to keep in mind. I should point out that I am
> not asking you to do anything about it at the moment.
> 
> 
> dma_ops.unmap_page and dma_ops.unmap_sg only pass dma addresses as arguments
> ----------------------------------------------------------------------------
> 
> The Linux dma_map_ops API consists of a number of functions that only
> provide the dma address of the dma request as argument, not the struct
> page or the physical address. For example unmap_page and unmap_sg.
> 
> For Xen PV guests the dma address is a machine address. If the machine
> address corresponds to a foreign page (granted to the current domain),
> there is no easy way for us to retrieve the corresponding struct page or
> guest physical address (other than the m2p_override with all its
> problems).
> 
> This is a serious limitation, in particular if we need to do any
> operations on the memory region at the time one of these functions are
> called:
> - on x86 fortunately we don't need to do anything;
> - on ARM, if the device is not dma coherent, we might have to issue cache
> maintenance operations.

I can add this section but it would be even better with a solution for
the "Design" section.

>> Userspace address to page translation
>> -------------------------------------
>>
>> The m2p_override table shall be removed.
>>
>> Each VMA (struct vm_struct) shall contain an additional pointer to an
>> optional array of pages.  This array shall be sized to cover the full
>> extent of the VMA.
>>
>> The gntdev driver populates this array with the relevant pages for the
>> foreign mappings as they are mapped.  It shall also clear them when
>> unmapping.  The gntdev driver must ensure it properly splits the page
>> array when the VMA itself is split.
>>
>> Since the m2p lookup will not return a local PFN, the native
>> get_user_pages_fast() call will fail.  Prior to attempting to fault in
>> the pages, get_user_pages() can simply look up the pages in the VMA's
>> page array.
>>
>> `page->private` will no longer need to be set to the MFN.
>>
>> This is similar to the approach used in the classic kernel.
> 
> It is worth pointing out that if/when non dma coherent devices are going
> to start appearing in x86-land, this solution won't suffice.

It's not trying to solve that problem.

If non-coherent devices ever become a problem on x86 I would probably
extend the DMA API with map/unmap functions that return/accept handles
and update the drivers for the non-coherent devices to use the new
functions.

David

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-13 13:41 Linux grant map/unmap improvement proposal (Draft B) David Vrabel
  2014-10-13 16:43 ` Stefano Stabellini
@ 2014-10-14 10:27 ` Ian Campbell
  2014-10-14 10:32   ` David Vrabel
  2014-10-15 17:45 ` Zoltan Kiss
  2014-12-18 17:55 ` David Vrabel
  3 siblings, 1 reply; 11+ messages in thread
From: Ian Campbell @ 2014-10-14 10:27 UTC (permalink / raw)
  To: David Vrabel; +Cc: Xen-devel@lists.xen.org

On Mon, 2014-10-13 at 14:41 +0100, David Vrabel wrote:
> Safe grant unmap
> ----------------
> 
> Grant references will only be unmapped when they are no longer in use.
> i.e., the page reference count is one.
> 
>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>         struct gnttab_unmap_grant_ref *kunmap_ops,
>         struct page **pages, unsigned int count,
>         void (*done)(void *data), void *data);
> 
> The `gnttab_unmap_refs_async()` function will unmap the grant
> references using the supplied unmap operations and call `done(data)`.
> The grant unmap will only be done once all pages are no longer in use.
> 
> It shall run synchronously on the first attempt (this is expected to
> be the most common case).  If any page is in use, it shall queue the
> unmap request to be tried at a later time.
> 
> Only the blkback and gntdev devices need to use asynchronouse unmaps.

What about storage over networking? Does this work for that case too? I
suppose that would just manifest as >1 reference counts when the blk op
finishes, which would be taken care of by the delay.

Ian.

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-14 10:27 ` Ian Campbell
@ 2014-10-14 10:32   ` David Vrabel
  2014-10-14 10:35     ` Ian Campbell
  0 siblings, 1 reply; 11+ messages in thread
From: David Vrabel @ 2014-10-14 10:32 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xen-devel@lists.xen.org

On 14/10/14 11:27, Ian Campbell wrote:
> On Mon, 2014-10-13 at 14:41 +0100, David Vrabel wrote:
>> Safe grant unmap
>> ----------------
>>
>> Grant references will only be unmapped when they are no longer in use.
>> i.e., the page reference count is one.
>>
>>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>>         struct gnttab_unmap_grant_ref *kunmap_ops,
>>         struct page **pages, unsigned int count,
>>         void (*done)(void *data), void *data);
>>
>> The `gnttab_unmap_refs_async()` function will unmap the grant
>> references using the supplied unmap operations and call `done(data)`.
>> The grant unmap will only be done once all pages are no longer in use.
>>
>> It shall run synchronously on the first attempt (this is expected to
>> be the most common case).  If any page is in use, it shall queue the
>> unmap request to be tried at a later time.
>>
>> Only the blkback and gntdev devices need to use asynchronouse unmaps.
> 
> What about storage over networking? Does this work for that case too? I
> suppose that would just manifest as >1 reference counts when the blk op
> finishes, which would be taken care of by the delay.

I'm not sure I follow what use case you're talking about here.  If the
guest is using NFS or iSCSI or similar, then netback just sees ethernet
packets and doesn't need to distinguish between different types of
network traffic from the guest.

David

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-14 10:32   ` David Vrabel
@ 2014-10-14 10:35     ` Ian Campbell
  2014-10-14 12:49       ` David Vrabel
  0 siblings, 1 reply; 11+ messages in thread
From: Ian Campbell @ 2014-10-14 10:35 UTC (permalink / raw)
  To: David Vrabel; +Cc: Xen-devel@lists.xen.org

On Tue, 2014-10-14 at 11:32 +0100, David Vrabel wrote:
> On 14/10/14 11:27, Ian Campbell wrote:
> > On Mon, 2014-10-13 at 14:41 +0100, David Vrabel wrote:
> >> Safe grant unmap
> >> ----------------
> >>
> >> Grant references will only be unmapped when they are no longer in use.
> >> i.e., the page reference count is one.
> >>
> >>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
> >>         struct gnttab_unmap_grant_ref *kunmap_ops,
> >>         struct page **pages, unsigned int count,
> >>         void (*done)(void *data), void *data);
> >>
> >> The `gnttab_unmap_refs_async()` function will unmap the grant
> >> references using the supplied unmap operations and call `done(data)`.
> >> The grant unmap will only be done once all pages are no longer in use.
> >>
> >> It shall run synchronously on the first attempt (this is expected to
> >> be the most common case).  If any page is in use, it shall queue the
> >> unmap request to be tried at a later time.
> >>
> >> Only the blkback and gntdev devices need to use asynchronouse unmaps.
> > 
> > What about storage over networking? Does this work for that case too? I
> > suppose that would just manifest as >1 reference counts when the blk op
> > finishes, which would be taken care of by the delay.
> 
> I'm not sure I follow what use case you're talking about here.  If the
> guest is using NFS or iSCSI or similar, then netback just sees ethernet
> packets and doesn't need to distinguish between different types of
> network traffic from the guest.

I meant dom0 mounted NFS/ISCSI disks (either loopback or from driver
domains) going out over either physical or virtual network interfaces.

Ian.
> 
> David

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-14 10:35     ` Ian Campbell
@ 2014-10-14 12:49       ` David Vrabel
  2014-10-14 12:59         ` Ian Campbell
  0 siblings, 1 reply; 11+ messages in thread
From: David Vrabel @ 2014-10-14 12:49 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xen-devel@lists.xen.org

On 14/10/14 11:35, Ian Campbell wrote:
> On Tue, 2014-10-14 at 11:32 +0100, David Vrabel wrote:
>> On 14/10/14 11:27, Ian Campbell wrote:
>>> On Mon, 2014-10-13 at 14:41 +0100, David Vrabel wrote:
>>>> Safe grant unmap
>>>> ----------------
>>>>
>>>> Grant references will only be unmapped when they are no longer in use.
>>>> i.e., the page reference count is one.
>>>>
>>>>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>>>>         struct gnttab_unmap_grant_ref *kunmap_ops,
>>>>         struct page **pages, unsigned int count,
>>>>         void (*done)(void *data), void *data);
>>>>
>>>> The `gnttab_unmap_refs_async()` function will unmap the grant
>>>> references using the supplied unmap operations and call `done(data)`.
>>>> The grant unmap will only be done once all pages are no longer in use.
>>>>
>>>> It shall run synchronously on the first attempt (this is expected to
>>>> be the most common case).  If any page is in use, it shall queue the
>>>> unmap request to be tried at a later time.
>>>>
>>>> Only the blkback and gntdev devices need to use asynchronouse unmaps.
>>>
>>> What about storage over networking? Does this work for that case too? I
>>> suppose that would just manifest as >1 reference counts when the blk op
>>> finishes, which would be taken care of by the delay.
>>
>> I'm not sure I follow what use case you're talking about here.  If the
>> guest is using NFS or iSCSI or similar, then netback just sees ethernet
>> packets and doesn't need to distinguish between different types of
>> network traffic from the guest.
> 
> I meant dom0 mounted NFS/ISCSI disks (either loopback or from driver
> domains) going out over either physical or virtual network interfaces.

I'm still confused.  Is this not the use case I describe in the "Blkback
and network storage" section?  Whether the retransmitted packet is sent
via a physical NIC or a virtual one doesn't matter.

David

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-14 12:49       ` David Vrabel
@ 2014-10-14 12:59         ` Ian Campbell
  0 siblings, 0 replies; 11+ messages in thread
From: Ian Campbell @ 2014-10-14 12:59 UTC (permalink / raw)
  To: David Vrabel; +Cc: Xen-devel@lists.xen.org

On Tue, 2014-10-14 at 13:49 +0100, David Vrabel wrote:
> On 14/10/14 11:35, Ian Campbell wrote:
> > On Tue, 2014-10-14 at 11:32 +0100, David Vrabel wrote:
> >> On 14/10/14 11:27, Ian Campbell wrote:
> >>> On Mon, 2014-10-13 at 14:41 +0100, David Vrabel wrote:
> >>>> Safe grant unmap
> >>>> ----------------
> >>>>
> >>>> Grant references will only be unmapped when they are no longer in use.
> >>>> i.e., the page reference count is one.
> >>>>
> >>>>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
> >>>>         struct gnttab_unmap_grant_ref *kunmap_ops,
> >>>>         struct page **pages, unsigned int count,
> >>>>         void (*done)(void *data), void *data);
> >>>>
> >>>> The `gnttab_unmap_refs_async()` function will unmap the grant
> >>>> references using the supplied unmap operations and call `done(data)`.
> >>>> The grant unmap will only be done once all pages are no longer in use.
> >>>>
> >>>> It shall run synchronously on the first attempt (this is expected to
> >>>> be the most common case).  If any page is in use, it shall queue the
> >>>> unmap request to be tried at a later time.
> >>>>
> >>>> Only the blkback and gntdev devices need to use asynchronouse unmaps.
> >>>
> >>> What about storage over networking? Does this work for that case too? I
> >>> suppose that would just manifest as >1 reference counts when the blk op
> >>> finishes, which would be taken care of by the delay.
> >>
> >> I'm not sure I follow what use case you're talking about here.  If the
> >> guest is using NFS or iSCSI or similar, then netback just sees ethernet
> >> packets and doesn't need to distinguish between different types of
> >> network traffic from the guest.
> > 
> > I meant dom0 mounted NFS/ISCSI disks (either loopback or from driver
> > domains) going out over either physical or virtual network interfaces.
> 
> I'm still confused.  Is this not the use case I describe in the "Blkback
> and network storage" section?  Whether the retransmitted packet is sent
> via a physical NIC or a virtual one doesn't matter.

Ah yes, but that was in the "problems" not the "solutions" section,
wasn't it?

So my question is ultimately: is this safe unmap functionality intended
to address the problem introduced in "Blkback and network storage"?
Sounds like the answer is yes.

> 
> David

* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-13 13:41 Linux grant map/unmap improvement proposal (Draft B) David Vrabel
  2014-10-13 16:43 ` Stefano Stabellini
  2014-10-14 10:27 ` Ian Campbell
@ 2014-10-15 17:45 ` Zoltan Kiss
  2014-10-16 15:54   ` David Vrabel
  2014-12-18 17:55 ` David Vrabel
  3 siblings, 1 reply; 11+ messages in thread
From: Zoltan Kiss @ 2014-10-15 17:45 UTC (permalink / raw)
  To: David Vrabel, Xen-devel@lists.xen.org



On 13/10/2014 14:41, David Vrabel wrote:
[...]
> Packets with foreign pages from other sources cannot be successfully
> copied, since netback does not know the grant reference.  Once such
"... One such"
> configuration is a VM providing an iSCSI or other network-based
> storage that presents a block device in the backend that is then used
> by another VM on the same host.
If the packet coming from the storage target VM is delivered to L3 in 
Dom0's stack, the foreign pages will be swapped out with local copies. 
That's a feature of the zerocopy framework used by netback, mostly due 
to fears that strange things can happen in and above the IP layer.
So unless the storage backend in Dom0 implements its own TCP/IP stack and
uses the vifX.Y device directly, it probably won't see foreign frames
from the storage target. Of course it wouldn't be smart to rely on this
in the long term; it would be good to remove that copy.
Or do you mean the other direction, when the guest using this storage
writes to it, and that data is mapped by the block backend and used to
construct an SKB? (By the time I finished the sentence I realized you
meant this scenario, but I leave the above comments just for the sake of
clarification.)

>
> Blkback and network storage
> ---------------------------
>
> Blkback unmaps the foreign pages in an I/O request when the request is
> completed.  If networked storage is used it is possible for requests
> to be completed while the skbs referring to those pages are still
> queued for transmit (e.g., because a retransmission was queued while
> the response to the original packet was in flight).
>
> When the network driver attempts to send the packet with the unmapped
> page it may:
>
> - Fault while trying to access the unmapped page.
>
> - Transmit from a frame that is no longer granted (potentially
>    transmitting sensitive guest or Xen data).
>
> The fault does not occur with userspace storage backends since gntdev
> replaces the foreign mapping with one to a local scratch page.  It
> uses GNTTABOP_unmap_and_replace, which atomically replaces the foreign
> mapping with another (source) mapping.  However, this cannot be used
> with batched operations since it clears the source mapping, and it does
> not prevent transmitting from a non-granted frame.
>
>
>
>
> Safe grant unmap
> ----------------
>
> Grant references will only be unmapped when they are no longer in use.
> i.e., the page reference count is one.
>
>      int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>          struct gnttab_unmap_grant_ref *kunmap_ops,
>          struct page **pages, unsigned int count,
>          void (*done)(void *data), void *data);
>
> The `gnttab_unmap_refs_async()` function will unmap the grant
> references using the supplied unmap operations and call `done(data)`.
> The grant unmap will only be done once all pages are no longer in use.
I'm a bit confused about this function. I guess it checks the refcount
before unmap. But then what does the done(data) function do?
>
> It shall run synchronously on the first attempt (this is expected to
> be the most common case).  If any page is in use, it shall queue the
> unmap request to be tried at a later time.
Who will own this queue? The caller (e.g. blkback)? How often should it
retry? Is that retry triggered by a timer?
>
> Only the blkback and gntdev devices need to use asynchronous unmaps.
>
[...]

>
> Identifying foreign pages
> -------------------------
>
> A new page flag is introduced: PG_foreign.  This will alias PG_pinned
> so it does not require an additional bit.
>
> If PG_foreign is set then `page->private` contains the grant reference
> and domid for this foreign page.  This information can only be packed
> into an unsigned long on 64-bit platforms.  32-bit platforms will have
> to allocate an additional structure to store the domid and gref.
>
> The aliasing of PG_foreign and PG_pinned is safe because:
>
> - Page table pages will never be foreign.
> - Foreign pages shall have `p2m[P] & FOREIGN_FRAME_BIT`.
>
> The use of the private field is safe because:
>
> - The page is allocated by the balloon driver and thus it owns the
>    private field.
>
> - The other fields in the union (ptl, slab_cache, and first_page) will
>    not be used because the page is not used in a page table, slab or
>    compound page.
>
This flag sounds similar to the flag used in classic for netback grant
mapping. Would it be accepted upstream? Would aliasing PG_pinned make
sure of that?
> Netback can thus:
>
> 1. Test PG_foreign.
> 2. Verify that the page is foreign via the p2m.
> 3. Extract the domid and gref from page->private.
>
> The PG_foreign test is not strictly necessary as the p2m lookup is
> sufficient, but it should be quicker for non-foreign pages.
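
As a rough illustration of the packing constraint described above: a domid is 16 bits and a grant reference is 32 bits in Xen's public headers, so the pair needs 48 bits and only fits in an unsigned long on 64-bit platforms. The sketch below is an assumption, not the proposed implementation; the bit layout and the `foreign_*` helper names are hypothetical, and a `uint64_t` stands in for `page->private`.

```c
#include <assert.h>
#include <stdint.h>

/* Same widths as the types in Xen's public headers. */
typedef uint16_t domid_t;
typedef uint32_t grant_ref_t;

/*
 * Hypothetical layout (an assumption, not taken from the proposal):
 * gref in bits 0-31, domid in bits 32-47.  A uint64_t models a 64-bit
 * page->private; 48 bits cannot fit in a 32-bit unsigned long.
 */
static inline uint64_t foreign_pack(domid_t domid, grant_ref_t gref)
{
    return ((uint64_t)domid << 32) | (uint64_t)gref;
}

/* Recover the domid from the packed value. */
static inline domid_t foreign_domid(uint64_t priv)
{
    return (domid_t)(priv >> 32);
}

/* Recover the grant reference from the packed value. */
static inline grant_ref_t foreign_gref(uint64_t priv)
{
    return (grant_ref_t)(priv & 0xffffffffu);
}
```

With such helpers, step 3 of the list above would reduce to two extractions from the private field once the PG_foreign and p2m checks have passed.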


* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-15 17:45 ` Zoltan Kiss
@ 2014-10-16 15:54   ` David Vrabel
  0 siblings, 0 replies; 11+ messages in thread
From: David Vrabel @ 2014-10-16 15:54 UTC (permalink / raw)
  To: Zoltan Kiss, Xen-devel@lists.xen.org

On 15/10/14 18:45, Zoltan Kiss wrote:
>> Safe grant unmap
>> ----------------
>>
>> Grant references will only be unmapped when they are no longer in use.
>> i.e., the page reference count is one.
>>
>>      int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>>          struct gnttab_unmap_grant_ref *kunmap_ops,
>>          struct page **pages, unsigned int count,
>>          void (*done)(void *data), void *data);
>>
>> The `gnttab_unmap_refs_async()` function will unmap the grant
>> references using the supplied unmap operations and call `done(data)`.
>> The grant unmap will only be done once all pages are no longer in use.
> I'm a bit confused about this function. I guess it checks the refcount
> before unmap. But then what does the done(data) function do?

It's needed to call back into the calling driver once the unmap is
eventually complete, so it can complete the request etc.

>> It shall run synchronously on the first attempt (this is expected to
>> be the most common case).  If any page is in use, it shall queue the
>> unmap request to be tried at a later time.
>
> Who will own this queue? The caller (e.g. blkback)? How often should it
> retry? Is that retry triggered by a timer?

It will be common and managed by the core grant table subsystem/module.

The retry will be a simple timer.  This is expected to be such a rare
event that anything more sophisticated isn't necessary.
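
The try-then-defer behaviour discussed here can be modelled in plain userspace C. This is only a sketch under stated assumptions: `model_page`, `model_unmap_req` and `model_try_unmap()` are hypothetical stand-ins, the grant unmap itself is reduced to a comment, and the timer is represented by the caller simply invoking the attempt again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the reference count matters. */
struct model_page {
    int refcount;               /* 1 == only the grant mapper holds a reference */
};

/* Hypothetical deferred-unmap request; not the real kernel structure. */
struct model_unmap_req {
    struct model_page **pages;
    unsigned int count;
    void (*done)(void *data);   /* completion callback into the calling driver */
    void *data;
    bool queued;                /* true while the request waits for a retry */
};

/* A request may proceed only when every page has a reference count of one. */
static bool model_pages_idle(const struct model_unmap_req *req)
{
    for (unsigned int i = 0; i < req->count; i++)
        if (req->pages[i]->refcount != 1)
            return false;
    return true;
}

/*
 * One unmap attempt.  Returns true if the unmap ran and done(data) was
 * called; otherwise marks the request as queued so that a later retry
 * (a timer, in the proposal) can call this function again.
 */
static bool model_try_unmap(struct model_unmap_req *req)
{
    assert(req->done != NULL);
    if (!model_pages_idle(req)) {
        req->queued = true;
        return false;
    }
    /* The real code would issue the grant unmap operations here. */
    req->queued = false;
    req->done(req->data);
    return true;
}

/* Example completion callback: counts how many requests have completed. */
static void model_count_done(void *data)
{
    (*(int *)data)++;
}
```

In this model the first call corresponds to the synchronous attempt, and each later call corresponds to one firing of the retry timer; `done(data)` runs exactly once, on the attempt that finds all pages idle.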

>> Identifying foreign pages
>> -------------------------
>>
>> A new page flag is introduced: PG_foreign.  This will alias PG_pinned
>> so it does not require an additional bit.
>>
>> If PG_foreign is set then `page->private` contains the grant reference
>> and domid for this foreign page.  This information can only be packed
>> into an unsigned long on 64-bit platforms.  32-bit platforms will have
>> to allocate an additional structure to store the domid and gref.
>>
>> The aliasing of PG_foreign and PG_pinned is safe because:
>>
>> - Page table pages will never be foreign.
>> - Foreign pages shall have `p2m[P] & FOREIGN_FRAME_BIT`.
>>
>> The use of the private field is safe because:
>>
>> - The page is allocated by the balloon driver and thus it owns the
>>    private field.
>>
>> - The other fields in the union (ptl, slab_cache, and first_page) will
>>    not be used because the page is not used in a page table, slab or
>>    compound page.
>>
> This flag sounds similar to the flag used in classic for netback grant
> mapping. Would it be accepted upstream? Would aliasing PG_pinned make
> sure of that?

This should be the case.

David


* Re: Linux grant map/unmap improvement proposal (Draft B)
  2014-10-13 13:41 Linux grant map/unmap improvement proposal (Draft B) David Vrabel
                   ` (2 preceding siblings ...)
  2014-10-15 17:45 ` Zoltan Kiss
@ 2014-12-18 17:55 ` David Vrabel
  3 siblings, 0 replies; 11+ messages in thread
From: David Vrabel @ 2014-12-18 17:55 UTC (permalink / raw)
  To: David Vrabel, Xen-devel@lists.xen.org; +Cc: Jennifer Herbert

On 13/10/14 14:41, David Vrabel wrote:
> 
> Design
> ======

Jennifer has put together most of the initial implementation of this so
expect a full series some time next year.

It didn't quite end up as described here.

> Userspace address to page translation
> -------------------------------------
> 
> The m2p_override table shall be removed.
> 
> Each VMA (struct vm_area_struct) shall contain an additional pointer to an
> optional array of pages.  This array shall be sized to cover the full
> extent of the VMA.
> 
> The gntdev driver populates this array with the relevant pages for the
> foreign mappings as they are mapped.  It shall also clear them when
> unmapping.  The gntdev driver must ensure it properly splits the page
> array when the VMA itself is split.
> 
> Since the m2p lookup will not return a local PFN, the native
> get_user_pages_fast() call will fail.  Prior to attempting to fault in
> the pages, get_user_pages() can simply look up the pages in the VMA's
> page array.

This was not true.  Instead, we mark the userspace PTEs as special
(_PAGE_SPECIAL set) which causes the generic x86 code to skip the fast path.

We also changed vm_normal_page() to look in vma->pages which puts the
extra code outside of any common use case (i.e., away from any handling
of non-special mappings), further reducing the impact on existing use cases.

For the curious, the 3-liner mm/memory.c change is below (this does not
handle VMA splitting yet, but that should be straightforward).
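
The per-VMA lookup amounts to indexing the page array by the page offset of the address within the VMA. A minimal userspace sketch of that arithmetic follows; `model_vma`, `model_special_page()` and `MODEL_PAGE_SHIFT` are hypothetical names, 4 KiB pages are assumed, and the split helper is only a guess at how the array could be divided when a VMA is split.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12     /* assumes 4 KiB pages */

/* Hypothetical stand-in for the fields of a VMA that matter here. */
struct model_vma {
    unsigned long vm_start;     /* first address covered by the VMA */
    unsigned long vm_end;       /* one past the last address covered */
    void **pages;               /* optional array, one entry per page */
};

/*
 * Look up the page backing addr, mirroring the index computation
 * (addr - vm_start) >> PAGE_SHIFT.  Returns NULL if the VMA has no array.
 */
static void *model_special_page(const struct model_vma *vma, unsigned long addr)
{
    if (!vma->pages)
        return NULL;
    assert(addr >= vma->vm_start && addr < vma->vm_end);
    return vma->pages[(addr - vma->vm_start) >> MODEL_PAGE_SHIFT];
}

/*
 * Assumed split rule (not part of the posted change): the upper half of
 * a VMA split at split_addr would start its array at the entry for
 * split_addr in the original array.
 */
static void **model_split_pages(const struct model_vma *vma,
                                unsigned long split_addr)
{
    assert(split_addr > vma->vm_start && split_addr < vma->vm_end);
    return vma->pages + ((split_addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
}
```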

> Userspace grant performance
> ---------------------------
> 
> - Lazily map grants into userspace on faults.  For applications that
>   do not access the foreign frames by the userspace mappings (such as
>   block backends using direct I/O) this would avoid a set of maps and
>   unmaps. This lazy mode would have to be requested by the userspace
>   program (since faulting many pages would be much more expensive than
>   a single batched map).

This does not look possible without more invasive changes to core MM
code.  Although we can lazily fault in the mappings we still need PTEs
to allow get_user_pages() to work, so map-on-fault isn't useful.

David

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,6 +289,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	struct page	**pages;
 };

 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 4b60011..3ca13bb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -774,6 +774,8 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	if (HAVE_PTE_SPECIAL) {
 		if (likely(!pte_special(pte)))
 			goto check_pfn;
+		if (vma->pages)
+			return vma->pages[(addr - vma->vm_start) >> PAGE_SHIFT];
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
 		if (!is_zero_pfn(pfn))
-- 
1.7.10.4


Thread overview: 11+ messages
2014-10-13 13:41 Linux grant map/unmap improvement proposal (Draft B) David Vrabel
2014-10-13 16:43 ` Stefano Stabellini
2014-10-13 17:22   ` David Vrabel
2014-10-14 10:27 ` Ian Campbell
2014-10-14 10:32   ` David Vrabel
2014-10-14 10:35     ` Ian Campbell
2014-10-14 12:49       ` David Vrabel
2014-10-14 12:59         ` Ian Campbell
2014-10-15 17:45 ` Zoltan Kiss
2014-10-16 15:54   ` David Vrabel
2014-12-18 17:55 ` David Vrabel
