* [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Alexey G @ 2017-07-10 20:00 UTC (permalink / raw)
To: xen-devel, Anthony PERARD, Stefano Stabellini
Under certain circumstances normal xen-mapcache functioning may be broken
by the guest's actions. This may lead either to QEMU calling exit() after
catching a bad pointer (and with the QEMU process gone the guest domain
simply appears hung afterwards) or to actual use of the incorrect pointer
inside QEMU's address space -- a write to unmapped memory is possible. The
bug is hard to reproduce on an i440 machine, as multiple DMA sources are
required (though it is possible in theory, using multiple emulated
devices), but it can be reproduced fairly easily on a Q35 machine using an
emulated AHCI controller -- each NCQ command slot may be used as an
independent DMA source, e.g. using the READ FPDMA QUEUED command, so a
single storage device on an AHCI controller port is enough to produce
multiple DMAs (up to 32). A detailed description of the issue follows.
Xen-mapcache provides the ability to map parts of guest memory into
QEMU's own address space so QEMU can work with them.
There are two types of cache lookups:
- translating a guest physical address into a pointer in QEMU's address
space, mapping a part of guest domain memory if necessary (while trying
to keep the number of such (re)mappings to a minimum)
- translating a QEMU pointer back to its physical address in guest RAM
These lookups are managed via two linked lists of structures.
MapCacheEntry is used for forward cache lookups, while MapCacheRev is used
for reverse lookups.
Every guest physical address is broken down into 2 parts:
address_index = phys_addr >> MCACHE_BUCKET_SHIFT;
address_offset = phys_addr & (MCACHE_BUCKET_SIZE - 1);
MCACHE_BUCKET_SHIFT depends on the host system (32/64-bit) and is equal to
20 on a 64-bit system (which is assumed for the rest of this description).
Basically, this means that we deal with 1MB chunks and offsets within
those 1MB chunks. All mappings are created with 1MB granularity, i.e.
1MB/2MB/3MB etc. Most DMA transfers are typically smaller than 1MB;
however, if the transfer crosses any 1MB border(s), the nearest larger
mapping size will be used, so e.g. a 512-byte DMA transfer with the start
address 700FFF80h will actually require a 2MB range.
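As an illustration, here is a minimal standalone sketch of this bucket
arithmetic for the 512-byte example above (assuming MCACHE_BUCKET_SHIFT ==
20; it mirrors the rounding done by xen_map_cache_unlocked(), but is not
the actual QEMU code):

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  #define MCACHE_BUCKET_SHIFT 20
  #define MCACHE_BUCKET_SIZE  (1ULL << MCACHE_BUCKET_SHIFT)

  int main(void)
  {
      uint64_t phys_addr = 0x700FFF80;   /* start of the 512-byte DMA */
      uint64_t size = 512;

      uint64_t address_index  = phys_addr >> MCACHE_BUCKET_SHIFT;
      uint64_t address_offset = phys_addr & (MCACHE_BUCKET_SIZE - 1);

      /* round the mapping up to a whole number of 1MB buckets */
      uint64_t cache_size = size + address_offset;
      if (cache_size % MCACHE_BUCKET_SIZE) {
          cache_size += MCACHE_BUCKET_SIZE - (cache_size % MCACHE_BUCKET_SIZE);
      }

      /* prints: index=0x700 offset=0xfff80 mapping=2 MB */
      printf("index=%#" PRIx64 " offset=%#" PRIx64 " mapping=%" PRIu64 " MB\n",
             address_index, address_offset, cache_size >> MCACHE_BUCKET_SHIFT);
      return 0;
  }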
The current implementation assumes that MapCacheEntries are unique for a
given address_index and size pair and that a single MapCacheEntry may be
reused by multiple requests -- in this case the 'lock' field will be
larger than 1. On the other hand, each requested guest physical address
(with the 'lock' flag) is described by its own MapCacheRev. So there may
be multiple MapCacheRev entries corresponding to a single MapCacheEntry.
The xen-mapcache code uses MapCacheRev entries to retrieve the
address_index & size pair, which in turn is used to find the related
MapCacheEntry. The 'lock' field within a MapCacheEntry structure is
actually a reference counter which tracks the number of corresponding
MapCacheRev entries.
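For reference, here is a simplified sketch of the two structures (the
field names follow hw/i386/xen/xen-mapcache.c of this period, but some
details are trimmed -- e.g. the real MapCacheRev records are kept on a
QTAILQ):

  typedef uint64_t hwaddr;          /* as in QEMU on this target */

  typedef struct MapCacheEntry {
      hwaddr paddr_index;           /* address_index the mapping starts at */
      uint8_t *vaddr_base;          /* where the bucket is mapped in QEMU */
      unsigned long *valid_mapping; /* bitmap of successfully mapped pages */
      uint8_t lock;                 /* refcount: number of MapCacheRev users */
      hwaddr size;                  /* mapping size, a multiple of 1MB */
      struct MapCacheEntry *next;   /* next entry in the same hash bucket */
  } MapCacheEntry;

  /* one reverse-lookup record per locked request */
  typedef struct MapCacheRev {
      uint8_t *vaddr_req;           /* pointer handed out to the caller */
      hwaddr paddr_index;           /* used together with 'size' to find */
      hwaddr size;                  /*   the owning MapCacheEntry again */
  } MapCacheRev;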
The bug lies in the guest's ability to indirectly manipulate the
xen-mapcache MapCacheEntry list via a special sequence of DMA operations,
typically on storage devices. In order to trigger the bug, the guest needs
to issue DMA operations with a specific order and timing. Although
xen-mapcache is protected by a mutex, this doesn't help here, as the bug
is not due to a race condition.
Suppose we have 3 DMA transfers, namely A, B and C, where
- transfer A crosses a 1MB border and thus uses a 2MB mapping
- transfers B and C are normal transfers within a 1MB range
- and all 3 transfers belong to the same address_index
In this case, if all these transfers are executed one-by-one
(without overlaps), no special treatment is necessary -- each transfer's
mapping lock will be set and then cleared on unmap before starting
the next transfer.
The situation changes when DMA transfers overlap in time, e.g. like this:
|===== transfer A (2MB) =====|
|===== transfer B (1MB) =====|
|===== transfer C (1MB) =====|
time --->
In this situation the following sequence of actions happens:
1. transfer A creates a mapping to a 2MB area (lock=1)
2. transfer B (1MB) tries to find an available mapping but cannot find one
because transfer A is still in progress, and its entry has a 2MB size plus
a non-zero lock. So transfer B creates another mapping -- same
address_index, but 1MB size.
3. transfer A completes, making the 1st mapping entry available by setting
its lock to 0
4. transfer C starts, tries to find an available mapping entry, and sees
that the 1st entry has lock=0, so it uses this entry but remaps the
mapping to a 1MB size
5. transfer B completes, and by this time
- there are two locked entries in the MapCacheEntry list with the SAME
values for both address_index and size
- the entry for transfer B actually resides farther down the list while
transfer C's entry is first
6. xen_ram_addr_from_mapcache() for transfer B gets the correct
address_index and size pair from the corresponding MapCacheRev entry, but
then it starts looking for a MapCacheEntry with these values and finds the
first entry -- which belongs to transfer C (see the sketch below).
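The lookup in question, simplified from xen_ram_addr_from_mapcache() and
using the structures sketched earlier (error handling omitted; the helper
shape is illustrative, not the exact QEMU code):

  /* Find the MapCacheEntry for the (paddr_index, size) pair taken from a
   * MapCacheRev record. The first match wins, so when two locked entries
   * share the same pair (transfers B and C above), transfer B's pointer
   * gets resolved against transfer C's vaddr_base. */
  static MapCacheEntry *find_entry_sketch(MapCacheEntry *bucket_head,
                                          hwaddr paddr_index, hwaddr size)
  {
      MapCacheEntry *entry = bucket_head;

      while (entry &&
             (entry->paddr_index != paddr_index || entry->size != size)) {
          entry = entry->next;
      }
      return entry;   /* may well be the wrong duplicate */
  }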
At this point there are the following possible (bad) consequences:
1. xen_ram_addr_from_mapcache() will use a wrong entry->vaddr_base value
in this statement:
raddr = (reventry->paddr_index << MCACHE_BUCKET_SHIFT) +
((unsigned long) ptr - (unsigned long) entry->vaddr_base);
resulting in an incorrect raddr value being returned from the function.
The (ptr - entry->vaddr_base) expression may produce both positive and
negative numbers, and its actual value may vary greatly as many map/unmap
operations take place. If the value falls beyond the guest RAM limits, a
"Bad RAM offset" error will be triggered and logged, followed by exit()
in QEMU.
2. If the raddr value does not exceed the guest RAM boundaries, the same
sequence of actions will be performed by xen_invalidate_map_cache_entry()
on DMA unmap, resulting in a wrong MapCacheEntry being unmapped while the
DMA operation which uses it is still active. The above example must be
extended by one more DMA transfer in order to allow unmapping, as the
first mapping in the list is sort of resident.
The patch modifies the way in which MapCacheEntries are added to the
list, avoiding duplicates.
Signed-off-by: Alexey Gerasimenko <x1917x@gmail.com>
---
hw/i386/xen/xen-mapcache.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index e60156c..84f25ef 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -206,6 +206,7 @@ static uint8_t *xen_map_cache_unlocked(hwaddr phys_addr, hwaddr size,
uint8_t lock, bool dma)
{
MapCacheEntry *entry, *pentry = NULL;
+ MapCacheEntry *avl_entry = NULL, *avl_entry_prev = NULL;
hwaddr address_index;
hwaddr address_offset;
hwaddr cache_size = size;
@@ -251,14 +252,36 @@ tryagain:
entry = &mapcache->entry[address_index % mapcache->nr_buckets];
- while (entry && entry->lock && entry->vaddr_base &&
- (entry->paddr_index != address_index || entry->size != cache_size ||
- !test_bits(address_offset >> XC_PAGE_SHIFT,
- test_bit_size >> XC_PAGE_SHIFT,
- entry->valid_mapping))) {
+ /* find a remappable entry. An existing locked entry which can be reused
+ * has a priority over all other entries (with lock=0, etc).
+ * Normally there will be just few entries for a given address_index
+ * bucket, typically 1-2 entries only
+ */
+ while (entry) {
+ if (entry->lock &&
+ entry->paddr_index == address_index &&
+ entry->size == cache_size &&
+ test_bits(address_offset >> XC_PAGE_SHIFT,
+ test_bit_size >> XC_PAGE_SHIFT, entry->valid_mapping)) {
+ break;
+ }
+ else if (!entry->lock || !entry->vaddr_base) {
+ avl_entry = entry;
+ avl_entry_prev = pentry;
+ }
+
pentry = entry;
entry = entry->next;
}
+
+ /* if the reuseable entry was not found, use any available.
+ * Otherwise, a new entry will be created
+ */
+ if (avl_entry && !entry) {
+ pentry = avl_entry_prev;
+ entry = avl_entry;
+ }
+
if (!entry) {
entry = g_malloc0(sizeof (MapCacheEntry));
pentry->next = entry;
--
2.1.4
* Re: [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Stefano Stabellini @ 2017-07-18 22:17 UTC (permalink / raw)
To: Alexey G; +Cc: Anthony PERARD, james.mckenzie, Stefano Stabellini, xen-devel
On Tue, 11 Jul 2017, Alexey G wrote:
> Under certain circumstances normal xen-mapcache functioning may be broken
> by guest's actions. This may lead to either QEMU performing exit() due to
> a caught bad pointer (and with QEMU process gone the guest domain simply
> appears hung afterwards) or actual use of the incorrect pointer inside
> QEMU address space -- a write to unmapped memory is possible. The bug is
> hard to reproduce on a i440 machine as multiple DMA sources are required
> (though it's possible in theory, using multiple emulated devices), but can
> be reproduced somewhat easily on a Q35 machine using an emulated AHCI
> controller -- each NCQ queue command slot may be used as an independent
> DMA source ex. using READ FPDMA QUEUED command, so a single storage
> device on the AHCI controller port will be enough to produce multiple DMAs
> (up to 32). The detailed description of the issue follows.
>
> Xen-mapcache provides an ability to map parts of a guest memory into
> QEMU's own address space to work with.
>
> There are two types of cache lookups:
> - translating a guest physical address into a pointer in QEMU's address
> space, mapping a part of guest domain memory if necessary (while trying
> to reduce a number of such (re)mappings to a minimum)
> - translating a QEMU's pointer back to its physical address in guest RAM
>
> These lookups are managed via two linked-lists of structures.
> MapCacheEntry is used for forward cache lookups, while MapCacheRev -- for
> reverse lookups.
>
> Every guest physical address is broken down into 2 parts:
> address_index = phys_addr >> MCACHE_BUCKET_SHIFT;
> address_offset = phys_addr & (MCACHE_BUCKET_SIZE - 1);
>
> MCACHE_BUCKET_SHIFT depends on a system (32/64) and is equal to 20 for
> a 64-bit system (which assumed for the further description). Basically,
> this means that we deal with 1 MB chunks and offsets within those 1 MB
> chunks. All mappings are created with 1MB-granularity, i.e. 1MB/2MB/3MB
> etc. Most DMA transfers typically are less than 1MB, however, if the
> transfer crosses any 1MB border(s) - than a nearest larger mapping size
> will be used, so ex. a 512-byte DMA transfer with the start address
> 700FFF80h will actually require a 2MB range.
>
> Current implementation assumes that MapCacheEntries are unique for a given
> address_index and size pair and that a single MapCacheEntry may be reused
> by multiple requests -- in this case the 'lock' field will be larger than
> 1. On other hand, each requested guest physical address (with 'lock' flag)
> is described by each own MapCacheRev. So there may be multiple MapCacheRev
> entries corresponding to a single MapCacheEntry. The xen-mapcache code
> uses MapCacheRev entries to retrieve the address_index & size pair which
> in turn used to find a related MapCacheEntry. The 'lock' field within
> a MapCacheEntry structure is actually a reference counter which shows
> a number of corresponding MapCacheRev entries.
>
> The bug lies in ability for the guest to indirectly manipulate with the
> xen-mapcache MapCacheEntries list via a special sequence of DMA
> operations, typically for storage devices. In order to trigger the bug,
> guest needs to issue DMA operations in specific order and timing.
> Although xen-mapcache is protected by the mutex lock -- this doesn't help
> in this case, as the bug is not due to a race condition.
>
> Suppose we have 3 DMA transfers, namely A, B and C, where
> - transfer A crosses 1MB border and thus uses a 2MB mapping
> - transfers B and C are normal transfers within 1MB range
> - and all 3 transfers belong to the same address_index
>
> In this case, if all these transfers are to be executed one-by-one
> (without overlaps), no special treatment necessary -- each transfer's
> mapping lock will be set and then cleared on unmap before starting
> the next transfer.
> The situation changes when DMA transfers overlap in time, ex. like this:
>
> |===== transfer A (2MB) =====|
>
> |===== transfer B (1MB) =====|
>
> |===== transfer C (1MB) =====|
> time --->
>
> In this situation the following sequence of actions happens:
>
> 1. transfer A creates a mapping to 2MB area (lock=1)
> 2. transfer B (1MB) tries to find available mapping but cannot find one
> because transfer A is still in progress, and it has 2MB size + non-zero
> lock. So transfer B creates another mapping -- same address_index,
> but 1MB size.
> 3. transfer A completes, making 1st mapping entry available by setting its
> lock to 0
> 4. transfer C starts and tries to find available mapping entry and sees
> that 1st entry has lock=0, so it uses this entry but remaps the mapping
> to a 1MB size
> 5. transfer B completes and by this time
> - there are two locked entries in the MapCacheEntry list with the SAME
> values for both address_index and size
> - the entry for transfer B actually resides farther in list while
> transfer C's entry is first
> 6. xen_ram_addr_from_mapcache() for transfer B gets correct address_index
> and size pair from corresponding MapCacheRev entry, but then it starts
> looking for MapCacheEntry with these values and finds the first entry
> -- which belongs to transfer C.
>
> At this point there may be following possible (bad) consequences:
>
> 1. xen_ram_addr_from_mapcache() will use a wrong entry->vaddr_base value
> in this statement:
>
> raddr = (reventry->paddr_index << MCACHE_BUCKET_SHIFT) +
> ((unsigned long) ptr - (unsigned long) entry->vaddr_base);
>
> resulting in an incorrent raddr value returned from the function. The
> (ptr - entry->vaddr_base) expression may produce both positive and negative
> numbers and its actual value may differ greatly as there are many
> map/unmap operations take place. If the value will be beyond guest RAM
> limits then a "Bad RAM offset" error will be triggered and logged,
> followed by exit() in QEMU.
>
> 2. If raddr value won't exceed guest RAM boundaries, the same sequence
> of actions will be performed for xen_invalidate_map_cache_entry() on DMA
> unmap, resulting in a wrong MapCacheEntry being unmapped while DMA
> operation which uses it is still active. The above example must
> be extended by one more DMA transfer in order to allow unmapping as the
> first mapping in the list is sort of resident.
Thanks for the well written description of the problem!
> The patch modifies the behavior in which MapCacheEntry's are added to the
> list, avoiding duplicates.
I take it that the idea is to always go through the whole list to check
for duplicate locked entries, right?
> Signed-off-by: Alexey Gerasimenko <x1917x@gmail.com>
> ---
> hw/i386/xen/xen-mapcache.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
> index e60156c..84f25ef 100644
> --- a/hw/i386/xen/xen-mapcache.c
> +++ b/hw/i386/xen/xen-mapcache.c
> @@ -206,6 +206,7 @@ static uint8_t *xen_map_cache_unlocked(hwaddr phys_addr, hwaddr size,
> uint8_t lock, bool dma)
> {
> MapCacheEntry *entry, *pentry = NULL;
> + MapCacheEntry *avl_entry = NULL, *avl_entry_prev = NULL;
> hwaddr address_index;
> hwaddr address_offset;
> hwaddr cache_size = size;
> @@ -251,14 +252,36 @@ tryagain:
>
> entry = &mapcache->entry[address_index % mapcache->nr_buckets];
>
> - while (entry && entry->lock && entry->vaddr_base &&
> - (entry->paddr_index != address_index || entry->size != cache_size ||
> - !test_bits(address_offset >> XC_PAGE_SHIFT,
> - test_bit_size >> XC_PAGE_SHIFT,
> - entry->valid_mapping))) {
> + /* find a remappable entry. An existing locked entry which can be reused
> + * has a priority over all other entries (with lock=0, etc).
> + * Normally there will be just few entries for a given address_index
> + * bucket, typically 1-2 entries only
> + */
> + while (entry) {
> + if (entry->lock &&
> + entry->paddr_index == address_index &&
> + entry->size == cache_size &&
> + test_bits(address_offset >> XC_PAGE_SHIFT,
> + test_bit_size >> XC_PAGE_SHIFT, entry->valid_mapping)) {
> + break;
> + }
> + else if (!entry->lock || !entry->vaddr_base) {
> + avl_entry = entry;
> + avl_entry_prev = pentry;
> + }
> +
> pentry = entry;
> entry = entry->next;
> }
> +
> + /* if the reuseable entry was not found, use any available.
> + * Otherwise, a new entry will be created
> + */
> + if (avl_entry && !entry) {
> + pentry = avl_entry_prev;
> + entry = avl_entry;
> + }
Yes, I think this would work, but we should make sure to scan the whole
list only when lock == true. Something like the following:
diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index 2a1fbd1..f964143 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -234,7 +234,8 @@ static void xen_remap_bucket(MapCacheEntry *entry,
static uint8_t *xen_map_cache_unlocked(hwaddr phys_addr, hwaddr size,
uint8_t lock, bool dma)
{
- MapCacheEntry *entry, *pentry = NULL;
+ MapCacheEntry *entry, *pentry = NULL,
+ *free_entry = NULL, *free_pentry = NULL;
hwaddr address_index;
hwaddr address_offset;
hwaddr cache_size = size;
@@ -281,14 +282,22 @@ tryagain:
entry = &mapcache->entry[address_index % mapcache->nr_buckets];
- while (entry && entry->lock && entry->vaddr_base &&
+ while (entry && (lock || entry->lock) && entry->vaddr_base &&
(entry->paddr_index != address_index || entry->size != cache_size ||
!test_bits(address_offset >> XC_PAGE_SHIFT,
test_bit_size >> XC_PAGE_SHIFT,
entry->valid_mapping))) {
+ if (!free_entry && !entry->lock) {
+ free_entry = entry;
+ free_pentry = pentry;
+ }
pentry = entry;
entry = entry->next;
}
+ if (!entry && free_entry) {
+ entry = free_entry;
+ pentry = free_pentry;
+ }
if (!entry) {
entry = g_malloc0(sizeof (MapCacheEntry));
pentry->next = entry;
Would this work?
* Re: [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Alexey G @ 2017-07-19 9:06 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Anthony PERARD, james.mckenzie, xen-devel
Stefano,
On Tue, 18 Jul 2017 15:17:25 -0700 (PDT)
Stefano Stabellini <sstabellini@kernel.org> wrote:
> > The patch modifies the behavior in which MapCacheEntry's are added to
> > the list, avoiding duplicates.
>
> I take that the idea is to always go through the whole list to check for
> duplicate locked entries, right?
That's a short list.
In fact, it's not easy to generate multiple linked entries in this list --
normally, entries will be added, used and then removed immediately by
xen_invalidate_map_cache(). Specific conditions are required to make the
list grow -- like simultaneous DMA operations (of different cache_size)
originating from the same address_index, or the presence of an Option ROM
mapping in the area.
So normally we deal with just 1-2 entries in the list. Even three entries
are likely to be generated only intentionally and with a bit of luck, as
it depends a lot on the host's performance/workload. Also, good cache_size
diversity is required to produce entries in the list, but we are actually
limited to only a few multiples of MCACHE_BUCKET_SIZE due to the maximum
DMA size limitations of emulated devices.
> Yes, I think this would work, but we should make sure to scan the whole
> list only when lock == true. Something like the following:
>
> - while (entry && entry->lock && entry->vaddr_base &&
> + while (entry && (lock || entry->lock) && entry->vaddr_base &&
> (entry->paddr_index != address_index || entry->size !=
> cache_size || !test_bits(address_offset >> XC_PAGE_SHIFT,
> test_bit_size >> XC_PAGE_SHIFT,
> entry->valid_mapping))) {
> + if (!free_entry && !entry->lock) {
> + free_entry = entry;
> + free_pentry = pentry;
> + }
> pentry = entry;
> entry = entry->next;
> }
>
> Would this work?
This would, but the question is whether there will be a benefit. This way
we avoid traversing the rest of the list (a few entries, if any) if we
asked for some lock=0 mapping and found such an entry before the reusable
lock=n entry. We win a few iterations of quick checks, but on the other
hand we risk having to execute xen_remap_bucket() for this entry (with a
lot of fairly slow stuff). If there was a reusable entry later in the
list, using it instead of (possibly) remapping an entry would be faster...
so there are pros and cons here.
We can use a locked entry for a "non-locked" request as it is protected by
the same (kinda suspicious) rcu_read_lock/rcu_read_unlock mechanism above.
The big question here is whether rcu_read_(un)lock is enough at all for
the underlying xen-mapcache usage -- it seems the xen-mapcache-related
code in QEMU expects the RCU read lock to work like a plain critical
section... although this needs to be checked.
One possible minor optimization for xen-mapcache would be to reuse larger
mappings for mappings of smaller cache_size. Right now the existing code
does its checks in the "entry->size == cache_size" manner, while we could
use "entry->size >= cache_size" here (see the sketch below). However, we
may end up with resident MapCacheEntries being mapped to bigger mapping
sizes than necessary and thus might need to add remapping back to the
normal size in xen_invalidate_map_cache_entry_unlocked() when there are no
other mappings.
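A rough sketch of what that relaxed check could look like inside the
search loop of xen_map_cache_unlocked() (hypothetical, not part of this
patch):

  /* hypothetical relaxation: reuse an existing larger mapping for a
   * smaller request */
  if (entry->lock &&
      entry->paddr_index == address_index &&
      entry->size >= cache_size &&       /* was: entry->size == cache_size */
      test_bits(address_offset >> XC_PAGE_SHIFT,
                test_bit_size >> XC_PAGE_SHIFT, entry->valid_mapping)) {
      break;    /* reuse this entry */
  }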
* Re: [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Stefano Stabellini @ 2017-07-19 18:00 UTC (permalink / raw)
To: Alexey G; +Cc: Anthony PERARD, james.mckenzie, Stefano Stabellini, xen-devel
On Wed, 19 Jul 2017, Alexey G wrote:
> Stefano,
>
> On Tue, 18 Jul 2017 15:17:25 -0700 (PDT)
> Stefano Stabellini <sstabellini@kernel.org> wrote:
>
> > > The patch modifies the behavior in which MapCacheEntry's are added to
> > > the list, avoiding duplicates.
> >
> > I take that the idea is to always go through the whole list to check for
> > duplicate locked entries, right?
>
> That's a short list.
>
> In fact, it's not easy to generate multiple linked entries in this list --
> normally, entries will be added, used and then removed immediately by
> xen_invalidate_map_cache(). Specific conditions are required to make the
> list grow -- like simultaneous DMA operations (of different cache_size)
> originating the same address_index or presence of the Option ROM mapping in
> the area.
>
> So normally we deal with just 1-2 entries in the list. Even three entries
> are likely to be generated only intentionally and with a bit of luck as it
> depends on host's performance/workload a lot. Also, a good cache_size
> diversity is required to produce entries in the list but we actually
> limited to only few multiplies of MCACHE_BUCKET_SIZE due to the maximum DMA
> size limitations of emulated devices.
>
> > Yes, I think this would work, but we should make sure to scan the whole
> > list only when lock == true. Something like the following:
> >
> > - while (entry && entry->lock && entry->vaddr_base &&
> > + while (entry && (lock || entry->lock) && entry->vaddr_base &&
> > (entry->paddr_index != address_index || entry->size !=
> > cache_size || !test_bits(address_offset >> XC_PAGE_SHIFT,
> > test_bit_size >> XC_PAGE_SHIFT,
> > entry->valid_mapping))) {
> > + if (!free_entry && !entry->lock) {
> > + free_entry = entry;
> > + free_pentry = pentry;
> > + }
> > pentry = entry;
> > entry = entry->next;
> > }
> >
> > Would this work?
>
> This would, but the question is if there will be a benefit. In this way we
> avoiding to traverse the rest of the list (few entries, if any) if we asked
> for some lock=0 mapping and found such entry before the reuseable lock=n
> entry. We win few iterations of quick checks, but on other hand risking to
> have to execute xen_remap_bucket() for this entry (with lot of fairly slow
> stuff). If there was a reusable entry later in the list -- using it instead
> of (possibly) remapping an entry will be faster... so it's pros and cons
> here.
My expectation is that unlocked mappings are much more frequent than
locked mappings. Also, I expect that only very rarely we'll be able to
reuse locked mappings. Over the course of a VM lifetime, it seems to me
that walking the list every time would cost more than it would benefit.
These are only "expectations", I would love to see numbers. Numbers make
for better decisions :-) Would you be up for gathering some of these
numbers? Such as how many times you get to reuse locked mappings and how
many times we walk items on the list fruitlessly?
Otherwise, would you be up for just testing the modified version of the
patch I sent to verify that solves the bug?
> We can use locked entry for "non-locked" request as it is protected by the
> same (kinda suspicious) rcu_read_lock/rcu_read_unlock mechanism above. The
> big question here is whether rcu_read_(un)lock is enough at all
> for underneath xen-mapcache usage -- seems like the xen-mapcache-related
> code in QEMU expects RCU read lock to work like a plain critical section...
> although this needs to be checked.
>
> One possible minor optimization for xen-mapcache would be to reuse larger
> mappings for mappings of lesser cache_size. Right now existing code does
> checks in the "entry->size == cache_size" manner, while we can use
> "entry->size >= cache_size" here. However, we may end up with resident
> MapCacheEntries being mapped to a bigger mapping sizes than necessary and
> thus might need to add remapping back to the normal size in
> xen_invalidate_map_cache_entry_unlocked() when there are no other mappings.
Yes, I thought about it, that would be a good improvement to have.
* Re: [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Alexey G @ 2017-07-20 5:53 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Anthony PERARD, james.mckenzie, xen-devel
On Wed, 19 Jul 2017 11:00:26 -0700 (PDT)
Stefano Stabellini <sstabellini@kernel.org> wrote:
> My expectation is that unlocked mappings are much more frequent than
> locked mappings. Also, I expect that only very rarely we'll be able to
> reuse locked mappings. Over the course of a VM lifetime, it seems to me
> that walking the list every time would cost more than it would benefit.
>
> These are only "expectations", I would love to see numbers. Numbers make
> for better decisions :-) Would you be up for gathering some of these
> numbers? Such as how many times you get to reuse locked mappings and how
> many times we walk items on the list fruitlessly?
>
> Otherwise, would you be up for just testing the modified version of the
> patch I sent to verify that solves the bug?
The numbers will show that there is a single entry in the bucket's list
most of the time. :) Even two entries are a rare encounter, typically to
be seen only when the guest performs some intensive I/O. OK, I'll collect
some real stats for different scenarios; these are interesting numbers and
might come in useful for later optimizations.
The approach you proposed is good, but it allows reusing a suitable locked
entry only when it comes first in the list (the existing behavior). But we
can actually reuse a locked entry which comes later (if any) in the list
as well. When we have the situation where a lock=0 entry comes first in
the list and a lock=1 entry is second, there is a chance the first entry
was a 2MB-type one (there must be some reason why the 2nd entry was added
to the list), so picking it for a lock=0 request might result in
xen_remap_bucket... which should be avoided. Anyway, there is no big deal
about which approach is better, as these situations are uncommon. After
all, mostly it's just a single entry in the bucket's list.
> > One possible minor optimization for xen-mapcache would be to reuse
> > larger mappings for mappings of lesser cache_size. Right now existing
> > code does checks in the "entry->size == cache_size" manner, while we
> > can use "entry->size >= cache_size" here. However, we may end up with
> > resident MapCacheEntries being mapped to a bigger mapping sizes than
> > necessary and thus might need to add remapping back to the normal size
> > in xen_invalidate_map_cache_entry_unlocked() when there are no other
> > mappings.
>
> Yes, I thought about it, that would be a good improvement to have.
Well, it appears there is a lot of room for improvement in xen-mapcache
usage. Probably getting rid of the lock=0/lock=1 request separation would
allow us to drastically reduce the number of xen_remap_bucket calls.
There also might be a possible bug where lock=0 mappings are remapped by
concurrent xen-mapcache requests.
The whole xen_map_cache(addr, 0, lock=0) thing looks very strange. As it
seems, the idea was to have a way to get a temporary mapping to read some
tiny item from the guest's RAM and then leave this mapping on its own,
without bothering to unmap it. So it will either be reused later by some
other lock=0/1 request or even remapped.
It appears that lock=0 mappings are very fragile. Their typical usage
scenario is like this:
rcu_read_lock();
...
ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
memcpy(buf, ptr, len);
...
rcu_read_unlock();
Here qemu_map_ram_ptr calls xen_map_cache(lock=0), which returns the
actual ptr. This scenario assumes there will be no intervention between
qemu_map_ram_ptr and rcu_read_unlock, providing ptr validity.
This might be OK for QEMU alone, but with the underlying xen-mapcache
usage it seems to be assumed that the RCU read lock provides protection
against concurrent remappings of ptr's MapCacheEntry... which it obviously
doesn't. The problem is that rcu_read_lock() seems to be used to protect
QEMU stuff only, leaving us only mapcache_(un)lock to synchronize
execution. But upon return from qemu_map_ram_ptr we don't hold the
xen-mapcache lock anymore, so the question is how the RCU read lock is
supposed to save us from concurrent
qemu_map_ram_ptr/qemu_ram_ptr_length calls. If there is some DMA mapping
(lock=1) for that address_index, or even another lock=0 read (of a
different size), it will see an unlocked entry which can be remapped
without hesitation, breaking the ptr mapping which might still be in use.
* Re: [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Stefano Stabellini @ 2017-07-20 23:57 UTC (permalink / raw)
To: Alexey G; +Cc: Anthony PERARD, james.mckenzie, Stefano Stabellini, xen-devel
On Thu, 20 Jul 2017, Alexey G wrote:
> On Wed, 19 Jul 2017 11:00:26 -0700 (PDT)
> Stefano Stabellini <sstabellini@kernel.org> wrote:
>
> > My expectation is that unlocked mappings are much more frequent than
> > locked mappings. Also, I expect that only very rarely we'll be able to
> > reuse locked mappings. Over the course of a VM lifetime, it seems to me
> > that walking the list every time would cost more than it would benefit.
> >
> > These are only "expectations", I would love to see numbers. Numbers make
> > for better decisions :-) Would you be up for gathering some of these
> > numbers? Such as how many times you get to reuse locked mappings and how
> > many times we walk items on the list fruitlessly?
> >
> > Otherwise, would you be up for just testing the modified version of the
> > patch I sent to verify that solves the bug?
>
> Numbers will show that there is a one single entry in the bucket's list
> most of the time. :) Even two entries are rare encounters, typically to be
> seen only when guest performs some intensive I/O. OK, I'll collect some real
> stats for different scenarios, these are interesting numbers, might come
> useful for later optimizations.
>
> The approach your proposed is good, but it allows reusing of suitable
> locked entries only when they come first in list (an existing behavior).
> But we can actually reuse a locked entry which may come next (if any) in
> the list as well. When we have the situation when lock=0 entry comes first
> in the list and lock=1 entry is the second -- there is a chance the first
> entry was a 2MB-type (must be some reason why 2nd entry was added to the
> list), so picking it for a lock0-request might result in
> xen_remap_bucket... which should be avoided. Anyway, there is no big deal
> which approach is better as these situations are uncommon. After all,
> mostly it's just a single entry in the bucket's list.
>
> > > One possible minor optimization for xen-mapcache would be to reuse
> > > larger mappings for mappings of lesser cache_size. Right now existing
> > > code does checks in the "entry->size == cache_size" manner, while we
> > > can use "entry->size >= cache_size" here. However, we may end up with
> > > resident MapCacheEntries being mapped to a bigger mapping sizes than
> > > necessary and thus might need to add remapping back to the normal size
> > > in xen_invalidate_map_cache_entry_unlocked() when there are no other
> > > mappings.
> >
> > Yes, I thought about it, that would be a good improvement to have.
>
> Well, it appears there is a lot of space for improvements in xen-mapcache
> usage. Probably getting rid of the lock0/lock1-request separation will
> allow to drastically reduce the number of xen_remap_bucket calls.
>
>
> There also might be a possible bug for lock0-mappings being remapped by
> concurrent xen-mapcache requests.
>
> The whole xen_map_cache(addr, 0, lock=0) thing looks very strange. As it
> seems, the idea was to have a way to receive a temporary mapping to read
> some tiny item from guest's RAM and after that leaving this mapping on its
> own without bothering to unmap it. So it will be either reused later by
> some other lock0/1-request or even remapped.
>
> It appears that lock=0 mappings are very fragile. Their typical usage
> scenario is like this:
>
> rcu_read_lock();
> ...
> ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> memcpy(buf, ptr, len);
> ...
> rcu_read_unlock();
>
> Here qemu_map_ram_ptr calls xen_map_cache(lock=0) which returns the actual
> ptr. This scenario assumes there will be no intervention between
> qemu_map_ram_ptr and rcu_read_unlock, providing ptr validity.
>
> This might be ok for QEMU alone, but with underlying xen-mapcache usage
> it seems to be assumed for RCU read lock to provide protection against
> concurrent remappings of ptr's MapCacheEntry... which it doesn't obviously.
> The problem is that rcu_read_lock() seems to be used to protect QEMU stuff
> only, leaving us only mapcache_(un)lock to sync execution. But, upon return
> from qemu_map_ram_ptr we don't hold the xen-mapcache lock anymore, so the
> question is how rcu read lock supposed to save us from concurrent
> qemu_map_ram_ptr/qemu_ram_ptr_length's? If there will be some DMA mapping
> (lock=1) for that address_index or even some another lock0-read (of
> different size) -- they will see an unlocked entry which can be remapped
> without hesitation, breaking the ptr mapping which might be still in use.
I think the answer is that there aren't supposed to be any concurrent
qemu_map_ram_ptr/qemu_ram_ptr_length calls, because QEMU is still
synchronous with respect to running the emulators (such as
hw/net/e1000.c), which should be the only ones that end up calling those
functions.
* Re: [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
From: Stefano Stabellini @ 2017-07-22 0:39 UTC (permalink / raw)
To: Alexey G; +Cc: Anthony PERARD, james.mckenzie, Stefano Stabellini, xen-devel
On Thu, 20 Jul 2017, Alexey G wrote:
> On Wed, 19 Jul 2017 11:00:26 -0700 (PDT)
> Stefano Stabellini <sstabellini@kernel.org> wrote:
>
> > My expectation is that unlocked mappings are much more frequent than
> > locked mappings. Also, I expect that only very rarely we'll be able to
> > reuse locked mappings. Over the course of a VM lifetime, it seems to me
> > that walking the list every time would cost more than it would benefit.
> >
> > These are only "expectations", I would love to see numbers. Numbers make
> > for better decisions :-) Would you be up for gathering some of these
> > numbers? Such as how many times you get to reuse locked mappings and how
> > many times we walk items on the list fruitlessly?
> >
> > Otherwise, would you be up for just testing the modified version of the
> > patch I sent to verify that solves the bug?
>
> Numbers will show that there is a one single entry in the bucket's list
> most of the time. :) Even two entries are rare encounters, typically to be
> seen only when guest performs some intensive I/O. OK, I'll collect some real
> stats for different scenarios, these are interesting numbers, might come
> useful for later optimizations.
>
> The approach your proposed is good, but it allows reusing of suitable
> locked entries only when they come first in list (an existing behavior).
> But we can actually reuse a locked entry which may come next (if any) in
> the list as well. When we have the situation when lock=0 entry comes first
> in the list and lock=1 entry is the second -- there is a chance the first
> entry was a 2MB-type (must be some reason why 2nd entry was added to the
> list), so picking it for a lock0-request might result in
> xen_remap_bucket... which should be avoided. Anyway, there is no big deal
> which approach is better as these situations are uncommon. After all,
> mostly it's just a single entry in the bucket's list.
Given that QEMU is about to release and I have to send a pull request
with another fix now, I am going to also send my version of the fix
right away (keeping you as main author of course).
However, I am more than happy to change the behavior of the algorithm in
the future if the numbers show that your version is better.