From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57712)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1byzyP-0005dm-66
	for qemu-devel@nongnu.org; Tue, 25 Oct 2016 07:32:55 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1byzyM-0006Ky-FL
	for qemu-devel@nongnu.org; Tue, 25 Oct 2016 07:32:53 -0400
Received: from mx1.redhat.com ([209.132.183.28]:53074)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <pbonzini@redhat.com>) id 1byzyM-0006Ki-6b
	for qemu-devel@nongnu.org; Tue, 25 Oct 2016 07:32:50 -0400
References: <20161024224044.7168.64176.stgit@gimli.home>
	<20161024225959.7168.20626.stgit@gimli.home>
From: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <1bacf82c-0758-0232-937b-1bac6ff9ad96@redhat.com>
Date: Tue, 25 Oct 2016 13:32:43 +0200
MIME-Version: 1.0
In-Reply-To: <20161024225959.7168.20626.stgit@gimli.home>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH 2/2] memory: Don't use memcpy for
 ram_device regions
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>, qemu-devel@nongnu.org
Cc: thorsten.kohfeldt@gmx.de


On 25/10/2016 01:00, Alex Williamson wrote:
> With a vfio assigned device we lay down a base MemoryRegion registered
> as an IO region, giving us read & write accessors.  If the region
> supports mmap, we lay down a higher priority sub-region MemoryRegion
> on top of the base layer initialized as a RAM device pointer to the
> mmap.  Finally, if we have any quirks for the device (ie. address
> ranges that need additional virtualization support), we put another IO
> sub-region on top of the mmap MemoryRegion.  When this is flattened,
> we now potentially have sub-page mmap MemoryRegions exposed which
> cannot be directly mapped through KVM.
>=20
> This is as expected, but a subtle detail of this is that we end up
> with two different access mechanisms through QEMU.  If we disable the
> mmap MemoryRegion, we make use of the IO MemoryRegion and service
> accesses using pread and pwrite to the vfio device file descriptor.
> If the mmap MemoryRegion is enabled and results in one of these
> sub-page gaps, QEMU handles the access as RAM, using memcpy to the
> mmap.  Using either pread/pwrite or the mmap directly should be
> correct, but using memcpy causes us problems.  I expect that not only
> does memcpy not necessarily honor the original width and alignment in
> performing a copy, but it potentially also uses processor instructions
> not intended for MMIO spaces.  It turns out that this has been a
> problem for Realtek NIC assignment, which has such a quirk that
> creates a sub-page mmap MemoryRegion access.
>=20
> To resolve this, we disable memory_access_is_direct() for ram_device
> regions since QEMU assumes that it can use memcpy for those regions.
> Instead we access through MemoryRegionOps, which replaces the memcpy
> with simple de-references of standard sizes to the host memory.  This
> also allows us to eliminate the ram_device bool from the MemoryRegion
> structure since we can simply test the ops pointer.
>=20
> With this patch we attempt to provide unrestricted access to the RAM
> device, allowing byte through qword access as well as unaligned
> access.  The assumption here is that accesses initiated by the VM are
> driven by a device specific driver, which knows the device
> capabilities.  If unaligned accesses are not supported by the device,
> we don't want them to work in a VM by performing multiple aligned
> accesses to compose the unaligned access.  A down-side of this
> philosophy is that the xp command from the monitor attempts to use
> the largest available access weidth, unaware of the underlying
> device.  Using memcpy had this same restriction, but at least now an
> operator can dump individual registers, even if blocks of device
> memory may result in access widths beyond the capabilities of a
> given device (RTL NICs only support up to dword).
>=20
> Reported-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  include/exec/memory.h |    7 +++--
>  memory.c              |   70 +++++++++++++++++++++++++++++++++++++++++=
+++++++-
>  trace-events          |    2 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
>=20
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a75b8c3..2d4a287 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -209,7 +209,6 @@ struct MemoryRegion {
>      void (*destructor)(MemoryRegion *mr);
>      uint64_t align;
>      bool terminates;
> -    bool ram_device;
>      bool enabled;
>      bool warning_printed; /* For reservations */
>      uint8_t vga_logging_count;
> @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_=
addr_t addr);
>  static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_w=
rite)
>  {
>      if (is_write) {
> -        return memory_region_is_ram(mr) && !mr->readonly;
> +        return memory_region_is_ram(mr) &&
> +               !mr->readonly && !memory_region_is_ram_device(mr);
>      } else {
> -        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
> +        return (memory_region_is_ram(mr) && !memory_region_is_ram_devi=
ce(mr)) ||
> +               memory_region_is_romd(mr);

Oops, I hadn't thought of this.  This is a bit slower, especially so if=20
you have to compare the ops pointer.  It's probably best to keep the=20
mr->ram_device flag, so that later we can place all five relevant flags=20
(ram, readonly, ram_device, rom_device, romd_mode) into the same word=20
and play games with bits, like the following:

	// write case
	mr->flags =3D=3D MR_RAM

	// read cases
	(mr->flags & ~MR_READONLY) =3D=3D MR_RAM ||
	(mr->flags & ~MR_READONLY) =3D=3D (MR_ROM_DEVICE | MR_ROMD_MODE)

(the latter in turn can be optimized to a range check if the flags are
arranged cleverly).

The patch is okay with mr->ram_device left in place, feel free to merge=20
it through your own VFIO tree.

Paolo

>      }
>  }
> =20
> diff --git a/memory.c b/memory.c
> index 7ffcff1..d07f785 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops =3D {
>      .endianness =3D DEVICE_NATIVE_ENDIAN,
>  };
> =20
> +static uint64_t memory_region_ram_device_read(void *opaque,
> +                                              hwaddr addr, unsigned si=
ze)
> +{
> +    MemoryRegion *mr =3D opaque;
> +    uint64_t data =3D (uint64_t)~0;
> +
> +    switch (size) {
> +    case 1:
> +        data =3D *(uint8_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 2:
> +        data =3D *(uint16_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 4:
> +        data =3D *(uint32_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 8:
> +        data =3D *(uint64_t *)(mr->ram_block->host + addr);
> +        break;
> +    }
> +
> +    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, dat=
a, size);
> +
> +    return data;
> +}
> +
> +static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> +                                           uint64_t data, unsigned siz=
e)
> +{
> +    MemoryRegion *mr =3D opaque;
> +
> +    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, da=
ta, size);
> +
> +    switch (size) {
> +    case 1:
> +        *(uint8_t *)(mr->ram_block->host + addr) =3D (uint8_t)data;
> +        break;
> +    case 2:
> +        *(uint16_t *)(mr->ram_block->host + addr) =3D (uint16_t)data;
> +        break;
> +    case 4:
> +        *(uint32_t *)(mr->ram_block->host + addr) =3D (uint32_t)data;
> +        break;
> +    case 8:
> +        *(uint64_t *)(mr->ram_block->host + addr) =3D data;
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps ram_device_mem_ops =3D {
> +    .read =3D memory_region_ram_device_read,
> +    .write =3D memory_region_ram_device_write,
> +    .endianness =3D DEVICE_NATIVE_ENDIAN,
> +    .valid =3D {
> +        .min_access_size =3D 1,
> +        .max_access_size =3D 8,
> +        .unaligned =3D true,
> +    },
> +    .impl =3D {
> +        .min_access_size =3D 1,
> +        .max_access_size =3D 8,
> +        .unaligned =3D true,
> +    },
> +};
> +
>  bool memory_region_access_valid(MemoryRegion *mr,
>                                  hwaddr addr,
>                                  unsigned size,
> @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegi=
on *mr,
>                                         void *ptr)
>  {
>      memory_region_init_ram_ptr(mr, owner, name, size, ptr);
> -    mr->ram_device =3D true;
> +    mr->ops =3D &ram_device_mem_ops;
> +    mr->opaque =3D mr;
>  }
> =20
>  void memory_region_init_alias(MemoryRegion *mr,
> @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion=
 *mr)
> =20
>  bool memory_region_is_ram_device(MemoryRegion *mr)
>  {
> -    return mr->ram_device;
> +    return mr->ops =3D=3D &ram_device_mem_ops;
>  }
> =20
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
> diff --git a/trace-events b/trace-events
> index 8ecded5..f74e1d3 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr,=
 uint64_t offset, uint64_t va
>  memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, =
uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value %#"P=
RIx64" size %u"
>  memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, un=
signed size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
>  memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, u=
nsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, =
uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRI=
x64" size %u"
> +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr,=
 uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PR=
Ix64" size %u"
> =20
>  ### Guest events, keep at bottom
> =20
>=20