From: Paolo Bonzini <pbonzini@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>, qemu-devel@nongnu.org
Cc: thorsten.kohfeldt@gmx.de
Subject: Re: [Qemu-devel] [PATCH 2/2] memory: Don't use memcpy for ram_device regions
Date: Tue, 25 Oct 2016 13:32:43 +0200 [thread overview]
Message-ID: <1bacf82c-0758-0232-937b-1bac6ff9ad96@redhat.com> (raw)
In-Reply-To: <20161024225959.7168.20626.stgit@gimli.home>
On 25/10/2016 01:00, Alex Williamson wrote:
> With a vfio assigned device we lay down a base MemoryRegion registered
> as an IO region, giving us read & write accessors. If the region
> supports mmap, we lay down a higher priority sub-region MemoryRegion
> on top of the base layer initialized as a RAM device pointer to the
> mmap. Finally, if we have any quirks for the device (ie. address
> ranges that need additional virtualization support), we put another IO
> sub-region on top of the mmap MemoryRegion. When this is flattened,
> we now potentially have sub-page mmap MemoryRegions exposed which
> cannot be directly mapped through KVM.
>
> This is as expected, but a subtle detail of this is that we end up
> with two different access mechanisms through QEMU. If we disable the
> mmap MemoryRegion, we make use of the IO MemoryRegion and service
> accesses using pread and pwrite to the vfio device file descriptor.
> If the mmap MemoryRegion is enabled and results in one of these
> sub-page gaps, QEMU handles the access as RAM, using memcpy to the
> mmap. Using either pread/pwrite or the mmap directly should be
> correct, but using memcpy causes us problems. I expect that not only
> does memcpy not necessarily honor the original width and alignment in
> performing a copy, but it potentially also uses processor instructions
> not intended for MMIO spaces. It turns out that this has been a
> problem for Realtek NIC assignment, which has such a quirk that
> creates a sub-page mmap MemoryRegion access.
>
> To resolve this, we disable memory_access_is_direct() for ram_device
> regions since QEMU assumes that it can use memcpy for those regions.
> Instead we access through MemoryRegionOps, which replaces the memcpy
> with simple de-references of standard sizes to the host memory. This
> also allows us to eliminate the ram_device bool from the MemoryRegion
> structure since we can simply test the ops pointer.
>
> With this patch we attempt to provide unrestricted access to the RAM
> device, allowing byte through qword access as well as unaligned
> access. The assumption here is that accesses initiated by the VM are
> driven by a device specific driver, which knows the device
> capabilities. If unaligned accesses are not supported by the device,
> we don't want them to work in a VM by performing multiple aligned
> accesses to compose the unaligned access. A down-side of this
> philosophy is that the xp command from the monitor attempts to use
> the largest available access weidth, unaware of the underlying
> device. Using memcpy had this same restriction, but at least now an
> operator can dump individual registers, even if blocks of device
> memory may result in access widths beyond the capabilities of a
> given device (RTL NICs only support up to dword).
>
> Reported-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> include/exec/memory.h | 7 +++--
> memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++-
> trace-events | 2 +
> 3 files changed, 74 insertions(+), 5 deletions(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a75b8c3..2d4a287 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -209,7 +209,6 @@ struct MemoryRegion {
> void (*destructor)(MemoryRegion *mr);
> uint64_t align;
> bool terminates;
> - bool ram_device;
> bool enabled;
> bool warning_printed; /* For reservations */
> uint8_t vga_logging_count;
> @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
> static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
> {
> if (is_write) {
> - return memory_region_is_ram(mr) && !mr->readonly;
> + return memory_region_is_ram(mr) &&
> + !mr->readonly && !memory_region_is_ram_device(mr);
> } else {
> - return memory_region_is_ram(mr) || memory_region_is_romd(mr);
> + return (memory_region_is_ram(mr) && !memory_region_is_ram_device(mr)) ||
> + memory_region_is_romd(mr);
Oops, I hadn't thought of this. This is a bit slower, especially so if
you have to compare the ops pointer. It's probably best to keep the
mr->ram_device flag, so that later we can place all five relevant flags
(ram, readonly, ram_device, rom_device, romd_mode) into the same word
and play games with bits, like the following:
// write case
mr->flags == MR_RAM
// read cases
(mr->flags & ~MR_READONLY) == MR_RAM ||
(mr->flags & ~MR_READONLY) == (MR_ROM_DEVICE | MR_ROMD_MODE)
(the latter in turn can be optimized to a range check if the flags are
arranged cleverly).
The patch is okay with mr->ram_device left in place, feel free to merge
it through your own VFIO tree.
Paolo
> }
> }
>
> diff --git a/memory.c b/memory.c
> index 7ffcff1..d07f785 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops = {
> .endianness = DEVICE_NATIVE_ENDIAN,
> };
>
> +static uint64_t memory_region_ram_device_read(void *opaque,
> + hwaddr addr, unsigned size)
> +{
> + MemoryRegion *mr = opaque;
> + uint64_t data = (uint64_t)~0;
> +
> + switch (size) {
> + case 1:
> + data = *(uint8_t *)(mr->ram_block->host + addr);
> + break;
> + case 2:
> + data = *(uint16_t *)(mr->ram_block->host + addr);
> + break;
> + case 4:
> + data = *(uint32_t *)(mr->ram_block->host + addr);
> + break;
> + case 8:
> + data = *(uint64_t *)(mr->ram_block->host + addr);
> + break;
> + }
> +
> + trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
> +
> + return data;
> +}
> +
> +static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> + uint64_t data, unsigned size)
> +{
> + MemoryRegion *mr = opaque;
> +
> + trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> +
> + switch (size) {
> + case 1:
> + *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data;
> + break;
> + case 2:
> + *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data;
> + break;
> + case 4:
> + *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data;
> + break;
> + case 8:
> + *(uint64_t *)(mr->ram_block->host + addr) = data;
> + break;
> + }
> +}
> +
> +static const MemoryRegionOps ram_device_mem_ops = {
> + .read = memory_region_ram_device_read,
> + .write = memory_region_ram_device_write,
> + .endianness = DEVICE_NATIVE_ENDIAN,
> + .valid = {
> + .min_access_size = 1,
> + .max_access_size = 8,
> + .unaligned = true,
> + },
> + .impl = {
> + .min_access_size = 1,
> + .max_access_size = 8,
> + .unaligned = true,
> + },
> +};
> +
> bool memory_region_access_valid(MemoryRegion *mr,
> hwaddr addr,
> unsigned size,
> @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> void *ptr)
> {
> memory_region_init_ram_ptr(mr, owner, name, size, ptr);
> - mr->ram_device = true;
> + mr->ops = &ram_device_mem_ops;
> + mr->opaque = mr;
> }
>
> void memory_region_init_alias(MemoryRegion *mr,
> @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion *mr)
>
> bool memory_region_is_ram_device(MemoryRegion *mr)
> {
> - return mr->ram_device;
> + return mr->ops == &ram_device_mem_ops;
> }
>
> uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
> diff --git a/trace-events b/trace-events
> index 8ecded5..f74e1d3 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t va
> memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value %#"PRIx64" size %u"
> memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" size %u"
>
> ### Guest events, keep at bottom
>
>
prev parent reply other threads:[~2016-10-25 11:32 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-10-24 22:59 [Qemu-devel] [PATCH 0/2] memory: Convert skip_dump to ram_device and avoid memcpy Alex Williamson
2016-10-24 22:59 ` [Qemu-devel] [PATCH 1/2] memory: Replace skip_dump flag with "ram_device" Alex Williamson
2016-10-24 23:00 ` [Qemu-devel] [PATCH 2/2] memory: Don't use memcpy for ram_device regions Alex Williamson
2016-10-25 11:32 ` Paolo Bonzini [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1bacf82c-0758-0232-937b-1bac6ff9ad96@redhat.com \
--to=pbonzini@redhat.com \
--cc=alex.williamson@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=thorsten.kohfeldt@gmx.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).