From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57712) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1byzyP-0005dm-66 for qemu-devel@nongnu.org; Tue, 25 Oct 2016 07:32:55 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1byzyM-0006Ky-FL for qemu-devel@nongnu.org; Tue, 25 Oct 2016 07:32:53 -0400 Received: from mx1.redhat.com ([209.132.183.28]:53074) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1byzyM-0006Ki-6b for qemu-devel@nongnu.org; Tue, 25 Oct 2016 07:32:50 -0400 References: <20161024224044.7168.64176.stgit@gimli.home> <20161024225959.7168.20626.stgit@gimli.home> From: Paolo Bonzini Message-ID: <1bacf82c-0758-0232-937b-1bac6ff9ad96@redhat.com> Date: Tue, 25 Oct 2016 13:32:43 +0200 MIME-Version: 1.0 In-Reply-To: <20161024225959.7168.20626.stgit@gimli.home> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH 2/2] memory: Don't use memcpy for ram_device regions List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson , qemu-devel@nongnu.org Cc: thorsten.kohfeldt@gmx.de On 25/10/2016 01:00, Alex Williamson wrote: > With a vfio assigned device we lay down a base MemoryRegion registered > as an IO region, giving us read & write accessors. If the region > supports mmap, we lay down a higher priority sub-region MemoryRegion > on top of the base layer initialized as a RAM device pointer to the > mmap. Finally, if we have any quirks for the device (ie. address > ranges that need additional virtualization support), we put another IO > sub-region on top of the mmap MemoryRegion. When this is flattened, > we now potentially have sub-page mmap MemoryRegions exposed which > cannot be directly mapped through KVM. >=20 > This is as expected, but a subtle detail of this is that we end up > with two different access mechanisms through QEMU. If we disable the > mmap MemoryRegion, we make use of the IO MemoryRegion and service > accesses using pread and pwrite to the vfio device file descriptor. > If the mmap MemoryRegion is enabled and results in one of these > sub-page gaps, QEMU handles the access as RAM, using memcpy to the > mmap. Using either pread/pwrite or the mmap directly should be > correct, but using memcpy causes us problems. I expect that not only > does memcpy not necessarily honor the original width and alignment in > performing a copy, but it potentially also uses processor instructions > not intended for MMIO spaces. It turns out that this has been a > problem for Realtek NIC assignment, which has such a quirk that > creates a sub-page mmap MemoryRegion access. >=20 > To resolve this, we disable memory_access_is_direct() for ram_device > regions since QEMU assumes that it can use memcpy for those regions. > Instead we access through MemoryRegionOps, which replaces the memcpy > with simple de-references of standard sizes to the host memory. This > also allows us to eliminate the ram_device bool from the MemoryRegion > structure since we can simply test the ops pointer. >=20 > With this patch we attempt to provide unrestricted access to the RAM > device, allowing byte through qword access as well as unaligned > access. The assumption here is that accesses initiated by the VM are > driven by a device specific driver, which knows the device > capabilities. If unaligned accesses are not supported by the device, > we don't want them to work in a VM by performing multiple aligned > accesses to compose the unaligned access. A down-side of this > philosophy is that the xp command from the monitor attempts to use > the largest available access weidth, unaware of the underlying > device. Using memcpy had this same restriction, but at least now an > operator can dump individual registers, even if blocks of device > memory may result in access widths beyond the capabilities of a > given device (RTL NICs only support up to dword). >=20 > Reported-by: Thorsten Kohfeldt > Signed-off-by: Alex Williamson > --- > include/exec/memory.h | 7 +++-- > memory.c | 70 +++++++++++++++++++++++++++++++++++++++++= +++++++- > trace-events | 2 + > 3 files changed, 74 insertions(+), 5 deletions(-) >=20 > diff --git a/include/exec/memory.h b/include/exec/memory.h > index a75b8c3..2d4a287 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -209,7 +209,6 @@ struct MemoryRegion { > void (*destructor)(MemoryRegion *mr); > uint64_t align; > bool terminates; > - bool ram_device; > bool enabled; > bool warning_printed; /* For reservations */ > uint8_t vga_logging_count; > @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_= addr_t addr); > static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_w= rite) > { > if (is_write) { > - return memory_region_is_ram(mr) && !mr->readonly; > + return memory_region_is_ram(mr) && > + !mr->readonly && !memory_region_is_ram_device(mr); > } else { > - return memory_region_is_ram(mr) || memory_region_is_romd(mr); > + return (memory_region_is_ram(mr) && !memory_region_is_ram_devi= ce(mr)) || > + memory_region_is_romd(mr); Oops, I hadn't thought of this. This is a bit slower, especially so if=20 you have to compare the ops pointer. It's probably best to keep the=20 mr->ram_device flag, so that later we can place all five relevant flags=20 (ram, readonly, ram_device, rom_device, romd_mode) into the same word=20 and play games with bits, like the following: // write case mr->flags =3D=3D MR_RAM // read cases (mr->flags & ~MR_READONLY) =3D=3D MR_RAM || (mr->flags & ~MR_READONLY) =3D=3D (MR_ROM_DEVICE | MR_ROMD_MODE) (the latter in turn can be optimized to a range check if the flags are arranged cleverly). The patch is okay with mr->ram_device left in place, feel free to merge=20 it through your own VFIO tree. Paolo > } > } > =20 > diff --git a/memory.c b/memory.c > index 7ffcff1..d07f785 100644 > --- a/memory.c > +++ b/memory.c > @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops =3D { > .endianness =3D DEVICE_NATIVE_ENDIAN, > }; > =20 > +static uint64_t memory_region_ram_device_read(void *opaque, > + hwaddr addr, unsigned si= ze) > +{ > + MemoryRegion *mr =3D opaque; > + uint64_t data =3D (uint64_t)~0; > + > + switch (size) { > + case 1: > + data =3D *(uint8_t *)(mr->ram_block->host + addr); > + break; > + case 2: > + data =3D *(uint16_t *)(mr->ram_block->host + addr); > + break; > + case 4: > + data =3D *(uint32_t *)(mr->ram_block->host + addr); > + break; > + case 8: > + data =3D *(uint64_t *)(mr->ram_block->host + addr); > + break; > + } > + > + trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, dat= a, size); > + > + return data; > +} > + > +static void memory_region_ram_device_write(void *opaque, hwaddr addr, > + uint64_t data, unsigned siz= e) > +{ > + MemoryRegion *mr =3D opaque; > + > + trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, da= ta, size); > + > + switch (size) { > + case 1: > + *(uint8_t *)(mr->ram_block->host + addr) =3D (uint8_t)data; > + break; > + case 2: > + *(uint16_t *)(mr->ram_block->host + addr) =3D (uint16_t)data; > + break; > + case 4: > + *(uint32_t *)(mr->ram_block->host + addr) =3D (uint32_t)data; > + break; > + case 8: > + *(uint64_t *)(mr->ram_block->host + addr) =3D data; > + break; > + } > +} > + > +static const MemoryRegionOps ram_device_mem_ops =3D { > + .read =3D memory_region_ram_device_read, > + .write =3D memory_region_ram_device_write, > + .endianness =3D DEVICE_NATIVE_ENDIAN, > + .valid =3D { > + .min_access_size =3D 1, > + .max_access_size =3D 8, > + .unaligned =3D true, > + }, > + .impl =3D { > + .min_access_size =3D 1, > + .max_access_size =3D 8, > + .unaligned =3D true, > + }, > +}; > + > bool memory_region_access_valid(MemoryRegion *mr, > hwaddr addr, > unsigned size, > @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegi= on *mr, > void *ptr) > { > memory_region_init_ram_ptr(mr, owner, name, size, ptr); > - mr->ram_device =3D true; > + mr->ops =3D &ram_device_mem_ops; > + mr->opaque =3D mr; > } > =20 > void memory_region_init_alias(MemoryRegion *mr, > @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion= *mr) > =20 > bool memory_region_is_ram_device(MemoryRegion *mr) > { > - return mr->ram_device; > + return mr->ops =3D=3D &ram_device_mem_ops; > } > =20 > uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr) > diff --git a/trace-events b/trace-events > index 8ecded5..f74e1d3 100644 > --- a/trace-events > +++ b/trace-events > @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr,= uint64_t offset, uint64_t va > memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, = uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value %#"P= RIx64" size %u" > memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, un= signed size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u" > memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, u= nsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u" > +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, = uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRI= x64" size %u" > +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr,= uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PR= Ix64" size %u" > =20 > ### Guest events, keep at bottom > =20 >=20