Date: Thu, 20 Feb 2020 02:03:08 -0500
From: Raphael Norwitz
To: "Michael S. Tsirkin"
Cc: qemu-devel@nongnu.org
Subject: Re: [PATCH v2 3/3] Lift max memory slots limit imposed by vhost-user
Message-ID: <20200220070308.GA2215@localhost.localdomain>
In-Reply-To: <20200209174335.GA15050@localhost.localdomain>

Ping

On Sun, Feb 09, 2020 at 12:43:35PM -0500, Raphael Norwitz wrote:
> 
> On Thu, Feb 06, 2020 at 03:32:38AM -0500, Michael S. Tsirkin wrote:
> > 
> > On Wed, Jan 15, 2020 at 09:57:06PM -0500, Raphael Norwitz wrote:
> > > The current vhost-user implementation in Qemu imposes a limit on the
> > > maximum number of memory slots exposed to a VM using a vhost-user
> > > device.
> > > This change provides a new protocol feature
> > > VHOST_USER_F_CONFIGURE_SLOTS which, when enabled, lifts this limit and
> > > allows a VM with a vhost-user device to expose a configurable number of
> > > memory slots, up to the ACPI defined maximum. Existing backends which
> > > do not support this protocol feature are unaffected.
> > 
> > Hmm ACPI maximum seems to be up to 512 - is this too much to fit in a
> > single message? So can't we just increase the number (after negotiating
> > with remote) and be done with it, instead of add/remove? Or is there
> > another reason to prefer add/remove?
> > 
> 
> As mentioned in my cover letter, we experimented with simply increasing the
> message size and it didn't work on our setup. We debugged down to the socket
> layer and found that on the receiving end the messages were truncated at
> around 512 bytes, or around 16 memory regions. To support 512 memory regions
> we would need a message size of around 32 * 512 + 8 ~= 16k. That would be 64
> times larger than the next largest message size. We thought it would be
> cleaner and more in line with the rest of the protocol to keep the message
> sizes smaller. In particular, we thought memory regions should be treated
> like the rings, which are sent over one message at a time instead of in one
> large message. Whether or not such a large message size can be made to work
> in our case, separate messages will always work on Linux, and most likely
> all other UNIX platforms QEMU is used on.
> 
> > 
> > > This feature works by using three new messages,
> > > VHOST_USER_GET_MAX_MEM_SLOTS, VHOST_USER_ADD_MEM_REG and
> > > VHOST_USER_REM_MEM_REG. VHOST_USER_GET_MAX_MEM_SLOTS gets the
> > > number of memory slots the backend is willing to accept when the
> > > backend is initialized. Then, when the memory tables are set or updated,
> > > a series of VHOST_USER_ADD_MEM_REG and VHOST_USER_REM_MEM_REG messages
> > > are sent to transmit the regions to map and/or unmap instead of trying
> > > to send all the regions in one fixed size VHOST_USER_SET_MEM_TABLE
> > > message.
> > > 
> > > The vhost_user struct maintains a shadow state of the VM's memory
> > > regions. When the memory tables are modified, the
> > > vhost_user_set_mem_table() function compares the new device memory state
> > > to the shadow state and only sends regions which need to be unmapped or
> > > mapped in. The regions which must be unmapped are sent first, followed
> > > by the new regions to be mapped in. After all the messages have been
> > > sent, the shadow state is set to the current virtual device state.
> > > 
> > > The current feature implementation does not work with postcopy migration
> > > and cannot be enabled if the VHOST_USER_PROTOCOL_F_REPLY_ACK feature has
> > > also been negotiated.
> > 
> > Hmm what would it take to lift the restrictions?
> > conflicting features like this makes it very hard for users to make
> > an informed choice what to support.
> > 
> 
> We would need a setup with a backend which supports these features
> (REPLY_ACK and postcopy migration). At first glance it looks like DPDK
> could work, but I'm not sure how easy it will be to test postcopy
> migration with the resources we have.
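
For reference, the 32 * 512 + 8 arithmetic above falls out of the wire
format: each region descriptor is four 64-bit fields, i.e. 32 bytes on
the wire, and the memory-table message carries an 8-byte nregions/padding
header in front of the region array. A minimal sketch of that layout,
matching the VhostUserMemoryRegion struct in hw/virtio/vhost-user.c
(comments are illustrative, not quoted from the source):

    typedef struct VhostUserMemoryRegion {
        uint64_t guest_phys_addr;   /* guest physical start address */
        uint64_t memory_size;       /* region length in bytes */
        uint64_t userspace_addr;    /* QEMU virtual address where mapped */
        uint64_t mmap_offset;       /* offset into the mmap'ed fd */
    } VhostUserMemoryRegion;        /* sizeof() == 32 bytes */

    /*
     * One SET_MEM_TABLE message carrying 512 regions would need roughly
     * 512 * 32 + 8 ~= 16 KiB of payload, whereas each ADD_MEM_REG or
     * REM_MEM_REG message carries exactly one 32-byte region descriptor.
     */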
> 
> > > Signed-off-by: Raphael Norwitz
> > > Signed-off-by: Peter Turschmid
> > > Suggested-by: Mike Cui
> > > ---
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index af83fdd..fed6d02 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -35,11 +35,29 @@
> > >  #include <linux/userfaultfd.h>
> > >  #endif
> > >  
> > > -#define VHOST_MEMORY_MAX_NREGIONS    8
> > > +#define VHOST_MEMORY_LEGACY_NREGIONS    8
> > 
> > Hardly legacy when this is intended to always be used e.g. with
> > postcopy, right?
> > 
> 
> How about 'BASELINE'?
> 
> > > +    msg->hdr.size = sizeof(msg->payload.mem_reg.padding);
> > > +    msg->hdr.size += sizeof(VhostUserMemoryRegion);
> > > +
> > > +    /*
> > > +     * Send VHOST_USER_REM_MEM_REG for memory regions in our shadow state
> > > +     * which are not found not in the device's memory state.
> > 
> > double negation - could not parse this.
> > 
> 
> Apologies - typo here. It should say "Send VHOST_USER_REM_MEM_REG for memory
> regions in our shadow state which are not found in the device's memory
> state." i.e. send messages to remove regions in the shadow state but not in
> the updated device state.
> 
> > > +     */
> > > +    for (i = 0; i < u->num_shadow_regions; ++i) {
> > > +        reg = dev->mem->regions;
> > > +
> > > +        for (j = 0; j < dev->mem->nregions; j++) {
> > > +            reg = dev->mem->regions + j;
> > > +
> > > +            assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> > > +            mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> > > +                                         &offset);
> > > +            fd = memory_region_get_fd(mr);
> > > +
> > > +            if (reg_equal(&u->shadow_regions[i], reg)) {
> > > +                matching = true;
> > > +                found[j] = true;
> > > +                break;
> > > +            }
> > > +        }
> > > +
> > > +        if (fd > 0 && !matching) {
> > > +            msg->hdr.request = VHOST_USER_REM_MEM_REG;
> > > +            msg->payload.mem_reg.region.userspace_addr = reg->userspace_addr;
> > > +            msg->payload.mem_reg.region.memory_size = reg->memory_size;
> > > +            msg->payload.mem_reg.region.guest_phys_addr =
> > > +                reg->guest_phys_addr;
> > > +            msg->payload.mem_reg.region.mmap_offset = offset;
> > > +
> > > +            if (vhost_user_write(dev, msg, &fd, 1) < 0) {
> > > +                return -1;
> > > +            }
> > > +        }
> > > +    }
> > > +
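
The hunk quoted above only covers the removal pass. The commit message
describes a complementary pass that sends VHOST_USER_ADD_MEM_REG for
device regions absent from the shadow state; a rough sketch of what that
pass could look like, reusing the same locals as the quoted hunk (this is
an illustration of the described approach, not the literal patch text):

    /*
     * Sketch only - not the literal patch. Send VHOST_USER_ADD_MEM_REG
     * for each region in the device's current memory state that the
     * removal pass above did not match against the shadow state, then
     * record the new state as the shadow copy.
     */
    for (j = 0; j < dev->mem->nregions; j++) {
        if (found[j]) {
            continue;   /* region already known to the backend */
        }
        reg = dev->mem->regions + j;
        mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
                                     &offset);
        fd = memory_region_get_fd(mr);

        if (fd > 0) {
            msg->hdr.request = VHOST_USER_ADD_MEM_REG;
            msg->payload.mem_reg.region.userspace_addr = reg->userspace_addr;
            msg->payload.mem_reg.region.memory_size = reg->memory_size;
            msg->payload.mem_reg.region.guest_phys_addr =
                reg->guest_phys_addr;
            msg->payload.mem_reg.region.mmap_offset = offset;

            if (vhost_user_write(dev, msg, &fd, 1) < 0) {
                return -1;
            }
        }
    }

    /* Refresh the shadow state from the device's current regions. */
    memcpy(u->shadow_regions, dev->mem->regions,
           dev->mem->nregions * sizeof(struct vhost_memory_region));
    u->num_shadow_regions = dev->mem->nregions;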