Date: Tue, 31 Jan 2023 09:10:18 -0500
From: Stefan Hajnoczi
To: Daniel P. Berrangé
Cc: Stefan Hajnoczi, Sam Li, qemu-devel@nongnu.org, dmitry.fomichev@wdc.com,
    Raphael Norwitz, "Michael S. Tsirkin", Kevin Wolf,
    damien.lemoal@opensource.wdc.com, hare@suse.de, Markus Armbruster,
    qemu-block@nongnu.org, Eric Blake, Hanna Reitz
Subject: Re: [RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices
References: <20230129103951.86063-1-faithilikerun@gmail.com>
    <20230129103951.86063-3-faithilikerun@gmail.com>

On Mon, Jan 30, 2023 at 06:30:16PM +0000, Daniel P. Berrangé wrote:
> On Mon, Jan 30, 2023 at 10:17:48AM -0500, Stefan Hajnoczi wrote:
> > On Mon, 30 Jan 2023 at 07:33, Daniel P. Berrangé wrote:
> > >
> > > On Sun, Jan 29, 2023 at 06:39:49PM +0800, Sam Li wrote:
> > > > This patch extends virtio-blk emulation to handle zoned device commands
> > > > by calling the new block layer APIs to perform zoned device I/O on
> > > > behalf of the guest. It supports Report Zone, four zone operations (open,
> > > > close, finish, reset), and Append Zone.
> > > >
> > > > The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
> > > > support zoned block devices. Regular block devices (conventional zones)
> > > > will not be set.
> > > >
> > > > The guest OS can use blktests, fio to test those commands on zoned devices.
> > > > Furthermore, using zonefs to test zone append write is also supported.
> > > >
> > > > Signed-off-by: Sam Li
> > > > ---
> > > >  hw/block/virtio-blk-common.c |   2 +
> > > >  hw/block/virtio-blk.c        | 394 +++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 396 insertions(+)
> > > >
> > >
> > > > @@ -949,6 +1311,30 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
> > > >          blkcfg.write_zeroes_may_unmap = 1;
> > > >          virtio_stl_p(vdev, &blkcfg.max_write_zeroes_seg, 1);
> > > >      }
> > > > +    if (bs->bl.zoned != BLK_Z_NONE) {
> > > > +        switch (bs->bl.zoned) {
> > > > +        case BLK_Z_HM:
> > > > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HM;
> > > > +            break;
> > > > +        case BLK_Z_HA:
> > > > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HA;
> > > > +            break;
> > > > +        default:
> > > > +            g_assert_not_reached();
> > > > +        }
> > > > +
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
> > > > +                     bs->bl.zone_size / 512);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
> > > > +                     bs->bl.max_active_zones);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
> > > > +                     bs->bl.max_open_zones);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
> > > > +                     bs->bl.max_append_sectors);
> > >
> > > So these are all ABI sensitive frontend device settings, but they are
> > > not exposed as tunables on the virtio-blk device, instead they are
> > > implicitly set from the backend.
> > >
> > > We have done this kind of thing before in QEMU, but several times it
> > > has bitten QEMU maintainers/users, as having a backend affect the
> > > frontend ABI is not too typical. It wouldn't be immediately obvious
> > > when starting QEMU on a target host that the live migration would
> > > be breaking ABI if the target host wasn't using a zoned device with
> > > exact same settings.
> > >
> > > This also limits mgmt flexibility across live migration, if the
> > > mgmt app wants/needs to change the storage backend. eg maybe they
> > > need to evacuate the host for an emergency, but don't have spare
> > > hosts with same kind of storage. It might be desirable to migrate
> > > and switch to a plain block device or raw/qcow2 file, rather than
> > > let the VM die.
> > >
> > > Can we make these virtio settings be explicitly controlled on the
> > > virtio-blk device? If not specified explicitly they could be
> > > auto-populated from the backend for ease of use, but if specified
> > > then simply validate the backend is a match. libvirt would then
> > > make sure these are always explicitly set on the frontend.
> >
> > I think this is a good idea, especially if we streamline the
> > file-posix.c driver by merging --blockdev zoned_host_device into
> > --blockdev host_device. It won't be obvious from the command-line
> > whether this is a zoned or non-zoned device. There should be a
> > --device virtio-blk-pci,drive=drive0,zoned=on option that fails when
> > drive0 isn't zoned.
> > It should probably be on/off/auto where auto is
> > the default and doesn't check anything, on requires a zoned device,
> > and off requires a non-zoned device. That will prevent accidental
> > migration between zoned/non-zoned devices.
> >
> > I want to point out that virtio-blk doesn't have checks for the disk
> > size or other details, so what you're suggesting for zone_sectors, etc
> > is stricter than what QEMU does today. Since the virtio-blk parameters
> > you're proposing are optional, I think it doesn't hurt though.
>
> Yeah, it is slightly different than some of the parameters handling.
> I guess you could say that with disk capacity, matching size is a
> fairly obvious constraint/expectation to manage, and also long standing.
>
> With disk capacity, you can add the 'raw' driver on top of any block
> driver stack, to apply an arbitrary offset+size, to make the storage
> smaller than it otherwise is on disk. Conceptually that could have
> been done on the frontend device(s) too, but I guess it made more
> sense to do it in the block layer to give consistent enforcement
> of the limits across frontends. It is fuzzy whether such a use of
> the 'raw' driver is really considered backend config, as opposed to
> frontend config but to me it feels like frontend config.
>
> You could possibly come up with the concept of a 'zoned' format that
> can be layered on top of a block driver stack to add zoned I/O constraints
> for sake of compatibility, where none otherwise exists in the physical
> storage. Possibly useful if multiple frontends all support zoned storage,
> to avoid duplicating the constraints across all ?

Maybe:

  DEFINE_BLOCK_ZONED_PROPERTIES(VirtIOBlock, conf.conf),

and then:

  bool blkconf_check_zoned_properties(BlockBackend *blk, BlockZonedConf *conf,
                                      Error **errp);

That macro and helper function can be shared by all emulated storage
controllers that implement zoned storage.
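
A rough standalone sketch of what such a shared helper might check, combining
the on/off/auto semantics with the "auto-populate if unset, otherwise validate
against the backend" rule suggested earlier in the thread. This is not QEMU
code: ZonedOnOffAuto, OptU32, BackendZoneLimits, FrontendZonedConf and
check_zoned_conf() are hypothetical stand-ins invented for illustration, and
the choice to skip the parameter checks on a non-zoned backend is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not QEMU definitions. */

typedef enum { ZONED_AUTO, ZONED_ON, ZONED_OFF } ZonedOnOffAuto;

/* Optional 32-bit property: 'set' stays false until the user or the
 * backend supplies a value. */
typedef struct {
    bool set;
    uint32_t value;
} OptU32;

/* Limits as reported by the backend (block layer). */
typedef struct {
    bool zoned;
    uint32_t zone_sectors;      /* zone size in 512-byte sectors */
    uint32_t max_open_zones;
    uint32_t max_active_zones;
} BackendZoneLimits;

/* What the user configured on the frontend device. */
typedef struct {
    ZonedOnOffAuto zoned;
    OptU32 zone_sectors;
    OptU32 max_open_zones;
    OptU32 max_active_zones;
} FrontendZonedConf;

/* Unset property: adopt the backend value. Set property: must match. */
static bool match_or_fill(OptU32 *opt, uint32_t backend_value)
{
    if (!opt->set) {
        opt->value = backend_value;
        opt->set = true;
        return true;
    }
    return opt->value == backend_value;
}

/* Returns true if the frontend configuration is compatible with the
 * backend; on failure, *errp points at a static message. */
bool check_zoned_conf(FrontendZonedConf *conf, const BackendZoneLimits *bl,
                      const char **errp)
{
    assert(conf && bl && errp);

    if (conf->zoned == ZONED_ON && !bl->zoned) {
        *errp = "zoned=on requires a zoned backend";
        return false;
    }
    if (conf->zoned == ZONED_OFF && bl->zoned) {
        *errp = "zoned=off requires a non-zoned backend";
        return false;
    }
    if (!bl->zoned) {
        /* Assumption: nothing further to validate without zones. */
        return true;
    }
    if (!match_or_fill(&conf->zone_sectors, bl->zone_sectors)) {
        *errp = "zone_sectors does not match the backend";
        return false;
    }
    if (!match_or_fill(&conf->max_open_zones, bl->max_open_zones)) {
        *errp = "max_open_zones does not match the backend";
        return false;
    }
    if (!match_or_fill(&conf->max_active_zones, bl->max_active_zones)) {
        *errp = "max_active_zones does not match the backend";
        return false;
    }
    return true;
}
```

In this sketch a management layer that always sets every property gets a hard
failure on mismatch, while a user who sets nothing gets the backend's values
filled in.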

However, there's one problem: some storage interfaces extend the zoned
storage model (e.g. NVMe ZNS seems to have functionality that's not
available elsewhere). It would be necessary to check whether there is a
common subset of parameters with matching property names (because
terminology could be different) across emulated storage controllers.
But I think it's likely that this will work.

I think the macro and helper function approach is nice because it's
internal to QEMU and users don't need to set up a --blockdev
enforce-zoned.

Stefan