Date: Wed, 28 Feb 2024 14:21:03 -0600
From: "Theodore Ts'o"
To: Amir Goldstein
Cc: "Luis R. Rodriguez", lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm, Jan Kara
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] untorn buffered writes
Message-ID: <20240228202103.GA177082@mit.edu>
References: <20240228061257.GA106651@mit.edu>

On Wed, Feb 28, 2024 at 01:38:44PM +0200, Amir Goldstein wrote:
>
> Seems a duplicate of this topic proposed by Luis?
>
> https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@bombadil.infradead.org/

Maybe.  I did see Luis's topic, but it seemed to me to be largely
orthogonal to what I was interested in talking about.
Maybe I'm missing something, but my observations were largely similar
to Dave Chinner's comments here:

https://lore.kernel.org/r/ZdvXAn1Q%2F+QX5sPQ@dread.disaster.area/

To wit, there are two cases here: either the desired untorn write
granularity is smaller than the large block size, in which case there
is really nothing that needs to be done from an API perspective.
Alternatively, if the desired untorn granularity is *larger* than the
large block size, then the API considerations are the same with or
without LBS support.

From the implementation perspective, yes, there is a certain amount of
commonality, but that to me is relatively trivial --- or at least, it
isn't a particularly subtle design.  That is, the writeback code needs
to know what the desired write granularity is, whether it is required
by the device because the logical sector size is larger than the page
size, or because there is an untorn write granularity requested by the
userspace process doing the writing (in practice, pretty much always
16k for databases).  In terms of what the writeback code needs to do,
it needs to make sure that it gathers up pages respecting the alignment
and required size, and if a page is locked, we have to wait until it is
available, instead of skipping that page as we would in the case of a
non-data-integrity writeback.

As far as tooling/testing is concerned, again, it appears to me that
the requirements of LBS and the desire for untorn writes in units of
granularity larger than the block size are quite orthogonal.

For LBS, all you need is some kind of synthetic/debug device which has
a logical block size larger than the page size.  This could be done a
number of ways:

    * via the VMM --- e.g., a QEMU block device that has a 64k logical
      sector size.
    * via a loop device that exports a larger logical sector size
    * via blktrace (or its eBPF or ftrace equivalent), making sure
      that the size of every write request is the right multiple of
      512-byte sectors

For testing untorn writes, life is a bit trickier, because not all
writes will be larger than the page size.  For example, we might have
an ext4 file system with a 4k blocksize, so metadata writes to the
inode table, etc., will be in 4k writes.  However, when writing to the
database file, *those* writes need to be in multiples of 16k, with 16k
alignment required, and if a write needs to be broken up, it must be
at a 16k boundary.

The tooling for this, which is untorn-write specific and completely
irrelevant for the LBS case, needs to know which parts of the storage
device are assigned to the database file --- and which are not.  If the
database file is not getting deleted or truncated, it's relatively
easy to take a blktrace (or eBPF or ftrace equivalent) and validate
all of the I/Os after the fact.  The tooling to do this isn't terribly
complicated; it would involve using filefrag -v if the file system is
already mounted, and a file system specific tool (i.e., debugfs for
ext4, or xfs_db for xfs) if the file system is not mounted.

Cheers,

					- Ted
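
A rough sketch of the after-the-fact check described above, for
illustration only: given the device byte ranges that belong to the
database file and a list of traced write requests, flag any write that
touches the file without starting and ending on a 16k boundary.  The
function name, the tuple layouts, and the sample numbers are all
hypothetical.  Parsing of the filefrag -v and blktrace output is left
out, and the sketch assumes the extents are themselves 16k-aligned and
that extents and trace offsets refer to the same block device.

```python
from typing import List, Tuple

SECTOR = 512          # block traces report offsets and sizes in 512-byte sectors
GRANULE = 16 * 1024   # assumed untorn write granularity for the database file

Extent = Tuple[int, int]  # (start byte on the device, length in bytes)
Write = Tuple[int, int]   # (start sector, number of sectors)


def misaligned_writes(extents: List[Extent], writes: List[Write],
                      granule: int = GRANULE) -> List[Write]:
    """Return the writes that touch the file's extents without respecting
    the granule: the part of a write that falls inside an extent must
    start and end on a granule boundary."""
    bad = []
    for sector, nsectors in writes:
        start = sector * SECTOR
        end = start + nsectors * SECTOR
        for ext_start, ext_len in extents:
            lo = max(start, ext_start)
            hi = min(end, ext_start + ext_len)
            if lo >= hi:
                continue  # no overlap: metadata or another file, ignore it
            if lo % granule or hi % granule:
                bad.append((sector, nsectors))
                break  # report each write at most once
    return bad


if __name__ == "__main__":
    db_extents = [(1 << 20, 64 * 1024)]       # one 64k extent starting at 1 MiB
    trace = [(2048, 32), (2080, 8), (0, 8)]   # 16k write, 4k write, 4k metadata
    print(misaligned_writes(db_extents, trace))
```

Writes that fall entirely outside the file's extents, such as 4k
metadata updates, are deliberately ignored, since only the database
file carries the 16k requirement.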