From: "Theodore Ts'o" <tytso@mit.edu>
To: Amir Goldstein <amir73il@gmail.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-mm <linux-mm@kvack.org>, Jan Kara <jack@suse.cz>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] untorn buffered writes
Date: Wed, 28 Feb 2024 14:21:03 -0600
Message-ID: <20240228202103.GA177082@mit.edu>
In-Reply-To: <CAOQ4uxhZ5KOTdi01C87wYwvB_K=HDYdLy7LHzXnC-C3U_OFEnQ@mail.gmail.com>
On Wed, Feb 28, 2024 at 01:38:44PM +0200, Amir Goldstein wrote:
>
> Seems a duplicate of this topic proposed by Luis?
>
> https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@bombadil.infradead.org/
Maybe. I did see Luis's topic, but it seemed to me to be largely
orthogonal to what I was interested in talking about. Maybe I'm
missing something, but my observations were largely similar to Dave
Chinner's comments here:
https://lore.kernel.org/r/ZdvXAn1Q%2F+QX5sPQ@dread.disaster.area/
To wit, there are two cases here: either the desired untorn write
granularity is smaller than the large block size, in which case there
is really nothing that needs to be done from an API perspective.
Alternatively, if the desired untorn granularity is *larger* than the
large block size, then the API considerations are the same with or
without LBS support.
From the implementation perspective, yes, there is a certain amount of
commonality, but that to me is relatively trivial --- or at least, it
isn't a particularly subtle design. That is, the writeback code needs
to know what the desired write granularity is, whether it is
required by the device because the logical sector size is larger than
the page size, or because there is an untorn write granularity
requested by the userspace process doing the writing (in practice,
pretty much always 16k for databases). In terms of what the writeback
code needs to do, it needs to make sure that it gathers up pages
respecting the alignment and required size, and if a page is locked,
it has to wait until that page is available, instead of skipping it
as it would in the case of a non-data-integrity writeback.
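
As a rough illustration of "gathering up pages respecting the
alignment and required size", here is a minimal Python sketch. It is
not kernel code; the function name untorn_units and its parameters
are hypothetical. It only shows the arithmetic: a dirty byte range is
widened to whole, naturally aligned units of the write granularity,
and every page inside each unit would have to be written together.

```python
def untorn_units(start, end, granularity=16 * 1024):
    """Return (offset, length) units covering the dirty range [start, end).

    Hypothetical helper: each unit is exactly `granularity` bytes long
    and starts on a multiple of `granularity`.
    """
    if granularity <= 0 or start < 0 or end <= start:
        raise ValueError("need 0 <= start < end and granularity > 0")
    first = (start // granularity) * granularity   # round start down
    last = -(-end // granularity) * granularity    # round end up
    return [(off, granularity) for off in range(first, last, granularity)]


if __name__ == "__main__":
    # A dirty range from 20k to 36k touches two 16k units.
    for off, length in untorn_units(20 * 1024, 36 * 1024):
        print(off, length)
```

The same arithmetic would apply whether the granularity comes from the
device's logical sector size or from a userspace request.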
As far as tooling/testing is concerned, again, it appears to me that
the requirements of LBS and the desire for untorn writes in units of
granularity larger than the block size are quite orthogonal. For LBS,
all you need is some kind of synthetic/debug device which has a
logical block size larger than the page size. This could be done a
number of ways:
* via the VMM --- e.g., a QEMU block device that has a 64k logical
sector size.
* via loop device that exports a larger logical sector size
* via blktrace (or its ebpf or ftrace equivalent) and making sure that
the size of every write request is the right multiple of 512 byte sectors
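
The last check in the list above could be sketched roughly as follows.
This is a minimal Python illustration, assuming the trace has already
been parsed into (start_sector, nr_sectors) pairs for write requests;
the function name and the input format are hypothetical, and parsing
real blktrace output is left out. Checking the start sector's
alignment as well as the size is an added assumption.

```python
SECTOR_SIZE = 512  # trace tools report offsets and sizes in 512-byte sectors


def find_lbs_violations(writes, logical_block_size=64 * 1024):
    """Return the writes that are not whole, aligned logical blocks.

    `writes` is an iterable of (start_sector, nr_sectors) tuples
    (hypothetical format, already extracted from a trace).
    """
    if logical_block_size <= 0 or logical_block_size % SECTOR_SIZE:
        raise ValueError("logical block size must be a multiple of 512")
    per_block = logical_block_size // SECTOR_SIZE
    return [
        (sector, count)
        for sector, count in writes
        if count == 0 or sector % per_block or count % per_block
    ]


if __name__ == "__main__":
    trace = [(0, 128), (128, 256), (8, 128), (256, 64)]
    print(find_lbs_violations(trace))
```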
For testing untorn writes, life is a bit trickier, because not all
writes will be larger than the page size. For example, we might have
an ext4 file system with a 4k blocksize, so metadata writes to the
inode table, etc., will be in 4k writes. However, when writing to the
database file, *those* writes need to be in multiples of 16k, with 16k
alignment required, and if a write needs to be broken up it must be at
a 16k boundary.
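
To make the "broken up at a 16k boundary" rule concrete, here is a
small Python sketch, again only an illustration with hypothetical
names. It assumes a per-request size limit (max_len) and splits an
aligned write into pieces that each remain a multiple of the
granularity, so that no 16k unit is ever divided between two requests.

```python
def split_untorn(offset, length, max_len, granularity=16 * 1024):
    """Split an aligned write into pieces that never divide a unit.

    Hypothetical helper: `max_len` stands in for whatever request size
    limit applies; each piece starts and ends on a granularity boundary.
    """
    if granularity <= 0 or length <= 0:
        raise ValueError("granularity and length must be positive")
    if offset % granularity or length % granularity:
        raise ValueError("write is not aligned to the untorn granularity")
    chunk = (max_len // granularity) * granularity  # largest legal piece
    if chunk == 0:
        raise ValueError("max_len is smaller than one untorn unit")
    pieces = []
    end = offset + length
    while offset < end:
        n = min(chunk, end - offset)
        pieces.append((offset, n))
        offset += n
    return pieces


if __name__ == "__main__":
    # A 64k write with a 40k limit is split into two 32k pieces.
    print(split_untorn(0, 64 * 1024, 40 * 1024))
```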
The tooling for this, which is untorn write specific, and completely
irrelevant for the LBS case, needs to know which parts of the storage
device are assigned to the database file --- and which are not. If
the database file is not getting deleted or truncated, it's relatively
easy to take a blktrace (or ebpf or ftrace equivalent) and validate
all of the I/Os after the fact. The tooling to do this isn't
terribly complicated; it would involve using filefrag -v if the file
system is already mounted, and a file system specific tool (e.g.,
debugfs for ext4, or xfs_db for xfs) if the file system is not mounted.
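
A minimal Python sketch of such an after-the-fact check follows. It
assumes the database file's extents and the traced writes have already
been converted to (device_byte_offset, length) pairs; the function
names and input formats are hypothetical, and parsing filefrag -v or
blktrace output is not shown. It also assumes the file's extents are
themselves 16k-aligned on the device. Writes that do not touch the
database file (such as 4k metadata writes) are ignored.

```python
def overlaps(a_start, a_len, b_start, b_len):
    """True if byte ranges [a_start, a_start+a_len) and b intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def possibly_torn_writes(writes, db_extents, granularity=16 * 1024):
    """Return writes that touch the database file but are misaligned.

    `writes` and `db_extents` are lists of (device_byte_offset, length)
    tuples (hypothetical format).
    """
    bad = []
    for start, length in writes:
        in_db = any(overlaps(start, length, es, el) for es, el in db_extents)
        if in_db and (start % granularity or length % granularity):
            bad.append((start, length))
    return bad


if __name__ == "__main__":
    extents = [(1 << 20, 64 * 1024)]      # one 64k extent at 1 MiB
    trace = [
        (4096, 4096),                     # metadata write, ignored
        ((1 << 20), 16 * 1024),           # aligned 16k write, fine
        ((1 << 20) + 4096, 16 * 1024),    # misaligned start
        ((1 << 20) + 16384, 8192),        # not a multiple of 16k
    ]
    print(possibly_torn_writes(trace, extents))
```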
Cheers,
- Ted