Re: [LSF/MM/BPF TOPIC] untorn buffered writes

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: John Garry <john.g.garry@oracle.com>
To: Theodore Ts'o <tytso@mit.edu>, lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes
Date: Wed, 28 Feb 2024 16:06:43 +0000	[thread overview]
Message-ID: <b184a072-86ef-462b-a6da-c2537299aa59@oracle.com> (raw)
In-Reply-To: <20240228061257.GA106651@mit.edu>

On 28/02/2024 06:12, Theodore Ts'o wrote:
> Last year, I talked about an interest to provide database such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
> 
> [1] https://urldefense.com/v3/__https://lwn.net/Articles/932900/__;!!ACWV5N9M2RV99hQ!Ij_ZeSZrJ4uPL94Im73udLMjqpkcZwHmuNnznogL68ehu6TDTXqbMsC4xLUqh18hq2Ib77p1D8_4mV5Q$
> 
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail.  In this interface, if the userspace sends an 128k
> write with the RWF_ATOMIC flag, if the storage device will support
> that an all-or-nothing write with the given size and alignment the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happen on a 16k boundary.  That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.

Note that the initial RFC for my series did propose an interface that 
does allow a write to be split in the kernel on a boundary, and that 
boundary was evaluated on a per-write basis by the length and alignment 
of the write along with any extent alignment granularity.

We decided not to pursue that, and instead require a write per 16K page, 
for the example above.

> 
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem top come up with a credible use case
> that would require this.
> 
> 
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess. >
> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)

If you check the latest discussion on XFS support we are proposing 
something along those lines:
https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@dread.disaster.area/

There FS_IOC_FSSETXATTR would be used to set extent size w/ 
fsx.fsx_extsize and new flag FS_XGLAG_FORCEALIGN to guarantee extent 
alignment, and this alignment would be the largest untorn write granularity.

Note that I already got push back on using fcntl for this.

So whether you are more interested in ext4 and how ext4 can adopt that 
API is another matter.. but I did consider adding something like struct 
inode.i_blkbits for this untorn write granularity, so an FS would just 
need to set that. But I am not proposing that ATM.

> 
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
> 
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
> 

Thanks,
John

next prev parent reply	other threads:[~2024-02-28 16:07 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
2024-02-28 11:38 ` [Lsf-pc] " Amir Goldstein
2024-02-28 20:21   ` Theodore Ts'o
2024-02-28 14:11 ` Matthew Wilcox
2024-02-28 23:33   ` Theodore Ts'o
2024-02-29  1:07     ` Dave Chinner
2024-02-28 16:06 ` John Garry [this message]
2024-02-28 23:24   ` Theodore Ts'o
2024-02-29 16:28     ` John Garry
2024-02-29 21:21       ` Ritesh Harjani
2024-02-29  0:52 ` Dave Chinner
2024-03-11  8:42 ` John Garry
2024-05-15 19:54 ` John Garry
2024-05-22 21:56   ` Luis Chamberlain
2024-05-23 11:59     ` John Garry
2024-06-01  9:33       ` Theodore Ts'o
2024-06-11 15:23         ` John Garry
2024-05-23 12:59   ` Christoph Hellwig
2024-05-28  9:21     ` John Garry
2024-05-28 10:57       ` Christoph Hellwig
2024-05-28 11:09         ` John Garry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b184a072-86ef-462b-a6da-c2537299aa59@oracle.com \
    --to=john.g.garry@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).