From: John Garry <john.g.garry@oracle.com>
To: Theodore Ts'o <tytso@mit.edu>, lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes
Date: Mon, 11 Mar 2024 08:42:25 +0000
Message-ID: <238b05a9-6d25-4721-93f7-d15b6c0d2620@oracle.com>
In-Reply-To: <20240228061257.GA106651@mit.edu>
On 28/02/2024 06:12, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes. Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag. In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time. This very quickly becomes a mess.
From some research, postgres uses a fixed "page" size per file, typically
8KB, which is configured at compile time. The page size may differ
between certain file types, but it is possible to configure all file
types for the same page size. This all seems like standard DB stuff.
So, as I mentioned in response to Matthew here:
https://lore.kernel.org/linux-scsi/47d264c2-bc97-4313-bce0-737557312106@oracle.com/
... for untorn buffered write support, we could just set
atomic_write_unit_min = atomic_write_unit_max = FS file alignment
granule = DB page size. That would seem easier to support in the page
cache and would still provide the RWF_ATOMIC guarantee. For ext4, the
bigalloc cluster size could be this FS file alignment granule. For XFS,
it would be the extsize with forcealign.
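
As a rough illustration of that rule (not kernel code), the check applied to a buffered RWF_ATOMIC write when min = max = granule might look like the sketch below. The function name is_valid_untorn_write and the 8KB default are assumptions made up for this example.

```python
DB_PAGE_SIZE = 8192  # assumed: typical postgres page size, used as the granule


def is_valid_untorn_write(offset: int, length: int, unit: int = DB_PAGE_SIZE) -> bool:
    """Return True if a write fits the min == max == granule rule.

    With atomic_write_unit_min == atomic_write_unit_max == unit, the only
    acceptable write is exactly one unit long and aligned to that unit.
    """
    if unit <= 0 or unit & (unit - 1):
        # Assumption for this sketch: the unit is a power of two.
        raise ValueError("unit must be a positive power of two")
    return length == unit and offset % unit == 0
```

With a single fixed size, the page cache would not need to remember a different "atomic" extent per write, which is the bookkeeping problem described in the quoted text.
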
It might be argued that we would like to submit larger untorn write IOs
from userspace for performance benefit and allow the kernel to split on
some page boundary, but I doubt that this will be utilised by userspace.
On the other hand, the block atomic writes kernel series does support
block layer merging (of atomic writes).
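
For reference, splitting a larger submission on a page boundary could be sketched as follows. This is only my reading of that alternative, with a hypothetical helper name split_untorn; each returned piece would be untorn on its own, while the submission as a whole would not be atomic.

```python
def split_untorn(offset: int, length: int, unit: int) -> list[tuple[int, int]]:
    """Split one large write into unit-sized, unit-aligned pieces.

    Returns a list of (offset, length) pairs. The untorn guarantee in this
    sketch applies to each piece individually, not to the whole range.
    """
    if unit <= 0:
        raise ValueError("unit must be positive")
    if length <= 0 or offset % unit or length % unit:
        raise ValueError("write must be a non-empty, unit-aligned multiple of unit")
    return [(off, unit) for off in range(offset, offset + length, unit)]
```
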
About advertising untorn buffered write capability, the current statx
fields update for atomic writes is here:
https://lore.kernel.org/linux-api/20240124112731.28579-2-john.g.garry@oracle.com/
Only direct IO support is mentioned there. To support buffered IO, I
suppose an additional flag could be added for getting buffered IO info,
like STATX_ATTR_WRITE_ATOMIC_BUFFERED, reusing the
atomic_write_unit_{min, max, segments_max} fields for buffered IO. The
direct IO and buffered IO flags would be mutually exclusive, i.e. a file
would report one or the other, never both.
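
To make the mutually exclusive part concrete, here is a small userspace-style sketch of how a consumer might interpret such a result. The bit values, the AtomicWriteInfo class and atomic_write_mode are placeholders invented for this example; they are not taken from kernel headers or from the statx patch.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder bit values for illustration only, not real kernel constants.
ATTR_WRITE_ATOMIC = 1 << 0           # stands in for the direct IO attribute flag
ATTR_WRITE_ATOMIC_BUFFERED = 1 << 1  # stands in for the suggested buffered flag


@dataclass(frozen=True)
class AtomicWriteInfo:
    """Hypothetical mirror of the statx fields discussed above."""
    attributes: int
    unit_min: int
    unit_max: int
    segments_max: int


def atomic_write_mode(info: AtomicWriteInfo) -> Optional[str]:
    """Say which IO path the shared unit_{min,max} fields describe."""
    direct = bool(info.attributes & ATTR_WRITE_ATOMIC)
    buffered = bool(info.attributes & ATTR_WRITE_ATOMIC_BUFFERED)
    if direct and buffered:
        # The fields are shared, so both flags at once would be ambiguous.
        raise ValueError("direct and buffered atomic flags are mutually exclusive")
    if direct:
        return "direct"
    if buffered:
        return "buffered"
    return None  # no atomic write support advertised
```

Because the unit fields are reused, the flag is what tells the caller whether the reported limits apply to direct IO or to buffered IO.
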
Is there any anticipated problem with this idea?
On another topic, there is some development to allow postgres to use
direct IO, see:
https://wiki.postgresql.org/wiki/AIO
Assuming all info there is accurate and up to date, that work does still
seem to be lagging behind kernel untorn write support.
John