linux-fsdevel.vger.kernel.org archive mirror
From: John Garry <john.g.garry@oracle.com>
To: Theodore Ts'o <tytso@mit.edu>, lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes
Date: Mon, 11 Mar 2024 08:42:25 +0000
Message-ID: <238b05a9-6d25-4721-93f7-d15b6c0d2620@oracle.com>
In-Reply-To: <20240228061257.GA106651@mit.edu>

On 28/02/2024 06:12, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgres database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.

Having done some research, I found that Postgres uses a fixed "page" size
per file, typically 8KB, which is configured at compile time. The page
size may differ between certain file types, but it is possible to
configure all file types for the same page size. This all seems like
standard DB stuff.

So, as I mentioned in response to Matthew here:
https://lore.kernel.org/linux-scsi/47d264c2-bc97-4313-bce0-737557312106@oracle.com/

.. for untorn buffered writes support, we could just set 
atomic_write_unit_min = atomic_write_unit_max = FS file alignment 
granule = DB page size. That would seem easier to support in the page 
cache and still provide the RWF_ATOMIC guarantee. For ext4, bigalloc 
cluster size could be this FS file alignment granule. For XFS, it would 
be the extsize with forcealign.
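
As a rough illustration of why a single fixed unit is simpler to reason
about, here is a minimal userspace sketch of the check that would apply to
each untorn buffered write. It is not taken from any posted patch: the
function name is hypothetical, and it assumes the granule is a non-zero
power of two (as an 8KB DB page size would be).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function.
 *
 * With atomic_write_unit_min == atomic_write_unit_max == unit, the only
 * acceptable untorn write is exactly one unit long and naturally aligned.
 */
bool untorn_buffered_write_ok(uint64_t pos, uint64_t len, uint64_t unit)
{
	/* Assumption: the granule is a non-zero power of two (e.g. 8192). */
	assert(unit != 0 && (unit & (unit - 1)) == 0);

	if (len != unit)
		return false;

	/* Natural alignment: the file offset is a multiple of the unit. */
	return (pos & (unit - 1)) == 0;
}
```

With an 8KB granule, an 8KB write at offset 8192 passes the check, while
an 8KB write at offset 4096 or a 16KB write at offset 0 does not. The page
cache would then never need to remember per-write sizes, only the one
per-file unit.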

It might be argued that we would like to submit larger untorn write IOs 
from userspace for performance benefit and allow the kernel to split on 
some page boundary, but I doubt that this will be utilised by userspace. 
On the other hand, the block atomic writes kernel series does support 
block layer merging (of atomic writes).
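
For reference, the splitting idea mentioned above could look roughly like
the following sketch. The function name is hypothetical and the behaviour
is an assumption for illustration: a larger request is only accepted if it
starts on a unit boundary and is a whole number of units, and each
resulting piece would be untorn on its own (the request as a whole would
not be atomic).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function.
 *
 * Return how many unit-sized untorn writes a larger request would be
 * split into, or 0 if it cannot be split cleanly on unit boundaries.
 */
uint64_t untorn_split_count(uint64_t pos, uint64_t len, uint64_t unit)
{
	/* Assumption: the unit is a non-zero power of two. */
	assert(unit != 0 && (unit & (unit - 1)) == 0);

	/* Reject empty requests and anything not aligned to the unit. */
	if (len == 0 || (pos & (unit - 1)) != 0 || (len & (unit - 1)) != 0)
		return 0;

	return len / unit;
}
```

Under these assumptions, a 64KB write at offset 0 with an 8KB unit would
be treated as eight independent untorn 8KB writes.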

As for advertising untorn buffered write capability, the current statx
field updates for atomic writes are here:
https://lore.kernel.org/linux-api/20240124112731.28579-2-john.g.garry@oracle.com/

Only direct IO support is mentioned there. For supporting buffered IO, I 
suppose an additional flag can be added for getting buffered IO info, 
like STATX_ATTR_WRITE_ATOMIC_BUFFERED, and reuse atomic_write_unit_{min, 
max, segments_max} fields for buffered IO. Setting the direct IO and 
buffered IO flags would be mutually exclusive.
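
A minimal sketch of how userspace might interpret such flags follows. The
bit values and names prefixed with DEMO_ are placeholders for illustration
only, not the kernel's UAPI values, and the buffered flag itself is only
the suggestion made above, not an existing interface.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bits for illustration; NOT the real statx attribute values. */
#define DEMO_ATTR_WRITE_ATOMIC          (UINT64_C(1) << 0) /* direct IO flag */
#define DEMO_ATTR_WRITE_ATOMIC_BUFFERED (UINT64_C(1) << 1) /* suggested buffered IO flag */

static_assert((DEMO_ATTR_WRITE_ATOMIC & DEMO_ATTR_WRITE_ATOMIC_BUFFERED) == 0,
	      "the two flags must be distinct bits");

enum untorn_mode {
	UNTORN_NONE,     /* no untorn write support advertised */
	UNTORN_DIRECT,   /* unit fields describe direct IO */
	UNTORN_BUFFERED, /* unit fields describe buffered IO */
	UNTORN_INVALID   /* both set: violates mutual exclusion */
};

/* Hypothetical helper: decide which IO mode the shared unit fields describe. */
enum untorn_mode untorn_mode_from_attrs(uint64_t attrs)
{
	int direct = (attrs & DEMO_ATTR_WRITE_ATOMIC) != 0;
	int buffered = (attrs & DEMO_ATTR_WRITE_ATOMIC_BUFFERED) != 0;

	if (direct && buffered)
		return UNTORN_INVALID;
	if (direct)
		return UNTORN_DIRECT;
	if (buffered)
		return UNTORN_BUFFERED;
	return UNTORN_NONE;
}
```

Because the same min/max fields would be reused for both modes, the
mutual exclusion is what tells the caller which mode the reported values
apply to.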

Is there any anticipated problem with this idea?

On another topic, there is some development work to allow Postgres to use
direct IO, see:
https://wiki.postgresql.org/wiki/AIO

Assuming all the info there is accurate and up to date, that work still
seems to be lagging behind kernel untorn write support.

John
