From: "Theodore Ts'o" <tytso@mit.edu>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Anna Schumaker <anna.schumaker@oracle.com>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
Date: Thu, 16 Jan 2025 12:30:00 -0500
Message-ID: <20250116173000.GA2479310@mit.edu>
In-Reply-To: <5fdc7575-aa3d-4b37-9848-77ecf8f0b7d6@oracle.com>

On Thu, Jan 16, 2025 at 10:45:01AM -0500, Chuck Lever wrote:
> 
> Any database that uses a block size that is larger than the block
> size of the underlying storage media is at risk of a torn write.
> The purpose of WRITE_SAME is to demark the database blocks with
> sentinels on each end of the database block containing a time
> stamp or hash.

There are alternate solutions which various databases use to address
the torn write problem:

   * DIF/DIX (although this is super expensive, so this has fallen out
     of favor)
   * In-line checksums in the database block; this approach is fairly
     common for enterprise databases (interestingly, Google's cluster
     file systems, which don't need to support mmap, do this as well)
   * Double-buffered writes using a journal (this is what open source
     databases tend to use)
   * For software-defined cloud block devices (such as Google's
     Persistent Disk, Amazon EBS, etc.) and some NVMe devices,
     aligned writes can be guaranteed up to some write granularity
     (typically up to 32k to 64k, although pretty much all database
     pages today are 16k).  This is actively fielded as
     customer-available products and/or in development in at least
     two first-party cloud database products based on MySQL and/or
     Postgres; and there are some active patches which John Garry
     has been working on so that users can use this technique
     without having to rely on first party cloud product teams
     knowing implementation details of their cloud block devices.
     (This has been discussed in past LSF/MM sessions.)
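
To make the in-line checksum item concrete, here is a minimal sketch
of the idea, not any particular database's on-disk format.  The 16k
page size comes from the figure above; the CRC32 trailer layout, the
4k sector size, and the names seal_page(), page_is_intact() and
simulate_torn_write() are assumptions made up for illustration.

```python
import struct
import zlib

PAGE_SIZE = 16 * 1024        # assumed database page size (16k, as above)
CSUM = struct.Struct("<I")   # hypothetical layout: 4-byte CRC32 trailer


def seal_page(payload: bytes) -> bytes:
    """Pad payload to a full page and append a CRC32 of the body."""
    body_len = PAGE_SIZE - CSUM.size
    if len(payload) > body_len:
        raise ValueError("payload too large for one page")
    body = payload.ljust(body_len, b"\0")
    return body + CSUM.pack(zlib.crc32(body))


def page_is_intact(page: bytes) -> bool:
    """Return True if the stored checksum matches the page body."""
    if len(page) != PAGE_SIZE:
        return False
    body, trailer = page[:-CSUM.size], page[-CSUM.size:]
    (stored,) = CSUM.unpack(trailer)
    return zlib.crc32(body) == stored


def simulate_torn_write(old_page: bytes, new_page: bytes,
                        sector: int = 4096, sectors_written: int = 1) -> bytes:
    """Model a crash after only the first few sectors of new_page landed."""
    cut = sector * sectors_written
    return new_page[:cut] + old_page[cut:]
```

A reader that finds page_is_intact() returning False has to treat the
page as torn (or otherwise corrupt) and fall back to whatever recovery
path the database has, such as a journal copy.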

> If, when read back, the sentinels match, the whole database
> block is good to go. If they do not, then the block is torn
> and recovery is necessary.
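
For reference, the sentinel check quoted above can be sketched as
follows.  This only illustrates the read-back comparison; it says
nothing about how the stamps get written, and the 8-byte stamp, the
16k block size, and the names stamp_block() and block_is_torn() are
hypothetical.

```python
import struct

BLOCK_SIZE = 16 * 1024          # assumed database block size
SENTINEL = struct.Struct("<Q")  # hypothetical: 8-byte stamp at each end


def stamp_block(payload: bytes, stamp: int) -> bytes:
    """Wrap payload with the same stamp at the head and tail of the block."""
    room = BLOCK_SIZE - 2 * SENTINEL.size
    if len(payload) > room:
        raise ValueError("payload too large for one block")
    mark = SENTINEL.pack(stamp)
    return mark + payload.ljust(room, b"\0") + mark


def block_is_torn(block: bytes) -> bool:
    """A block is considered torn if its two sentinels disagree."""
    if len(block) != BLOCK_SIZE:
        return True
    head = block[:SENTINEL.size]
    tail = block[-SENTINEL.size:]
    return head != tail
```

Note that comparing only the two ends catches a tear where the head
and tail come from different writes; it cannot detect a middle sector
that was left stale while both ends were updated.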

Are there some database teams that are actively working on a scheme
based on WRITE SAME?  I have talked to open source developers on the
MySQL and Postgres teams, as well as the first party cloud product
teams at my company and some storage architects at competitor cloud
companies, and no one has mentioned any efforts involving WRITE SAME.
Of course, maybe I simply haven't come across such plans, especially
if they are under some deep, dark NDA.  :-)

However, given that support for WRITE SAME is fairly rare (like
DIF/DIX it's only available if you are willing to pay $$$$ for your
storage, because it's a specialized feature that storage vendors like
to charge a lot for), I'm a bit surprised that there are database
groups that would be interested in relying on such a feature, since it
tends not to be commonly available.

If there are real-world potential users, go wild, but at least for the
use cases and databases that I'm aware of, the FALLOC_FL_WRITE_ZEROS
and atomic writes patch series (it's really untorn writes but we seem
to have lost that naming battle) are all that we need.

Cheers,

						- Ted

