From: "Theodore Ts'o" <tytso@mit.edu>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>,
Anna Schumaker <anna.schumaker@oracle.com>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
Date: Thu, 16 Jan 2025 10:36:49 -0500
Message-ID: <20250116153649.GC2446278@mit.edu>
In-Reply-To: <21c7789f-2d59-42ce-8fcc-fd4c08bcb06f@oracle.com>
On Thu, Jan 16, 2025 at 08:59:19AM -0500, Chuck Lever wrote:
>
> See my previous reply in this thread: WRITE_SAME has a long-standing
> existing use case in the database world. The NFSv4.2 WRITE_SAME
> operation was designed around this use case.
>
> You remember database workloads, right? ;-)
My understanding is that the database use case maps onto BLKZEROOUT
--- specifically, databases want to be able to extend a tablespace
file, and what they want to be able to do is to allocate a contiguous
range using fallocate(2), but then want to make sure that the blocks
in that range are marked as initialized so that future writes to the
file do not require metadata updates when fsync(2) is called.
Enterprise databases like Oracle and DB2 have been doing this for
decades; and just in the past two months I've had representatives
from certain open source databases ask for something like
FALLOC_FL_WRITE_ZEROES.
So yes, I'm very much aware of database workloads --- but all they
need is to write zeros so that a file range freshly allocated using
fallocate(2) is marked as initialized. They do not need the more
expansive features defined by SCSI or NFSv4.2. In all of the use
cases I've seen from Oracle, DB2, and the various open source
databases which have approached me, they are typically allocating a
chunk of aligned space (say, 32MiB) and then they want to initialize
this range of blocks.
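For illustration, here is a minimal userspace sketch of the allocate-then-zero pattern described above, roughly what a database has to do by hand today. The helper name extend_and_zero() and the 1 MiB write size are assumptions made for this example, not anything from the thread. It uses posix_fallocate(3), which glibc implements on top of fallocate(2) on Linux, and plain buffered pwrite(2) calls. It does not use the FALLOC_FL_WRITE_ZEROES flag mentioned above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): preallocate the byte range
 * [offset, offset + len), overwrite it with zeros so the filesystem can
 * mark the blocks as initialized, then flush the data to stable storage.
 * Returns 0 on success, -1 with errno set on failure.
 */
int extend_and_zero(int fd, off_t offset, off_t len)
{
    const size_t chunk = 1 << 20;   /* 1 MiB per write; arbitrary choice */
    off_t done = 0;
    char *zeros;
    int err;

    /* Step 1: reserve the space. posix_fallocate() returns an errno value. */
    err = posix_fallocate(fd, offset, len);
    if (err != 0) {
        errno = err;
        return -1;
    }

    zeros = calloc(1, chunk);
    if (zeros == NULL)
        return -1;

    /* Step 2: write zeros over the freshly allocated range. */
    while (done < len) {
        size_t want = (off_t)chunk < len - done ? chunk : (size_t)(len - done);
        ssize_t n = pwrite(fd, zeros, want, offset + done);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, retry the same chunk */
        if (n <= 0) {
            err = (n == 0) ? EIO : errno;
            free(zeros);
            errno = err;
            return -1;
        }
        done += n;                  /* short writes simply loop again */
    }
    free(zeros);

    /* Step 3: push the zeros (and any needed metadata) to the device. */
    return fdatasync(fd);
}
```

A caller extending a tablespace file by the 32MiB chunk mentioned above would do something like extend_and_zero(fd, old_size, (off_t)32 << 20). The point of a kernel-side interface is to let the filesystem do steps 2 and 3 itself, ideally offloaded to the device, instead of pushing all of those zeros through userspace.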
This then doesn't require poison sentinels, since it's strictly
speaking an optimization. The extent tree doesn't get marked as
initialized until the zero-write has been committed to the block device
via a CACHE FLUSH. If we crash before this happens, reads from the
file will get zeros, and writes to the blocks that didn't get
initialized will still work, but the fsync(2) might trigger a
filesystem-level journal commit. This isn't a disaster....
Now, there might be some database that needs something more
complicated, but I'm not aware of them. If you know of any, is that
something that you are able to share?
Cheers,
- Ted