* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-14 21:38 [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ? Anna Schumaker
@ 2025-01-14 23:14 ` Dave Chinner
2025-01-16 5:42 ` Christoph Hellwig
2025-01-15 2:10 ` Darrick J. Wong
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2025-01-14 23:14 UTC (permalink / raw)
To: Anna Schumaker; +Cc: lsf-pc, linux-fsdevel, Linux NFS Mailing List
[Please word wrap email text at 68-72 columns]
Anna, I think we need to consider how to integrate this
functionality across the entire storage stack, not just for NFS
client/server optimisation. My comments are made with this in mind.
On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME
> [1] operation over the last few months [2][3] to accelerate
> writing patterns of data on the server, so it's been in the back
> of my mind for a future project. I'll need to write some code
> somewhere so NFS & NFSD can handle this request. I could keep any
> implementation internal to NFS / NFSD, but I'd like to find out if
> local filesystems would find this sort of feature useful and if I
> should put it in the VFS instead.
How closely does this match to the block device WRITE_SAME
(SCSI/NVMe) commands? I note there is a reference to this in the
RFC, but there are no details given.
i.e. is this NFS request something we can pass straight through to
the server side storage hardware if it supports hardware WRITE_SAME
commands, or do they have incompatible semantics?
If the two are compatible, then I think we really want server side
hardware offload to be possible. That requires the filesystem to
allocate/map the physical storage and then call into the block layer
to either offload it to the hardware or emulate it in software
(similar to how blkdev_issue_zeroout() works).
> I was thinking I could keep it simple, and model a function call
> based on write(3) / pwrite(3) to write some pattern N times
> starting at either the file's current offset or at a user-provided
> offset. Something like:
>
> write_pattern(int filedes, const void *pattern, size_t nbytes, size_t count);
> pwrite_pattern(int filedes, const void *pattern, size_t nbytes, size_t count, off_t offset);
Apart from noting that pwritev2(RWF_ENCODED) would have been able to
support this, I'll let other people decide what the best
user/syscall API will be for this.
> I could then construct a WRITE_SAME call in the NFS client using
> this information. This seems "good enough" to me for what people
> have asked for, at least as a client-side interface. It wouldn't
> really help the server, which would still need to do several
> writes in a loop to be spec-compliant with writing the pattern to
> an offset inside the "application data block" [4] structure.
Right, so we need both NFS client side and server side local fs
support for the WRITE_SAME operation.
That implies we should implement it at the VFS as a file method.
i.e. ->write_same() at a similar layer to ->write_iter().
If we do that, then both the NFS client and the NFS server can use
the same VFS interface, and applications can use WRITE_SAME on both
NFS and local filesystems directly...
> But maybe I'm simplifying this too much, and others would find the
> additional application data block fields useful? Or should I keep
> it all inside NFS, and call it with an ioctl instead of putting it
> into the VFS?
I think a file method for VFS implementation is the right way to do
this because it allows both client side server offload and server
side hardware offload through the local filesystem. It also provides
a simple way to check if the filesystem supports the functionality
or not...
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-14 23:14 ` Dave Chinner
@ 2025-01-16 5:42 ` Christoph Hellwig
2025-01-16 13:37 ` Theodore Ts'o
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-01-16 5:42 UTC (permalink / raw)
To: Dave Chinner
Cc: Anna Schumaker, lsf-pc, linux-fsdevel, Linux NFS Mailing List
On Wed, Jan 15, 2025 at 10:14:56AM +1100, Dave Chinner wrote:
> How closely does this match to the block device WRITE_SAME
> (SCSI/NVMe) commands? I note there is a reference to this in the
> RFC, but there are no details given.
There is no write same in NVMe. In one of the few wise choices in
NVMe the protocol only does a write zeroes for zeroing instead of the
overly complex write same. And no one has complained about that so
far.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 5:42 ` Christoph Hellwig
@ 2025-01-16 13:37 ` Theodore Ts'o
2025-01-16 13:59 ` Chuck Lever
0 siblings, 1 reply; 18+ messages in thread
From: Theodore Ts'o @ 2025-01-16 13:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Anna Schumaker, lsf-pc, linux-fsdevel,
Linux NFS Mailing List
On Wed, Jan 15, 2025 at 09:42:29PM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 10:14:56AM +1100, Dave Chinner wrote:
> > How closely does this match to the block device WRITE_SAME
> > (SCSI/NVMe) commands? I note there is a reference to this in the
> > RFC, but there are no details given.
>
> There is no write same in NVMe. In one of the few wise choices in
> NVMe the protocol only does a write zeroes for zeroing instead of the
> overly complex write same. And no one has complained about that so
> far.
It should be noted that there is currently a patch series proposing
to add fallocate support for a FALLOC_FL_WRITE_ZEROS operation:
https://lore.kernel.org/all/20250115114637.2705887-1-yi.zhang@huaweicloud.com/
For those use cases where this is all the user requires, perhaps this
is something that Linux's nfs4 client should consider implementing?
In any case I'd suggest that interested file system developers comment
on this patch series.
Personally, I have no interest in using or implementing a WRITE_SAME
operation with the all-singing, all-dancing semantics envisioned by
the SCSI and NFSv4.2 specifications.
I will also note that many Cloud vendors (AWS, GCE, Azure) are moving
to using NVMe instead of SCSI, especially for the higher performance
VM and software-defined block devices. So, I would suspect that a
customer would have to wave a **very** large amount of money under my
employer's nose before this would be something that would be funded by
$WORK for block-based file systems (and even then, it appears that
NVMe is so much better at higher performance storage, such that I'm
not sure how many customers would really be all that interested).
But hey, if someone knows of some AI-related workload that needs to
write the same non-zero block a very large number of times, let me
know. :-)
Cheers,
- Ted
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 13:37 ` Theodore Ts'o
@ 2025-01-16 13:59 ` Chuck Lever
2025-01-16 15:36 ` Theodore Ts'o
0 siblings, 1 reply; 18+ messages in thread
From: Chuck Lever @ 2025-01-16 13:59 UTC (permalink / raw)
To: Theodore Ts'o, Christoph Hellwig
Cc: Dave Chinner, Anna Schumaker, lsf-pc, linux-fsdevel,
Linux NFS Mailing List
On 1/16/25 8:37 AM, Theodore Ts'o wrote:
> On Wed, Jan 15, 2025 at 09:42:29PM -0800, Christoph Hellwig wrote:
>> On Wed, Jan 15, 2025 at 10:14:56AM +1100, Dave Chinner wrote:
>>> How closely does this match to the block device WRITE_SAME
>>> (SCSI/NVMe) commands? I note there is a reference to this in the
>>> RFC, but there are no details given.
>>
>> There is no write same in NVMe. In one of the few wise choices in
>> NVMe the protocol only does a write zeroes for zeroing instead of the
>> overly complex write same. And no one has complained about that so
>> far.
>
> It should be noted that there is currently a patch series proposing
> to add fallocate support for a FALLOC_FL_WRITE_ZEROS operation:
>
> https://lore.kernel.org/all/20250115114637.2705887-1-yi.zhang@huaweicloud.com/
>
> For those use cases where this is all the user requires, perhaps this
> is something that Linux's nfs4 client should consider implementing?
I've seen one or two other mentions of "let's make the NFS client do
such and such" in this thread.
To be clear: The proposal includes client and server implementation of
the NFSv4.2 WRITE_SAME operation. This is not a client-only thing.
In fact, the most recent requester mentioned only a server
implementation because they have a client that already implements
WRITE_SAME and want this feature in NFSD.
> In any case I'd suggest that interested file system developers comment
> on this patch series.
>
> Personally, I have no interest in using or implementing a WRITE_SAME
> operation with the all-singing, all-dancing semantics envisioned by
> the SCSI and NFSv4.2 specifications.
I think we need to consider a weak generic implementation that resides
in the VFS or a library for file systems that choose not to implement
it themselves.
> I will also note that many Cloud vendors (AWS, GCE, Azure) are moving
> to using NVMe instead of SCSI, especially for the higher performance
> VM and software-defined block devices. So, I would suspect that a
> customer would have to wave a **very** large amount of money under my
> employer's nose before this would be something that would be funded by
> $WORK for block-based file systems (and even then, it appears that
> NVMe is so much better at higher performance storage, such that I'm
> not sure how many customers would really be all that interested).
>
> But hey, if someone knows of some AI-related workload that needs to
> write the same non-zero block a very large number of times, let me
> know. :-)
See my previous reply in this thread: WRITE_SAME has a long-standing
existing use case in the database world. The NFSv4.2 WRITE_SAME
operation was designed around this use case.
You remember database workloads, right? ;-)
--
Chuck Lever
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 13:59 ` Chuck Lever
@ 2025-01-16 15:36 ` Theodore Ts'o
2025-01-16 15:45 ` Chuck Lever
0 siblings, 1 reply; 18+ messages in thread
From: Theodore Ts'o @ 2025-01-16 15:36 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Dave Chinner, Anna Schumaker, lsf-pc,
linux-fsdevel, Linux NFS Mailing List
On Thu, Jan 16, 2025 at 08:59:19AM -0500, Chuck Lever wrote:
>
> See my previous reply in this thread: WRITE_SAME has a long-standing
> existing use case in the database world. The NFSv4.2 WRITE_SAME
> operation was designed around this use case.
>
> You remember database workloads, right? ;-)
My understanding is that the database use case maps onto BLKZEROOUT
--- specifically, databases want to be able to extend a tablespace
file, and what they want to be able to do is to allocate a contiguous
range using fallocate(2), but then want to make sure that the blocks
in the block are marked as initialized so that future writes to the
file do not require metadata updates when fsync(2) is called.
Enterprise databases like Oracle and db2 have been doing this for
decades; and just in the past two months I've had representatives
from certain open source databases ask for something like
FALLOC_FL_WRITE_ZEROES.
So yes, I'm very much aware of database workloads --- but all they
need is to write zeros to mark a file range that was freshly allocated
using fallocate as initialized. They do not need the more expansive
features defined by the SCSI or NFSv4.2 specifications. All of the
use cases from enterprise Oracle, db2, and the various open source
databases which have approached me typically involve allocating a
chunk of aligned space (say, 32MiB) and then initializing that range
of blocks.
This then doesn't require poison sentinels, since it's strictly
speaking an optimization. The extent tree doesn't get marked as
initialized until the zero-write has been committed to the block
device via a CACHE FLUSH. If we crash before this happens, reads from
the file will get zeros, and writes to the blocks that didn't get
initialized will still work, but the fsync(2) might trigger a
filesystem-level journal commit. This isn't a disaster....
Now, there might be some database that needs something more
complicated, but I'm not aware of them. If you know of any, is that
something that you are able to share?
Cheers,
- Ted
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 15:36 ` Theodore Ts'o
@ 2025-01-16 15:45 ` Chuck Lever
2025-01-16 17:30 ` Theodore Ts'o
2025-01-16 21:54 ` Martin K. Petersen
0 siblings, 2 replies; 18+ messages in thread
From: Chuck Lever @ 2025-01-16 15:45 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Christoph Hellwig, Dave Chinner, Anna Schumaker, lsf-pc,
linux-fsdevel, Linux NFS Mailing List
On 1/16/25 10:36 AM, Theodore Ts'o wrote:
> On Thu, Jan 16, 2025 at 08:59:19AM -0500, Chuck Lever wrote:
>>
>> See my previous reply in this thread: WRITE_SAME has a long-standing
>> existing use case in the database world. The NFSv4.2 WRITE_SAME
>> operation was designed around this use case.
>>
>> You remember database workloads, right? ;-)
>
> My understanding is that the database use case maps onto BLKZEROOUT
> --- specifically, databases want to be able to extend a tablespace
> file, and what they want to be able to do is to allocate a contiguous
> range using fallocate(2), but then want to make sure that the blocks
> in the block are marked as initialized so that future writes to the
> file do not require metadata updates when fsync(2) is called.
> Enterprise databases like Oracle and db2 have been doing this for
> decades; and just in the past two months recently I've had
> representatives from certain open source databases ask for something
> like the FALLOC_FL_WRITE_ZEROES.
>
> So yes, I'm very much aware of database workloads --- but all they
> need is to write zeros to mark a file range that was freshly allocated
> using fallocate to be initialized. They do not need the more
> expansive features which as defined by the SCSI or NFSv4.2. All of
> the use cases done by enterprise Oracle, db2, and various open source
> databases which have approached me are typically allocating a chunk
> of aligned space (say, 32MiB) and then they want to initalize this
> range of blocks.
>
> This then doesn't require poison sentinals, since it's strictly
> speaking an optimization. The extent tree doesn't get marked as
> initalized until the zero-write has been commited to the block device
> via a CACHE FLUSH. If we crash before this happens, reads from the
> file will get zeros, and writes to the blocks that didn't get
> initialized will still work, but the fsync(2) might trigger a
> filesystem-level journal commit. This isn't a disaster....
>
> Now, there might be some database that needs something more
> complicated, but I'm not aware of them. If you know of any, is that
> something that you are able to share?
Any database that uses a block size that is larger than the block
size of the underlying storage media is at risk of a torn write.
The purpose of WRITE_SAME is to demark the database blocks with
sentinels on each end of the database block containing a time
stamp or hash.
If, when read back, the sentinels match, the whole database
block is good to go. If they do not, then the block is torn
and recovery is necessary.
--
Chuck Lever
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 15:45 ` Chuck Lever
@ 2025-01-16 17:30 ` Theodore Ts'o
2025-01-16 22:11 ` [Lsf-pc] " Martin K. Petersen
2025-01-16 21:54 ` Martin K. Petersen
1 sibling, 1 reply; 18+ messages in thread
From: Theodore Ts'o @ 2025-01-16 17:30 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Dave Chinner, Anna Schumaker, lsf-pc,
linux-fsdevel, Linux NFS Mailing List
On Thu, Jan 16, 2025 at 10:45:01AM -0500, Chuck Lever wrote:
>
> Any database that uses a block size that is larger than the block
> size of the underlying storage media is at risk of a torn write.
> The purpose of WRITE_SAME is to demark the database blocks with
> sentinels on each end of the database block containing a time
> stamp or hash.
There are alternate solutions which various databases use to address
the torn write problem:
* DIF/DIX (although this is super expensive, so this has fallen out
of favor)
* In-line checksums in the database block; this approach is fairly
common for enterprise databases (interestingly, Google's cluster
file systems, which don't need to support mmap, do this as well)
* Double-buffered writes using a journal (this is what open source
databases tend to use)
* For software-defined cloud block devices (such as Google's
Persistent Disk, Amazon EBS, etc.) and some NVMe devices,
aligned writes can be guaranteed up to some write granularity
(typically up to 32k to 64k, although pretty much all database
pages today are 16k). This is actively fielded as
customer-available products and/or in development in at least
two first-party cloud database products based on MySQL and/or
Postgres; and there are some active patches which John Garry
has been working on so that users can use this technique
without having to rely on first party cloud product teams
knowing implementation details of their cloud block devices.
(This has been discussed in past LSF/MM sessions.)
> If, when read back, the sentinels match, the whole database
> block is good to go. If they do not, then the block is torn
> and recovery is necessary.
Are there some database teams that are actively working on a scheme
based on WRITE SAME? I have talked to open source developers on the
MySQL and Postgres teams, as well as the first party cloud product
teams at my company and some storage architects at competitor cloud
companies, and no one has mentioned any efforts involving WRITE SAME.
Of course, maybe I simply haven't come across such plans, especially
if they are under some deep, dark NDA. :-)
However, given that support for WRITE SAME is fairly rare (like
DIF/DIX it's only available if you are willing to pay $$$$ for your
storage, because it's a specialized feature that storage vendors like
to charge a lot for), I'm a bit surprised that there are database
groups that would be interested in relying on such a feature, since
it tends not to be commonly available.
If there are real-world potential users, go wild, but at least for the
use cases and databases that I'm aware of, the FALLOC_FL_WRITE_ZEROS
and atomic writes patch series (it's really untorn writes but we seem
to have lost that naming battle) is all that we need.
Cheers,
- Ted
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 17:30 ` Theodore Ts'o
@ 2025-01-16 22:11 ` Martin K. Petersen
0 siblings, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2025-01-16 22:11 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Chuck Lever, Christoph Hellwig, Dave Chinner, Anna Schumaker,
lsf-pc, linux-fsdevel, Linux NFS Mailing List
Hi Ted!
> * DIF/DIX (although this is super expensive, so this has fallen out
> of favor)
Several cloud providers use T10 PI-capable storage in their backend. The
interface is rarely exposed to customers, though.
> * In-line checksums in the database block; this approach is fairly
> common for enterprise databases
Yep.
Also note that DIX/T10 PI are intended to prevent writing corrupted
buffers or misdirected data to media. I.e. at WRITE time. Neither DIX,
nor T10 PI offer any torn write guarantees. That's what the dedicated
atomic write operations are for (and those do support PI).
In-line application block checksums are a solution for the problem of
determining whether a database block read from media is intact. I.e.
in-line checksums are effective at READ time.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-16 15:45 ` Chuck Lever
2025-01-16 17:30 ` Theodore Ts'o
@ 2025-01-16 21:54 ` Martin K. Petersen
1 sibling, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2025-01-16 21:54 UTC (permalink / raw)
To: Chuck Lever via Lsf-pc
Cc: Theodore Ts'o, Chuck Lever, Christoph Hellwig, Dave Chinner,
Anna Schumaker, linux-fsdevel, Linux NFS Mailing List
Hi Chuck!
> The purpose of WRITE_SAME is to demark the database blocks with
> sentinels on each end of the database block containing a time
> stamp or hash.
SCSI WRITE SAME writes a contiguous range of logical blocks. Each block
will be filled with the contents of the single logical block data buffer
provided as payload.
So with SCSI WRITE SAME it's not possible to write a 512-byte sentinel,
followed by 15KB of zeroes, followed by a 512-byte sentinel in a single
operation. You'd have to do a 16KB WRITE SAME with a zeroed payload
followed by two individual WRITEs for the sentinels. Or fill the
entire 16KB application block with the same repeating 512-byte pattern.
I'm not familiar with NFS v4.2 WRITE SAME. But it sounds like it allows
the application to define a block larger than the logical block size of
the underlying storage. Is that correct?
If so, there would not be a direct mapping between NFS WRITE SAME and
SCSI ditto. As Christoph pointed out, NVMe doesn't have WRITE SAME. And
we removed support in the block layer a while back.
That doesn't prevent implementing WRITE SAME capability in NFS, of
course. It just sounds like the NFS semantics are different enough that
aligning to SCSI is not applicable.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-14 21:38 [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ? Anna Schumaker
2025-01-14 23:14 ` Dave Chinner
@ 2025-01-15 2:10 ` Darrick J. Wong
2025-01-15 14:24 ` Jeff Layton
` (2 subsequent siblings)
4 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2025-01-15 2:10 UTC (permalink / raw)
To: Anna Schumaker; +Cc: lsf-pc, linux-fsdevel, Linux NFS Mailing List
On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1]
> operation over the last few months [2][3] to accelerate writing
> patterns of data on the server, so it's been in the back of my mind
> for a future project. I'll need to write some code somewhere so NFS &
> NFSD can handle this request. I could keep any implementation internal
> to NFS / NFSD, but I'd like to find out if local filesystems would
> find this sort of feature useful and if I should put it in the VFS
> instead.
It would help to know more about what exactly write same does on NFS.
Is it like scsi's where you can pass a buffer and it'll write the same
buffer over and over across the device?
> I was thinking I could keep it simple, and model a function call based
> on write(3) / pwrite(3) to write some pattern N times starting at
> either the file's current offset or at a user-provide offset.
> Something like:
> write_pattern(int filedes, const void *pattern, size_t nbytes, size_t count);
> pwrite_pattern(int filedes, const void *pattern, size_t nbytes, size_t count, off_t offset);
So yeah, it sounds similar. Assuming nbytes is the size of *pattern,
and offset/count are the range to be pwritten?
> I could then construct a WRITE_SAME call in the NFS client using this
> information. This seems "good enough" to me for what people have asked
> for, at least as a client-side interface. It wouldn't really help the
> server, which would still need to do several writes in a loop to be
> spec-compliant with writing the pattern to an offset inside the
> "application data block" [4] structure.
I disagree, I think you just volunteered to plumb this pattern writing
all the way through to the block layer. ;)
> But maybe I'm simplifying this too much, and others would find the
> additional application data block fields useful? Or should I keep it
> all inside NFS, and call it with an ioctl instead of putting it into
> the VFS?
io_uring subcommand?
But I'd want to know more about what people want to use this for.
Assuming you don't just hook up FALLOC_FL_ZERO_RANGE to it and call it a
day. :)
--D
> Thoughts?
> Anna
>
> [1]: https://datatracker.ietf.org/doc/html/rfc7862#section-15.12
> [2]: https://lore.kernel.org/linux-nfs/CAAvCNcByQhbxh9aq_z7GfHx+_=S8zVcr9-04zzdRVLpLbhxxSg@mail.gmail.com/
> [3]: https://lore.kernel.org/linux-nfs/CALWcw=Gg33HWRLCrj9QLXMPME=pnuZx_tE4+Pw8gwutQM4M=vw@mail.gmail.com/
> [4]: https://datatracker.ietf.org/doc/html/rfc7862#section-8.1
>
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-14 21:38 [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ? Anna Schumaker
2025-01-14 23:14 ` Dave Chinner
2025-01-15 2:10 ` Darrick J. Wong
@ 2025-01-15 14:24 ` Jeff Layton
2025-01-15 15:06 ` Matthew Wilcox
2025-01-16 5:40 ` Christoph Hellwig
4 siblings, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2025-01-15 14:24 UTC (permalink / raw)
To: Anna Schumaker, lsf-pc; +Cc: linux-fsdevel, Linux NFS Mailing List
On Tue, 2025-01-14 at 16:38 -0500, Anna Schumaker wrote:
> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1]
> operation over the last few months [2][3] to accelerate writing
> patterns of data on the server, so it's been in the back of my mind
> for a future project. I'll need to write some code somewhere so NFS &
> NFSD can handle this request. I could keep any implementation
> internal to NFS / NFSD, but I'd like to find out if local filesystems
> would find this sort of feature useful and if I should put it in the
> VFS instead.
>
> I was thinking I could keep it simple, and model a function call
> based on write(3) / pwrite(3) to write some pattern N times starting
> at either the file's current offset or at a user-provided offset.
> Something like:
> write_pattern(int filedes, const void *pattern, size_t nbytes, size_t count);
> pwrite_pattern(int filedes, const void *pattern, size_t nbytes, size_t count, off_t offset);
>
These should also get flags fields, for sure.
> I could then construct a WRITE_SAME call in the NFS client using
> this information. This seems "good enough" to me for what people
> have asked for, at least as a client-side interface. It wouldn't
> really help the server, which would still need to do several writes
> in a loop to be spec-compliant with writing the pattern to an offset
> inside the "application data block" [4] structure.
>
> But maybe I'm simplifying this too much, and others would find the
> additional application data block fields useful? Or should I keep it
> all inside NFS, and call it with an ioctl instead of putting it into
> the VFS?
>
> Thoughts?
> Anna
>
> [1]: https://datatracker.ietf.org/doc/html/rfc7862#section-15.12
> [2]: https://lore.kernel.org/linux-nfs/CAAvCNcByQhbxh9aq_z7GfHx+_=S8zVcr9-04zzdRVLpLbhxxSg@mail.gmail.com/
> [3]: https://lore.kernel.org/linux-nfs/CALWcw=Gg33HWRLCrj9QLXMPME=pnuZx_tE4+Pw8gwutQM4M=vw@mail.gmail.com/
> [4]: https://datatracker.ietf.org/doc/html/rfc7862#section-8.1
>
I'd say keep it as an ioctl for now until we have at least one other
filesystem (smb/client? fs/ceph?) that can implement this natively
somehow.
My worry here is that we would build these new syscalls, and then find
that other filesystems need subtly different semantics for this, and
then we have to scramble to shoehorn those in later.
Are there already existing ioctls on other filesystems for similar
operations?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-14 21:38 [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ? Anna Schumaker
` (2 preceding siblings ...)
2025-01-15 14:24 ` Jeff Layton
@ 2025-01-15 15:06 ` Matthew Wilcox
2025-01-15 15:31 ` Chuck Lever
2025-01-16 5:40 ` Christoph Hellwig
4 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2025-01-15 15:06 UTC (permalink / raw)
To: Anna Schumaker; +Cc: lsf-pc, linux-fsdevel, Linux NFS Mailing List
On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1]
> operation over the last few months [2][3] to accelerate writing
> patterns of data on the server, so it's been in the back of my mind
> for a future project. I'll need to write some code somewhere so NFS &
> NFSD can handle this request. I could keep any implementation
> internal to NFS / NFSD, but I'd like to find out if local filesystems
> would find this sort of feature useful and if I should put it in the
> VFS instead.
I think we need more information. I read over the [2] and [3] threads
and the spec. It _seems like_ the intent in the spec is to expose the
underlying SCSI WRITE SAME command over NFS, but at least one other
response in this thread has been to design an all-singing, all-dancing
superset that can write arbitrary sized blocks to arbitrary locations
in every file on every filesystem, and I think we're going to design
ourselves into an awful implementation if we do that.
Can we confirm with the people who actually want to use this that all
they really want is to be able to do WRITE SAME as if they were on a
local disc, and then we can implement that in a matter of weeks instead
of taking a trip via Uranus.
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-15 15:06 ` Matthew Wilcox
@ 2025-01-15 15:31 ` Chuck Lever
2025-01-15 16:19 ` Matthew Wilcox
0 siblings, 1 reply; 18+ messages in thread
From: Chuck Lever @ 2025-01-15 15:31 UTC (permalink / raw)
To: Matthew Wilcox, Anna Schumaker
Cc: lsf-pc, linux-fsdevel, Linux NFS Mailing List
On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
>> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1]
>> operation over the last few months [2][3] to accelerate writing
>> patterns of data on the server, so it's been in the back of my mind
>> for a future project. I'll need to write some code somewhere so NFS &
>> NFSD can handle this request. I could keep any implementation
>> internal to NFS / NFSD, but I'd like to find out if local filesystems
>> would find this sort of feature useful and if I should put it in the
>> VFS instead.
>
> I think we need more information. I read over the [2] and [3] threads
> and the spec. It _seems like_ the intent in the spec is to expose the
> underlying SCSI WRITE SAME command over NFS, but at least one other
> response in this thread has been to design an all-singing, all-dancing
> superset that can write arbitrary sized blocks to arbitrary locations
> in every file on every filesystem, and I think we're going to design
> ourselves into an awful implementation if we do that.
>
> Can we confirm with the people who actually want to use this that all
> they really want is to be able to do WRITE SAME as if they were on a
> local disc, and then we can implement that in a matter of weeks instead
> of taking a trip via Uranus.
IME it's been very difficult to get such requesters to provide the
detail we need to build to their requirements. Providing them with a
limited prototype and letting them comment is likely the fastest way to
converge on something useful. Press the Easy Button, then evolve.
Trond has suggested starting with clone_file_range, providing it with a
pattern and then have the VFS or file system fill exponentially larger
segments of the file by replicating that pattern. The question is
whether to let consumers simply use that API as it is, or whether we
should provide some kind of generic infrastructure on top of it that
handles segment replication.
With my NFSD hat on, I would prefer to have the file version of "write
same" implemented outside of the NFS stack so that other consumers can
benefit from using the very same implementation. NFSD (and the NFS
client) should simply act as a conduit for these requests via the
NFSv4.2 WRITE_SAME operation.
I kinda like Dave's ideas too. Enabling offload will be critical to
making this feature efficient and thus valuable.
--
Chuck Lever
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-15 15:31 ` Chuck Lever
@ 2025-01-15 16:19 ` Matthew Wilcox
2025-01-15 18:20 ` Darrick J. Wong
2025-01-15 18:43 ` Chuck Lever
0 siblings, 2 replies; 18+ messages in thread
From: Matthew Wilcox @ 2025-01-15 16:19 UTC (permalink / raw)
To: Chuck Lever; +Cc: Anna Schumaker, lsf-pc, linux-fsdevel, Linux NFS Mailing List
On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
> On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> > On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> > > I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
> >
> > I think we need more information. I read over the [2] and [3] threads
> > and the spec. It _seems like_ the intent in the spec is to expose the
> > underlying SCSI WRITE SAME command over NFS, but at least one other
> > response in this thread has been to design an all-singing, all-dancing
> > superset that can write arbitrary sized blocks to arbitrary locations
> > in every file on every filesystem, and I think we're going to design
> > ourselves into an awful implementation if we do that.
> >
> > Can we confirm with the people who actually want to use this that all
> > they really want is to be able to do WRITE SAME as if they were on a
> > local disc, and then we can implement that in a matter of weeks instead
> > of taking a trip via Uranus.
>
> IME it's been very difficult to get such requesters to provide the
> detail we need to build to their requirements. Providing them with a
> limited prototype and letting them comment is likely the fastest way to
> converge on something useful. Press the Easy Button, then evolve.
>
> Trond has suggested starting with clone_file_range, providing it with a
> pattern and then have the VFS or file system fill exponentially larger
> segments of the file by replicating that pattern. The question is
> whether to let consumers simply use that API as it is, or whether we
> should provide some kind of generic infrastructure on top of it that
> handles segment replication.
>
> With my NFSD hat on, I would prefer to have the file version of "write
> same" implemented outside of the NFS stack so that other consumers can
> benefit from using the very same implementation. NFSD (and the NFS
> client) should simply act as a conduit for these requests via the
> NFSv4.2 WRITE_SAME operation.
>
> I kinda like Dave's ideas too. Enabling offload will be critical to
> making this feature efficient and thus valuable.
So I have some experience with designing an API like this one which may
prove either relevant or misleading.
We have bzero() and memset(). If you want to fill with a larger pattern
than a single byte, POSIX does not provide. Various people have proposed
extensions, eg
https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
But what people really want is the ability to use the x86 rep
stosw/stosl/stosq instructions. And so in Linux we now have
memset16/memset32/memset64/memset_l/memset_p which will map to one
of those hardware calls. Sure, we could implement memfill() and then
specialcase 2/4/8 byte implementations, but nobody actually wants to
use that.
So what API actually makes sense to provide? I suggest an ioctl,
implemented at the VFS layer:
struct write_same {
	loff_t pos;	/* Where to start writing */
	size_t len;	/* Length of memory pointed to by buf */
	char *buf;	/* Pattern to fill with */
};

ioctl(fd, FIWRITESAME, struct write_same *arg)
'pos' must be block size aligned.
'len' must be a power of two, or 0. If 0, fill with zeroes.
If len is shorter than the block size of the file, the kernel
replicates the pattern in 'buf' within the single block. If len
is larger than block size, we're doing a multi-block WRITE_SAME.
We can implement this for block devices and any filesystem that
cares. The kernel will have to shoot down any page cache, just
like for PUNCH_HOLE and similar.
For a prototype, we can implement this in the NFS client, then hoist it
to the VFS once the users have actually agreed this serves their needs.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-15 16:19 ` Matthew Wilcox
@ 2025-01-15 18:20 ` Darrick J. Wong
2025-01-15 18:43 ` Chuck Lever
1 sibling, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2025-01-15 18:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chuck Lever, Anna Schumaker, lsf-pc, linux-fsdevel,
Linux NFS Mailing List
On Wed, Jan 15, 2025 at 04:19:28PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
> > On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> > > On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> > > > I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
> > >
> > > I think we need more information. I read over the [2] and [3] threads
> > > and the spec. It _seems like_ the intent in the spec is to expose the
> > > underlying SCSI WRITE SAME command over NFS, but at least one other
> > > response in this thread has been to design an all-singing, all-dancing
> > > superset that can write arbitrary sized blocks to arbitrary locations
> > > in every file on every filesystem, and I think we're going to design
> > > ourselves into an awful implementation if we do that.
> > >
> > > Can we confirm with the people who actually want to use this that all
> > > they really want is to be able to do WRITE SAME as if they were on a
> > > local disc, and then we can implement that in a matter of weeks instead
> > > of taking a trip via Uranus.
> >
> > IME it's been very difficult to get such requesters to provide the
> > detail we need to build to their requirements. Providing them with a
> > limited prototype and letting them comment is likely the fastest way to
> > converge on something useful. Press the Easy Button, then evolve.
> >
> > Trond has suggested starting with clone_file_range, providing it with a
> > pattern and then have the VFS or file system fill exponentially larger
> > segments of the file by replicating that pattern. The question is
> > whether to let consumers simply use that API as it is, or whether we
> > should provide some kind of generic infrastructure on top of it that
> > handles segment replication.
> >
> > With my NFSD hat on, I would prefer to have the file version of "write
> > same" implemented outside of the NFS stack so that other consumers can
> > benefit from using the very same implementation. NFSD (and the NFS
> > client) should simply act as a conduit for these requests via the
> > NFSv4.2 WRITE_SAME operation.
> >
> > I kinda like Dave's ideas too. Enabling offload will be critical to
> > making this feature efficient and thus valuable.
>
> So I have some experience with designing an API like this one which may
> prove either relevant or misleading.
>
> We have bzero() and memset(). If you want to fill with a larger pattern
> than a single byte, POSIX does not provide. Various people have proposed
> extensions, eg
> https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
>
> But what people really want is the ability to use the x86 rep
> stosw/stosl/stosq instructions. And so in Linux we now have
> memset16/memset32/memset64/memset_l/memset_p which will map to one
> of those hardware calls. Sure, we could implement memfill() and then
> specialcase 2/4/8 byte implementations, but nobody actually wants to
> use that.
>
>
> So what API actually makes sense to provide? I suggest an ioctl,
> implemented at the VFS layer:
>
> struct write_same {
> loff_t pos; /* Where to start writing */
You probably need at least a:
u64 count; /* Number of bytes to write */
Since I think the point is that you write the len bytes in 'buf' to
the file/disk over and over again until count bytes have been written,
correct?
> size_t len; /* Length of memory pointed to by buf */
(and maybe call this buflen)
--D
> char *buf; /* Pattern to fill with */
> };
>
> ioctl(fd, FIWRITESAME, struct write_same *arg)
>
> 'pos' must be block size aligned.
> 'len' must be a power of two, or 0. If 0, fill with zeroes.
> If len is shorter than the block size of the file, the kernel
> replicates the pattern in 'buf' within the single block. If len
> is larger than block size, we're doing a multi-block WRITE_SAME.
>
> We can implement this for block devices and any filesystem that
> cares. The kernel will have to shoot down any page cache, just
> like for PUNCH_HOLE and similar.
>
>
> For a prototype, we can implement this in the NFS client, then hoist it
> to the VFS once the users have actually agreed this serves their needs.
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-15 16:19 ` Matthew Wilcox
2025-01-15 18:20 ` Darrick J. Wong
@ 2025-01-15 18:43 ` Chuck Lever
1 sibling, 0 replies; 18+ messages in thread
From: Chuck Lever @ 2025-01-15 18:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Anna Schumaker, lsf-pc, linux-fsdevel, Linux NFS Mailing List
On 1/15/25 11:19 AM, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
>> On 1/15/25 10:06 AM, Matthew Wilcox wrote:
>>> On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
>>>> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
>>>
>>> I think we need more information. I read over the [2] and [3] threads
>>> and the spec. It _seems like_ the intent in the spec is to expose the
>>> underlying SCSI WRITE SAME command over NFS, but at least one other
>>> response in this thread has been to design an all-singing, all-dancing
>>> superset that can write arbitrary sized blocks to arbitrary locations
>>> in every file on every filesystem, and I think we're going to design
>>> ourselves into an awful implementation if we do that.
>>>
>>> Can we confirm with the people who actually want to use this that all
>>> they really want is to be able to do WRITE SAME as if they were on a
>>> local disc, and then we can implement that in a matter of weeks instead
>>> of taking a trip via Uranus.
>>
>> IME it's been very difficult to get such requesters to provide the
>> detail we need to build to their requirements. Providing them with a
>> limited prototype and letting them comment is likely the fastest way to
>> converge on something useful. Press the Easy Button, then evolve.
>>
>> Trond has suggested starting with clone_file_range, providing it with a
>> pattern and then have the VFS or file system fill exponentially larger
>> segments of the file by replicating that pattern. The question is
>> whether to let consumers simply use that API as it is, or whether we
>> should provide some kind of generic infrastructure on top of it that
>> handles segment replication.
>>
>> With my NFSD hat on, I would prefer to have the file version of "write
>> same" implemented outside of the NFS stack so that other consumers can
>> benefit from using the very same implementation. NFSD (and the NFS
>> client) should simply act as a conduit for these requests via the
>> NFSv4.2 WRITE_SAME operation.
>>
>> I kinda like Dave's ideas too. Enabling offload will be critical to
>> making this feature efficient and thus valuable.
>
> So I have some experience with designing an API like this one which may
> prove either relevant or misleading.
>
> We have bzero() and memset(). If you want to fill with a larger pattern
> than a single byte, POSIX does not provide. Various people have proposed
> extensions, eg
> https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
>
> But what people really want is the ability to use the x86 rep
> stosw/stosl/stosq instructions. And so in Linux we now have
> memset16/memset32/memset64/memset_l/memset_p which will map to one
> of those hardware calls. Sure, we could implement memfill() and then
> specialcase 2/4/8 byte implementations, but nobody actually wants to
> use that.
>
>
> So what API actually makes sense to provide? I suggest an ioctl,
> implemented at the VFS layer:
>
> struct write_same {
> 	loff_t pos;	/* Where to start writing */
> 	size_t len;	/* Length of memory pointed to by buf */
> 	char *buf;	/* Pattern to fill with */
> };
>
> ioctl(fd, FIWRITESAME, struct write_same *arg)
This might be a controversial opinion, but a new ioctl() seems OK to me.
> 'pos' must be block size aligned.
> 'len' must be a power of two, or 0. If 0, fill with zeroes.
> If len is shorter than the block size of the file, the kernel
> replicates the pattern in 'buf' within the single block. If len
> is larger than block size, we're doing a multi-block WRITE_SAME.
NFS WRITE_SAME has no alignment restrictions that I'm aware of. Also, I
think it allows the pattern to comb through a file, writing, say, every
other byte, and leaving the unwritten bytes unchanged.
The Win32 API has a similar facility with no alignment restrictions
and the ability to comb; in addition it does not seem to set a limit
on the size of the pattern.
So, if we start with a simple struct write_same, I would say we want to
provide some API extensibility guarantees, or simply agree that this
form of the API will exist only as a prototype.
Fwiw, use cases here are typically databases that want to quickly
initialize files that will store tables. The head and tail of each
application data block (ADB) are sentinels for detecting torn writes,
and the middle segment is typically zeroes or a poison pattern.
> We can implement this for block devices and any filesystem that
> cares. The kernel will have to shoot down any page cache, just
> like for PUNCH_HOLE and similar.
>
>
> For a prototype, we can implement this in the NFS client, then hoist it
> to the VFS once the users have actually agreed this serves their needs.
To be clear, NFSD also needs to handle WRITE_SAME. Would the
prototype server handle that using clone_file_range?
--
Chuck Lever
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
2025-01-14 21:38 [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ? Anna Schumaker
` (3 preceding siblings ...)
2025-01-15 15:06 ` Matthew Wilcox
@ 2025-01-16 5:40 ` Christoph Hellwig
4 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-01-16 5:40 UTC (permalink / raw)
To: Anna Schumaker; +Cc: lsf-pc, linux-fsdevel, Linux NFS Mailing List
On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
Well, that's one actual but not very detailed request, and one
question from a poster who just asks random questions on the list all
the time for no good reason.
If you care about it, prototype first, check that it is feasible, and
see if it gives the expected results. After that you can report back
with the findings and have an architectural discussion. But unless
that gets stuck you should easily be able to do that on the list
instead of wasting meeting slots on hand-wavy stuff.
^ permalink raw reply [flat|nested] 18+ messages in thread