From: Chuck Lever <chuck.lever@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
Date: Wed, 15 Jan 2025 13:43:14 -0500 [thread overview]
Message-ID: <fa80c96b-91e6-408d-8ada-751a992d677b@oracle.com> (raw)
In-Reply-To: <Z4fgENA-045TFLOh@casper.infradead.org>
On 1/15/25 11:19 AM, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
>> On 1/15/25 10:06 AM, Matthew Wilcox wrote:
>>> On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
>>>> I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
>>>
>>> I think we need more information. I read over the [2] and [3] threads
>>> and the spec. It _seems like_ the intent in the spec is to expose the
>>> underlying SCSI WRITE SAME command over NFS, but at least one other
>>> response in this thread has been to design an all-singing, all-dancing
>>> superset that can write arbitrary sized blocks to arbitrary locations
>>> in every file on every filesystem, and I think we're going to design
>>> ourselves into an awful implementation if we do that.
>>>
>>> Can we confirm with the people who actually want to use this that all
>>> they really want is to be able to do WRITE SAME as if they were on a
>>> local disc, and then we can implement that in a matter of weeks instead
>>> of taking a trip via Uranus.
>>
>> IME it's been very difficult to get such requesters to provide the
>> detail we need to build to their requirements. Providing them with a
>> limited prototype and letting them comment is likely the fastest way to
>> converge on something useful. Press the Easy Button, then evolve.
>>
>> Trond has suggested starting with clone_file_range, providing it with a
>> pattern and then have the VFS or file system fill exponentially larger
>> segments of the file by replicating that pattern. The question is
>> whether to let consumers simply use that API as it is, or whether we
>> should provide some generic segment-replication infrastructure on top
>> of it.
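A rough userspace model of that doubling approach, with copy_file_range()
standing in for clone_file_range (the function names and the exact loop
structure here are my assumptions for illustration, not a settled design):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write the pattern once, then repeatedly copy the already-initialized
 * prefix onto the uninitialized tail, doubling the filled region each
 * pass.  copy_file_range() stands in for the clone_file_range-based
 * mechanism under discussion.  'total' is assumed to be a multiple of
 * 'patlen'.  Returns 0 on success, -1 on error. */
static int doubling_fill(const char *path, const void *pat,
			 size_t patlen, size_t total)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	size_t filled;

	if (fd < 0)
		return -1;
	if (pwrite(fd, pat, patlen, 0) != (ssize_t)patlen)
		goto fail;
	filled = patlen;
	while (filled < total) {
		/* Copy at most the initialized prefix, clamped to EOF. */
		size_t chunk = filled < total - filled ?
			       filled : total - filled;
		off_t in = 0, out = filled;

		while (chunk > 0) {
			ssize_t n = copy_file_range(fd, &in, fd,
						    &out, chunk, 0);
			if (n <= 0)
				goto fail;
			chunk -= (size_t)n;
		}
		filled = (size_t)out;
	}
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}

/* Verify every patlen-sized chunk of the file matches the pattern. */
static int check_fill(const char *path, const void *pat,
		      size_t patlen, size_t total)
{
	unsigned char buf[4096];
	int fd = open(path, O_RDONLY);
	size_t off;
	int ok = 1;

	if (fd < 0)
		return 0;
	for (off = 0; off < total && ok; off += patlen) {
		if (pread(fd, buf, patlen, off) != (ssize_t)patlen ||
		    memcmp(buf, pat, patlen) != 0)
			ok = 0;
	}
	close(fd);
	return ok;
}
```

Only O(log n) copy calls are needed for an n-byte fill, which is the
appeal of the doubling scheme.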
>>
>> With my NFSD hat on, I would prefer to have the file version of "write
>> same" implemented outside of the NFS stack so that other consumers can
>> benefit from using the very same implementation. NFSD (and the NFS
>> client) should simply act as a conduit for these requests via the
>> NFSv4.2 WRITE_SAME operation.
>>
>> I kinda like Dave's ideas too. Enabling offload will be critical to
>> making this feature efficient and thus valuable.
>
> So I have some experience with designing an API like this one which may
> prove either relevant or misleading.
>
> We have bzero() and memset(). If you want to fill with a larger pattern
> than a single byte, POSIX does not provide one. Various people have
> proposed extensions, e.g.
> https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
>
> But what people really want is the ability to use the x86 rep
> stosw/stosl/stosq instructions. And so in Linux we now have
> memset16/memset32/memset64/memset_l/memset_p which will map to one
> of those instructions. Sure, we could implement memfill() and then
> specialcase 2/4/8 byte implementations, but nobody actually wants to
> use that.
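For concreteness, here is one shape the memfill() being dismissed above
could take, with the 2/4/8-byte special cases standing in for the
kernel's memset16/32/64 (a hypothetical sketch, not a real kernel or
libc interface):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill dst with repetitions of pat.  Only the widths that map onto
 * cheap wide stores are accepted; everything else is refused, which
 * is roughly the point being made above: nobody wants the general
 * case.  Assumes patlen divides dstlen. */
static int memfill(void *dst, size_t dstlen,
		   const void *pat, size_t patlen)
{
	size_t i, n;

	if (patlen == 0 || dstlen % patlen)
		return -1;
	n = dstlen / patlen;

	switch (patlen) {
	case 1:
		memset(dst, *(const uint8_t *)pat, dstlen);
		return 0;
	case 2: {
		uint16_t v;
		memcpy(&v, pat, sizeof(v));
		for (i = 0; i < n; i++)		/* ~ memset16() */
			memcpy((uint8_t *)dst + 2 * i, &v, 2);
		return 0;
	}
	case 4: {
		uint32_t v;
		memcpy(&v, pat, sizeof(v));
		for (i = 0; i < n; i++)		/* ~ memset32() */
			memcpy((uint8_t *)dst + 4 * i, &v, 4);
		return 0;
	}
	case 8: {
		uint64_t v;
		memcpy(&v, pat, sizeof(v));
		for (i = 0; i < n; i++)		/* ~ memset64() */
			memcpy((uint8_t *)dst + 8 * i, &v, 8);
		return 0;
	}
	default:
		return -1;	/* the general case nobody asked for */
	}
}

/* Quick self-check. */
static int memfill_selftest(void)
{
	uint8_t buf[32];
	uint8_t pat[4] = {0xde, 0xad, 0xbe, 0xef};
	size_t i;

	if (memfill(buf, sizeof(buf), pat, sizeof(pat)) != 0)
		return 0;
	for (i = 0; i < sizeof(buf); i++)
		if (buf[i] != pat[i % 4])
			return 0;
	return memfill(buf, sizeof(buf), pat, 3) == -1;
}
```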
>
>
> So what API actually makes sense to provide? I suggest an ioctl,
> implemented at the VFS layer:
>
> struct write_same {
> 	loff_t pos;	/* Where to start writing */
> 	size_t len;	/* Length of memory pointed to by buf */
> 	char *buf;	/* Pattern to fill with */
> };
>
> ioctl(fd, FIWRITESAME, struct write_same *arg)
This might be a controversial opinion, but a new ioctl() seems OK to me.
> 'pos' must be block size aligned.
> 'len' must be a power of two, or 0. If 0, fill with zeroes.
> If len is shorter than the block size of the file, the kernel
> replicates the pattern in 'buf' within the single block. If len
> is larger than block size, we're doing a multi-block WRITE_SAME.
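In userspace terms, the single-block replication rule described above
amounts to something like the following (a sketch only, with an assumed
block size; not proposed kernel code):

```c
#include <stddef.h>
#include <string.h>

/* Replicate a power-of-two-sized pattern across one block, the way
 * the proposed FIWRITESAME would fill a single filesystem block when
 * len < block size.  Uses doubling memcpy so the copy length grows
 * geometrically rather than one pattern at a time. */
static void pattern_fill_block(void *block, size_t blocksize,
			       const void *pat, size_t patlen)
{
	size_t filled = patlen;

	memcpy(block, pat, patlen);
	while (filled < blocksize) {
		size_t n = filled < blocksize - filled ?
			   filled : blocksize - filled;

		memcpy((char *)block + filled, block, n);
		filled += n;
	}
}

/* Self-check against an assumed 4 KiB block size. */
static int pattern_fill_selftest(void)
{
	unsigned char pat[8] = {1, 2, 3, 4, 5, 6, 7, 8};
	unsigned char blk[4096];
	size_t i;

	pattern_fill_block(blk, sizeof(blk), pat, sizeof(pat));
	for (i = 0; i < sizeof(blk); i++)
		if (blk[i] != pat[i % 8])
			return 0;
	return 1;
}
```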
NFS WRITE_SAME has no alignment restrictions that I'm aware of. Also, I
think it allows the pattern to comb through a file, writing, say, every
other byte, and leaving the unwritten bytes unchanged.
The Win32 API has a similar facility with no alignment restrictions and
the ability to comb; it also does not appear to limit the size of the
pattern.
So, if we start with a simple struct write_same, I would say we want to
provide some API extensibility guarantees, or simply agree that this
form of the API will exist only as a prototype.
Fwiw, the use cases here are typically databases that want to quickly
initialize files that will store tables. The head and tail of each
ADB (application data block, in RFC 7862 terms) are sentinels for
detecting torn writes, and the middle segment is typically zeroes or a
poison pattern.
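As a concrete (and entirely made-up) example of that layout, a database
page initializer might do something like this; the 4 KiB page size and
8-byte sentinel are assumptions for the example, not anything the
protocol requires:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Sentinel values at the head and tail for torn-write detection,
 * zero-filled (or poison-filled) middle. */
static void init_page(unsigned char *page, uint64_t sentinel)
{
	memset(page, 0, PAGE_SIZE);
	memcpy(page, &sentinel, sizeof(sentinel));
	memcpy(page + PAGE_SIZE - sizeof(sentinel),
	       &sentinel, sizeof(sentinel));
}

/* Self-check: sentinels present at both ends, zeroes in the middle. */
static int init_page_selftest(void)
{
	unsigned char page[PAGE_SIZE];
	uint64_t s = 0x5a5a5a5a5a5a5a5aULL, head, tail;

	init_page(page, s);
	memcpy(&head, page, sizeof(head));
	memcpy(&tail, page + PAGE_SIZE - sizeof(tail), sizeof(tail));
	return head == s && tail == s && page[PAGE_SIZE / 2] == 0;
}
```

This is the pattern that a file-granular "write same" would let the
database hand to the filesystem in one request instead of writing every
page itself.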
> We can implement this for block devices and any filesystem that
> cares. The kernel will have to shoot down any page cache, just
> like for PUNCH_HOLE and similar.
>
>
> For a prototype, we can implement this in the NFS client, then hoist it
> to the VFS once the users have actually agreed this serves their needs.
To be clear, NFSD also needs to handle WRITE_SAME. Would the
prototype server handle that using clone_file_range?
--
Chuck Lever