From: "Darrick J. Wong" <djwong@kernel.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
Anna Schumaker <anna.schumaker@oracle.com>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?
Date: Wed, 15 Jan 2025 10:20:02 -0800 [thread overview]
Message-ID: <20250115182002.GG3561231@frogsfrogsfrogs> (raw)
In-Reply-To: <Z4fgENA-045TFLOh@casper.infradead.org>
On Wed, Jan 15, 2025 at 04:19:28PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
> > On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> > > On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> > > > I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
> > >
> > > I think we need more information. I read over the [2] and [3] threads
> > > and the spec. It _seems like_ the intent in the spec is to expose the
> > > underlying SCSI WRITE SAME command over NFS, but at least one other
> > > response in this thread has been to design an all-singing, all-dancing
> > > superset that can write arbitrary sized blocks to arbitrary locations
> > > in every file on every filesystem, and I think we're going to design
> > > ourselves into an awful implementation if we do that.
> > >
> > > Can we confirm with the people who actually want to use this that all
> > > they really want is to be able to do WRITE SAME as if they were on a
> > > local disc, and then we can implement that in a matter of weeks instead
> > > of taking a trip via Uranus.
> >
> > IME it's been very difficult to get such requesters to provide the
> > detail we need to build to their requirements. Providing them with a
> > limited prototype and letting them comment is likely the fastest way to
> > converge on something useful. Press the Easy Button, then evolve.
> >
> > Trond has suggested starting with clone_file_range, providing it with a
> > pattern and then have the VFS or file system fill exponentially larger
> > segments of the file by replicating that pattern. The question is
> > whether to let consumers use that API as is, or whether we should
> > provide some generic infrastructure on top of it that handles the
> > segment replication.
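
The exponential fill Trond describes is the classic doubling trick: write the
pattern once, then repeatedly copy the already-filled prefix onto the tail,
doubling the filled region each pass. A userspace sketch of the idea in plain
memory rather than via clone_file_range (fill_doubling is a made-up helper
name, purely illustrative):

```c
#include <stddef.h>
#include <string.h>

/*
 * Fill dst[0..dstlen) with repeated copies of pat[0..patlen) by
 * exponential doubling: one initial copy of the pattern, then
 * O(log(dstlen/patlen)) memcpy() calls. Assumes dstlen is a
 * multiple of patlen.
 */
static void fill_doubling(char *dst, size_t dstlen,
			  const char *pat, size_t patlen)
{
	size_t filled = patlen;

	memcpy(dst, pat, patlen);
	while (filled < dstlen) {
		size_t chunk = filled;

		/* Last pass may be a partial doubling. */
		if (chunk > dstlen - filled)
			chunk = dstlen - filled;
		memcpy(dst + filled, dst, chunk);
		filled += chunk;
	}
}
```

With clone_file_range the memcpy() would instead be a reflink of the
already-written extent, so the file server replicates data without the
client sending it over the wire.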
> >
> > With my NFSD hat on, I would prefer to have the file version of "write
> > same" implemented outside of the NFS stack so that other consumers can
> > benefit from using the very same implementation. NFSD (and the NFS
> > client) should simply act as a conduit for these requests via the
> > NFSv4.2 WRITE_SAME operation.
> >
> > I kinda like Dave's ideas too. Enabling offload will be critical to
> > making this feature efficient and thus valuable.
>
> So I have some experience with designing an API like this one which may
> prove either relevant or misleading.
>
> We have bzero() and memset(). If you want to fill with a larger pattern
> than a single byte, POSIX does not provide one. Various people have
> proposed extensions, e.g.
> https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
>
> But what people really want is the ability to use the x86 rep
> stosw/stosl/stosq instructions. And so in Linux we now have
> memset16/memset32/memset64/memset_l/memset_p which map to one of
> those instructions. Sure, we could implement memfill() and then
> specialcase 2/4/8 byte implementations, but nobody actually wants to
> use that.
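
To illustrate the point, here is a portable userspace sketch of what such a
memfill() dispatch might look like. memset32_sketch is a plain loop standing
in for the kernel's arch-optimized memset32; the kernel versions and the
memfill name are not claimed to match any existing userspace API, and the
4-byte path assumes a suitably aligned destination:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-loop stand-in for the kernel's optimized memset32(). */
static void memset32_sketch(uint32_t *p, uint32_t v, size_t count)
{
	while (count--)
		*p++ = v;
}

/*
 * Hypothetical memfill(): special-case the pattern widths the
 * hardware can fill directly, fall back to byte-wise replication.
 * Assumes dstlen is a multiple of patlen.
 */
static void memfill(void *dst, size_t dstlen, const void *pat, size_t patlen)
{
	if (patlen == 1) {
		memset(dst, *(const uint8_t *)pat, dstlen);
	} else if (patlen == 4) {
		uint32_t v;

		memcpy(&v, pat, sizeof(v));
		memset32_sketch(dst, v, dstlen / 4);	/* needs aligned dst */
	} else {
		char *p = dst;

		while (dstlen >= patlen) {
			memcpy(p, pat, patlen);
			p += patlen;
			dstlen -= patlen;
		}
	}
}
```

The observation in the mail stands: callers who know their pattern width
would rather call the fixed-width helper directly than go through the
generic dispatcher.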
>
>
> So what API actually makes sense to provide? I suggest an ioctl,
> implemented at the VFS layer:
>
> struct write_same {
> 	loff_t pos;	/* Where to start writing */

You probably need at least a:

	u64 count;	/* Number of bytes to write */

Since I think the point is that you write buf[len] to the file/disk over
and over again until count bytes have been written, correct?

> 	size_t len;	/* Length of memory pointed to by buf */

(and maybe call this buflen)

--D

> 	char *buf;	/* Pattern to fill with */
> };
>
> ioctl(fd, FIWRITESAME, struct write_same *arg)
>
> 'pos' must be block size aligned.
> 'len' must be a power of two, or 0. If 0, fill with zeroes.
> If len is shorter than the block size of the file, the kernel
> replicates the pattern in 'buf' within the single block. If len
> is larger than block size, we're doing a multi-block WRITE_SAME.
>
> We can implement this for block devices and any filesystem that
> cares. The kernel will have to shoot down any page cache, just
> like for PUNCH_HOLE and similar.
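
To make the sub-block case concrete: with a 4096-byte block and len = 512,
the kernel would tile buf across each block eight times. A sketch of that
tiling, assuming len is a power of two no larger than the block size
(tile_pattern is a hypothetical helper, and FIWRITESAME is only the ioctl
proposed above, not an existing interface):

```c
#include <stddef.h>
#include <string.h>

/*
 * Expand a pattern into one filesystem block, as the proposed
 * FIWRITESAME would for len <= block size. len must be 0 or a
 * power of two dividing blocksize; len == 0 means fill with zeroes.
 */
static void tile_pattern(char *block, size_t blocksize,
			 const char *buf, size_t len)
{
	size_t off;

	if (len == 0) {
		memset(block, 0, blocksize);
		return;
	}
	for (off = 0; off < blocksize; off += len)
		memcpy(block + off, buf, len);
}
```

The power-of-two restriction is what keeps this cheap: every block has an
identical image, so the filesystem can build the block once and issue it
(or a hardware WRITE SAME) for the whole range.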
>
>
> For a prototype, we can implement this in the NFS client, then hoist it
> to the VFS once the users have actually agreed this serves their needs.
>