From: David Howells <dhowells@redhat.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: dhowells@redhat.com, Matthew Wilcox <willy@infradead.org>,
Dave Chinner <david@fromorbit.com>,
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
"Darrick J . Wong" <djwong@kernel.org>,
Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [RFC 2/2] iomap: Support subpage size dirty tracking to improve write performance
Date: Thu, 03 Nov 2022 14:51:10 +0000 [thread overview]
Message-ID: <7699.1667487070@warthog.procyon.org.uk> (raw)
In-Reply-To: <Y2IyTx0VwXMxzs0G@infradead.org>
Christoph Hellwig <hch@infradead.org> wrote:
> > filesystems right now. Dave Howells' netfs infrastructure is trying
> > to solve the problem for everyone (and he's been looking at iomap as
> > inspiration for what he's doing).
>
> Btw, I never understood why the network file systems don't just use
> iomap. There is nothing block specific in the core iomap code.
It creates and submits bio structs all over the place. This seems to
require a blockdev.
Anyway, netfs lib supports, or hopefully will support in the future, the
following:
(1) Fscache. netfslib will construct a read you're asking for from cached
data and data from the server and stitch them together (where a folio may
comprise pieces from more than one source), and then write the bits it
read from the server out to the cache... And handle content encryption
for you such that the data stored in the cache is content-encrypted.
On writeback, the dirty data must be written to both the cache (if you
have one) and the server (if you're not in disconnected operation).
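The stitching can be pictured as walking the requested range granule by granule and deciding, for each one, whether it comes from the cache or the server, with a new subrequest starting wherever the source changes. A toy userspace sketch of that planning step (only the two-source model comes from the description above; the names and the one-bit-per-granule presence map are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

enum source { FROM_CACHE, FROM_SERVER };

/* Toy presence map: one entry per granule, nonzero if the cache
 * already holds that granule.  Fill in a per-granule plan and return
 * how many contiguous subrequests the read splits into. */
static size_t plan_read(const int *cached, size_t ngranules,
			enum source *plan)
{
	size_t i, nsubreqs = 0;

	for (i = 0; i < ngranules; i++) {
		plan[i] = cached[i] ? FROM_CACHE : FROM_SERVER;
		/* A new subrequest starts whenever the source changes. */
		if (i == 0 || plan[i] != plan[i - 1])
			nsubreqs++;
	}
	return nsubreqs;
}
```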
(2) Disconnected operation. netfslib will, in the future, handle storing
data and changes in the cache and then sync'ing on reconnection of an
object.
(3) I want to hand persistent (for the life of an op) iov_iters to the
filesystem so that the filesystem can, if it wants to, pass these to
kernel_sendmsg() and kernel_recvmsg() at the bottom.
The aim is to get knowledge of pages out of the network filesystem
entirely. A network filesystem would then provide two basic hooks to the
server: an async direct read and an async direct write. netfslib will use
these to access the pagecache on behalf of the filesystem.
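A minimal sketch of what such a pair of hooks might look like (the struct and function names here are hypothetical stand-ins, not the actual netfslib API):

```c
#include <stddef.h>

/* Hypothetical subrequest describing one direct I/O against the server. */
struct netfs_io_subreq {
	long long start;	/* byte offset in the file */
	size_t len;		/* number of bytes to transfer */
	void *buf;		/* in the real design this would be a
				 * persistent iov_iter over the pagecache,
				 * not a raw buffer */
};

/* The two basic hooks a network filesystem would provide: async direct
 * read and async direct write.  Names are illustrative only. */
struct netfs_fs_ops {
	int (*direct_read)(struct netfs_io_subreq *subreq);
	int (*direct_write)(struct netfs_io_subreq *subreq);
};

/* Stub implementations standing in for a filesystem's RPC layer. */
static int demo_read(struct netfs_io_subreq *subreq)  { (void)subreq; return 0; }
static int demo_write(struct netfs_io_subreq *subreq) { (void)subreq; return 0; }

static const struct netfs_fs_ops demo_ops = {
	.direct_read  = demo_read,
	.direct_write = demo_write,
};
```

The point of the shape is that knowledge of pages stays on the library side; the filesystem only ever sees a range and an iterator to feed its transport.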
(4) Reads and writes might want to/need to be non-block-size aligned. If we
have a byte-range file lock, for example, or if we have a max block size
(eg. rsize/wsize) set that's not a multiple of 512, say.
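With a wsize that isn't a multiple of 512 (65500 bytes, say, which some servers advertise), a write has to be chunked on boundaries that don't line up with any block size. The chunking arithmetic, as a rough userspace illustration (the values are made up):

```c
#include <stddef.h>

/* Split a byte range into server-sized pieces.  wsize need not be a
 * multiple of 512, so the piece boundaries are not block-aligned. */
static size_t count_chunks(long long start, size_t len, size_t wsize)
{
	size_t n = 0;

	while (len > 0) {
		size_t part = len < wsize ? len : wsize;

		len -= part;
		start += part;
		n++;
	}
	return n;
}
```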
(5) Compressed I/O. You get back more data than you asked for and you want
to paste the rest into the pagecache (if buffered) or discard it (if
DIO). Further, to make this work on write, we may need to hold on to
pages on the sides of the one we modified to make sure we keep the right
size blob of data to recompress and send back.
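Keeping the right-sized blob means expanding the dirty byte range out to whole compression-block boundaries before recompressing. Something like this, where the compression-block size is an arbitrary example value:

```c
/* Expand a dirty byte range [*start, *start + *len) out to whole
 * compression-block boundaries so the blob can be recompressed. */
static void expand_to_cblock(long long *start, long long *len, long long cblock)
{
	long long end = *start + *len;

	*start -= *start % cblock;               /* round start down */
	end += (cblock - end % cblock) % cblock; /* round end up */
	*len = end - *start;
}
```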
(6) Larger cache block granularity. One thing I want to explore is the
ability to have blocks in the cache that are larger than PAGE_SIZE. If I
can't use the backing filesystem's knowledge of holes in a file, then I
have to store my own metadata (ie. effectively build a filesystem on top
of a filesystem). To reduce the amount of metadata that I need, I can
make the cache granule size larger.
In both 5 and 6, netfslib gets to tell the VM layer to increase the size
of the blob in readahead() - and then may have to forcibly keep the pages
surrounding the page of interest if it gets modified in order to be able
to write to the cache correctly, depending on how much integrity I want
to try and keep in the cache.
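The metadata saving in 6 is straightforward arithmetic: if presence is tracked with one bit per granule, quadrupling the granule size quarters the bitmap. A hypothetical illustration:

```c
/* Bits of presence metadata needed to track a file of file_size bytes
 * at one bit per cache granule of granule_size bytes. */
static long long presence_bits(long long file_size, long long granule_size)
{
	return (file_size + granule_size - 1) / granule_size;
}
```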
(7) Not-quite-direct-I/O. cifs, for example, has a number of variations on
read and write modes that are kind of but not quite direct I/O.
David
Thread overview: 27+ messages
2022-10-28 4:30 [RFC 0/2] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
2022-10-28 4:30 ` [RFC 1/2] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
2022-10-28 16:31 ` Darrick J. Wong
2022-10-29 3:09 ` Ritesh Harjani (IBM)
2022-10-28 4:30 ` [RFC 2/2] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
2022-10-28 12:42 ` Matthew Wilcox
2022-10-29 3:05 ` Ritesh Harjani (IBM)
2022-10-28 17:01 ` Darrick J. Wong
2022-10-28 18:15 ` Matthew Wilcox
2022-10-29 3:25 ` Ritesh Harjani (IBM)
2022-10-28 21:04 ` Dave Chinner
2022-10-30 3:27 ` Ritesh Harjani (IBM)
2022-10-30 22:31 ` Dave Chinner
2022-10-31 3:43 ` Matthew Wilcox
2022-10-31 7:08 ` Dave Chinner
2022-10-31 10:27 ` Matthew Wilcox
2022-11-02 8:57 ` Christoph Hellwig
2022-11-03 0:38 ` Dave Chinner
2022-11-02 9:03 ` Christoph Hellwig
2022-11-02 17:35 ` Darrick J. Wong
2022-11-04 7:27 ` Christoph Hellwig
2022-11-04 14:15 ` Ritesh Harjani (IBM)
2022-11-03 14:51 ` David Howells [this message]
2022-11-04 7:30 ` Christoph Hellwig
2022-11-07 13:03 ` David Howells
2022-11-03 14:12 ` David Howells
2022-11-04 11:28 ` Ritesh Harjani (IBM)