From: Omar Sandoval <osandov@osandov.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
Date: Fri, 6 Sep 2019 14:27:10 -0700 [thread overview]
Message-ID: <20190906212710.GI7452@vader> (raw)
In-Reply-To: <20190906210717.GN7777@dread.disaster.area>
On Sat, Sep 07, 2019 at 07:07:17AM +1000, Dave Chinner wrote:
> On Fri, Sep 06, 2019 at 11:19:49AM -0700, Omar Sandoval wrote:
> > On Thu, Sep 05, 2019 at 12:10:12PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > >
> > > > This adds an API for writing compressed data directly to the filesystem.
> > > > The use case that I have in mind is send/receive: currently, when
> > > > sending data from one compressed filesystem to another, the sending side
> > > > decompresses the data and the receiving side recompresses it before
> > > > writing it out. This is wasteful and can be avoided if we can just send
> > > > and write compressed extents. The send part will be implemented in a
> > > > separate series, as this ioctl can stand alone.
> > > >
> > > > The interface is essentially pwrite(2) with some extra information:
> > > >
> > > > - The input buffer contains the compressed data.
> > > > - Both the compressed and decompressed sizes of the data are given.
> > > > - The compression type (zlib, lzo, or zstd) is given.
> >
> > Hi, Dave,
> >
> > > So why can't you do this with pwritev2()? Heaps of flags, and
> > > use a second iovec to hold the decompressed size of the previous
> > > iovec. i.e.
> > >
> > > iov[0].iov_base = compressed_data;
> > > iov[0].iov_len = compressed_size;
> > > iov[1].iov_base = NULL;
> > > iov[1].iov_len = uncompressed_size;
> > > pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
> > >
> > > And you don't need to reinvent pwritev() with some whacky ioctl that
> > > is bound to be completely screwed up is ways not noticed until
> > > someone else tries to use it...
> >
> > This is a good suggestion, thanks. I hadn't considered (ab?)using iovecs
> > in this way.
>
> Yeah, it is a bit of API abuse to pass per-iovec context in the next
> iovec, but ISTR it being proposed in past times for other
> mechanisms. I think it's far better than a whole new filesystem
> private ioctl interface and structure to do what is effectively
> direct IO...
>
> > One modification I'd make would be to put the encoding into the second
> > iovec and use a single RWF_ENCODED flag so that we don't have to keep
> > stealing from RWF_* every time we add a new compression
> > algorithm/encryption type/whatever:
> >
> > iov[0].iov_base = compressed_data;
> > iov[0].iov_len = compressed_size;
> > iov[1].iov_base = (void *)IOV_ENCODING_ZLIB;
> > iov[1].iov_len = uncompressed_size;
> > pwritev2(fd, iov, 2, offset, RWF_ENCODED);
> >
> > Making every other iovec a metadata iovec in this way would be a major
> > pain to plumb through the iov_iter and VFS code, though. Instead, we
> > could put the metadata in iov[0] and the encoded data in iov[1..iovcnt -
> > 1]:
> >
> > iov[0].iov_base = (void *)IOV_ENCODING_ZLIB;
> > iov[0].iov_len = unencoded_len;
> > iov[1].iov_base = encoded_data1;
> > iov[1].iov_len = encoded_size1;
> > iov[2].iov_base = encoded_data2;
> > iov[2].iov_len = encoded_size2;
> > pwritev2(fd, iov, 3, offset, RWF_ENCODED);
> >
> > In my opinion, these are both reasonable interfaces. The former allows
> > the user to write multiple encoded "extents" at once, while the latter
> > allows writing a single encoded extent from scattered buffers. The
> > latter is much simpler to implement ;) Thoughts?
>
> Both reasonable, and I have no real concern about how it is done as
> long as the format is well documented and works for both read and
> write.
>
> The only other thing I think we need to be careful of is that
> interface works with AIO (via the RWF flag) and the new uioring async
> interface - I think thw RWF flag is all that is needed there). I
> think that's another good reason for taking the preadv2/pwritev2
> path, as that should all largely just work with the right iocb
> frobbing in the syscall context...
A symmetric interface for preadv2 would look something like this:
iov[1].iov_base = encoded_data1;
iov[1].iov_len = encoded_size1;
iov[2].iov_base = encoded_data2;
iov[2].iov_len = encoded_size2;
preadv2(fd, iov, 3, offset, RWF_ENCODED);
/*
* iov[0].iov_base gets filled in with the encoding flags,
* iov[0].iov_len gets filled in with unencoded length.
*/
But, iov is passed as a const struct iovec *, so it'd be nasty to write
to it in the RWF_ENCODED case. Maybe we actually want to pass the
encoding information through an extra indirection. Something along the
lines of this for writes:
struct encoded_rw {
size_t unencoded_len;
int compression;
int encryption;
...
};
struct encoded_rw encoded = {
unencoded_len,
ENCODED_RW_ZLIB,
};
iov[0].iov_base = &encoded;
iov[0].iov_len = sizeof(encoded);
iov[1].iov_base = encoded_data1;
iov[1].iov_len = encoded_size1;
iov[2].iov_base = encoded_data2;
iov[2].iov_len = encoded_size2;
pwritev2(fd, iov, 3, offset, RWF_ENCODED);
And similar for reads:
struct encoded_rw encoded;
iov[0].iov_base = &encoded;
iov[0].iov_len = sizeof(encoded);
iov[1].iov_base = encoded_data1;
iov[1].iov_len = encoded_size1;
iov[2].iov_base = encoded_data2;
iov[2].iov_len = encoded_size2;
preadv2(fd, iov, 3, offset, RWF_ENCODED);
/* encoded gets filled in with the encoding information. */
I'll draft something with this interface.
next prev parent reply other threads:[~2019-09-06 21:27 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-04 19:13 [PATCH 0/2] Btrfs: add interface for writing compressed extents directly Omar Sandoval
2019-09-04 19:13 ` [PATCH 1/2] fs: export rw_verify_area() Omar Sandoval
2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
2019-09-05 2:10 ` Dave Chinner
2019-09-05 12:16 ` Johannes Thumshirn
2019-09-05 12:51 ` Johannes Thumshirn
2019-09-06 7:46 ` Dave Chinner
2019-09-06 18:19 ` Omar Sandoval
2019-09-06 21:07 ` Dave Chinner
2019-09-06 21:27 ` Omar Sandoval [this message]
2019-09-05 10:33 ` Nikolay Borisov
2019-09-19 6:14 ` Omar Sandoval
2019-09-19 7:46 ` Nikolay Borisov
2019-09-19 7:59 ` Omar Sandoval
2019-09-24 10:29 ` Nikolay Borisov
2019-09-10 11:39 ` Filipe Manana
2019-09-19 6:23 ` Omar Sandoval
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190906212710.GI7452@vader \
--to=osandov@osandov.com \
--cc=david@fromorbit.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.