From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org, linux-block@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Bernd Schubert <bschubert@ddn.com>,
	Nitesh Shetty <nj.shetty@samsung.com>,
	Christoph Hellwig <hch@lst.de>,
	Ziyang Zhang <ZiyangZhang@linux.alibaba.com>,
	ming.lei@redhat.com
Subject: Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
Date: Sun, 12 Feb 2023 11:22:42 +0800
Message-ID: <Y+hbggDCm9wViPAv@T590>
In-Reply-To: <22772531-bf55-f610-be93-3d53c9ce1c6d@kernel.dk>

On Sat, Feb 11, 2023 at 09:52:58AM -0700, Jens Axboe wrote:
> On 2/11/23 9:12 AM, Ming Lei wrote:
> > On Sat, Feb 11, 2023 at 08:45:18AM -0700, Jens Axboe wrote:
> >> On 2/10/23 8:32 AM, Ming Lei wrote:
> >>> IORING_OP_READ_SPLICE_BUF: read to buffer which is built from
> >>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> >>> for building buffer.
> >>>
> >>> IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from
> >>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> >>> for building buffer.
> >>>
> >>> The typical use case is for supporting ublk/fuse io_uring zero copy,
> >>> and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe
> >>> from device->read_splice(), then READ/WRITE can be done to/from this
> >>> buffer directly.
> >>
> >> Main question here - would this be better not plumbed up through the rw
> >> path? Might be cleaner, even if it either requires a bit of helper
> >> refactoring or accepting a bit of duplication. But would still be better
> >> than polluting the rw fast path imho.
> > 
> > The buffer is actually an IO buffer, which has to be plumbed up in the
> > IO path, and it can't be handled like a registered buffer.
> > 
> > The only effect on the fast path is:
> > 
> > 	if (io_rw_splice_buf(req))	/* only checks the opcode */
> > 		return io_prep_rw_splice_buf(req, sqe);
> > 
> > and the cleanup code, which is only run for the two new OPs.
> > 
> > Or maybe I misunderstand your point? Or any detailed suggestion?
> > 
> > Actually the code should be factored into generic helpers, since net.c
> > needs to use them too. Probably they need to move to rsrc.c?
> 
> Yep, just refactoring out those bits as a prep thing. rsrc could work,
> or perhaps a new file for that.

OK.

> 
> >> Also seems like this should be separately testable. We can't add new
> >> opcodes that don't have a feature test at least, and should also have
> >> various corner case tests. A bit of commenting outside of this below.
> > 
> > OK, I will write/add a very simple ublk userspace program to liburing
> > for test purposes.
> 
> Thanks!

Thinking about this further: if we use ublk for liburing tests, root is
often needed. We do support unprivileged mode, but that needs an
administrator to grant access, so is it still good to do so?

It could be easier to add ->splice_read() to /dev/zero for test
purposes: just allocate zeroed pages in ->splice_read() and add them
to the pipe, like ublk's ->splice_read(). The sink side can then read
from or write to these pages, and zero's read_iter_zero() won't be
affected. Normal splice/tee won't connect to zero either, because we
only allow this for in-kernel use.

> 
> >>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> >>> index 5238ecd7af6a..91e8d8f96134 100644
> >>> --- a/io_uring/opdef.c
> >>> +++ b/io_uring/opdef.c
> >>> @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
> >>>  		.prep			= io_eopnotsupp_prep,
> >>>  #endif
> >>>  	},
> >>> +	[IORING_OP_READ_SPLICE_BUF] = {
> >>> +		.needs_file		= 1,
> >>> +		.unbound_nonreg_file	= 1,
> >>> +		.pollin			= 1,
> >>> +		.plug			= 1,
> >>> +		.audit_skip		= 1,
> >>> +		.ioprio			= 1,
> >>> +		.iopoll			= 1,
> >>> +		.iopoll_queue		= 1,
> >>> +		.prep			= io_prep_rw,
> >>> +		.issue			= io_read,
> >>> +	},
> >>> +	[IORING_OP_WRITE_SPLICE_BUF] = {
> >>> +		.needs_file		= 1,
> >>> +		.hash_reg_file		= 1,
> >>> +		.unbound_nonreg_file	= 1,
> >>> +		.pollout		= 1,
> >>> +		.plug			= 1,
> >>> +		.audit_skip		= 1,
> >>> +		.ioprio			= 1,
> >>> +		.iopoll			= 1,
> >>> +		.iopoll_queue		= 1,
> >>> +		.prep			= io_prep_rw,
> >>> +		.issue			= io_write,
> >>> +	},
> >>
> >> Are these really safe with iopoll?
> > 
> > Yeah, after the buffer is built, the handling is basically
> > same with IORING_OP_WRITE_FIXED, so I think it is safe.
> 
> Yeah, on a second look, as these are just using the normal read/write
> path after that should be fine indeed.
> 
> >>
> >>> +static int io_prep_rw_splice_buf(struct io_kiocb *req,
> >>> +				 const struct io_uring_sqe *sqe)
> >>> +{
> >>> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> >>> +	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
> >>> +	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
> >>> +	struct io_rw_splice_buf_data data;
> >>> +	struct io_mapped_ubuf *imu;
> >>> +	struct fd splice_fd;
> >>> +	int ret;
> >>> +
> >>> +	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
> >>> +	if (!splice_fd.file)
> >>> +		return -EBADF;
> >>
> >> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> >> io_file_get_normal() for the non-fixed case in case someone passed in an
> >> io_uring fd.
> > 
> > SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> > we can use sqe->addr3, I think it is doable.
> 
> I haven't checked the rest, but you can't just use ->splice_flags for
> this?

->splice_flags shares memory with ->rw_flags, so it can't be used.

I think it is fine to use ->addr3, given that io_getxattr()/io_setxattr()/
io_msg_ring() already use it.
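
To illustrate the aliasing problem, here is a minimal userspace sketch. It assumes a heavily simplified model of the SQE: `struct mock_sqe`, `mock_flags_overlap()` and `mock_fill()` are hypothetical names, not the real `struct io_uring_sqe` or any kernel helper. It only shows why a read/write opcode cannot carry splice flags in a word that overlaps the rw flags, and why a separate word such as addr3 avoids the clash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, simplified model of an SQE. This is NOT the real
 * struct io_uring_sqe; it only mirrors the property discussed above:
 * rw_flags and splice_flags share one union, addr3 is a separate word.
 */
struct mock_sqe {
	union {
		int32_t  rw_flags;	/* used by read/write opcodes */
		uint32_t splice_flags;	/* used by splice/tee opcodes */
	};
	uint64_t addr3;			/* separate word, free for extra flags */
};

/* Returns 1 when both flag fields occupy the same bytes. */
static int mock_flags_overlap(void)
{
	return offsetof(struct mock_sqe, rw_flags) ==
	       offsetof(struct mock_sqe, splice_flags);
}

/* Store rw flags and extra flags so that neither clobbers the other. */
static void mock_fill(struct mock_sqe *sqe, int32_t rw, uint64_t extra)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->rw_flags = rw;	/* also visible through splice_flags */
	sqe->addr3 = extra;	/* independent of the union above */
}
```

In this model, anything written to rw_flags is also what a reader of splice_flags sees, so an opcode that already consumes the rw flags has to take its extra flag bits from another field.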

> 
> In any case, the get path needs to look like io_tee() here, and:
> 
> >>> +out_put_fd:
> >>> +	if (splice_fd.file)
> >>> +		fdput(splice_fd);
> 
> this put needs to be gated on whether it's a fixed file or not.

Yeah.
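
The get/put symmetry being asked for can be sketched in userspace as follows. This is an assumption-laden model, not the io_tee() code: `struct mock_file`, `MOCK_FD_IN_FIXED`, `mock_get()` and `mock_put()` are hypothetical stand-ins, with `MOCK_FD_IN_FIXED` playing the role of SPLICE_F_FD_IN_FIXED. In this model a fixed file takes no per-request reference, so the put must be skipped for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file with a reference count. */
struct mock_file {
	int refs;
};

/* Hypothetical flag; plays the role of SPLICE_F_FD_IN_FIXED. */
#define MOCK_FD_IN_FIXED 0x1u

/* Get path: only a normal (non-fixed) file takes a reference here. */
static struct mock_file *mock_get(struct mock_file *f, unsigned int flags)
{
	if (!f)
		return NULL;
	if (!(flags & MOCK_FD_IN_FIXED))
		f->refs++;
	return f;
}

/* Put path: gated on the same flag, so get and put always pair up. */
static void mock_put(struct mock_file *f, unsigned int flags)
{
	if (f && !(flags & MOCK_FD_IN_FIXED))
		f->refs--;
}
```

An unconditional put would drop a reference that the fixed-file get never took, which is the imbalance the review comment points at.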

> 
> >> If the operation is done, clear NEED_CLEANUP and do the cleanup here?
> >> That'll be faster.
> > 
> > The buffer has to be cleaned up after the req is completed, since the
> > bvec table is needed for the bio, and the page references need to be
> > dropped after IO is done too.
> 
> I mean when you clear that flag, call the cleanup bits you otherwise
> would've called on later cleanup.

Got it.
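
A minimal userspace sketch of that pattern, under stated assumptions: `struct mock_req`, `MOCK_F_NEED_CLEANUP`, `mock_complete()` and `mock_teardown()` are hypothetical names that model a request carrying a NEED_CLEANUP-style flag. They are not the io_uring implementation. The point is that the completion path clears the flag and runs the cleanup itself, so the later generic teardown finds nothing left to do.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag; models a NEED_CLEANUP-style request flag. */
#define MOCK_F_NEED_CLEANUP 0x1u

struct mock_req {
	unsigned int flags;
	void *bvec;	/* models the buffer built at prep time */
	int cleanups;	/* counts how often cleanup actually ran */
};

/* The cleanup work itself: release the buffer exactly once. */
static void mock_cleanup(struct mock_req *req)
{
	free(req->bvec);
	req->bvec = NULL;
	req->cleanups++;
}

/* Completion path: the operation is done, so clean up right away. */
static void mock_complete(struct mock_req *req)
{
	if (req->flags & MOCK_F_NEED_CLEANUP) {
		req->flags &= ~MOCK_F_NEED_CLEANUP;
		mock_cleanup(req);
	}
}

/* Generic teardown: only does work if the flag is still set. */
static void mock_teardown(struct mock_req *req)
{
	if (req->flags & MOCK_F_NEED_CLEANUP) {
		req->flags &= ~MOCK_F_NEED_CLEANUP;
		mock_cleanup(req);
	}
}
```

Clearing the flag and calling the cleanup together keeps the two paths from both releasing the buffer, while a request that never reaches the completion path is still covered by the teardown.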

Thanks,
Ming

