From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Uday Shankar <ushankar@purestorage.com>
Subject: Re: [PATCH 3/4] ublk: add feature UBLK_F_AUTO_ZERO_COPY
Date: Sun, 27 Apr 2025 11:15:27 +0800 [thread overview]
Message-ID: <aA2hT7t7eShRPFbh@fedora> (raw)
In-Reply-To: <CADUfDZr6ezr=1_qTOm0xUJ1YfXVH_z57V4fNnQm6D-856T6A_g@mail.gmail.com>
On Sat, Apr 26, 2025 at 08:09:38PM -0700, Caleb Sander Mateos wrote:
> On Sat, Apr 26, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Sat, Apr 26, 2025 at 03:42:59PM -0700, Caleb Sander Mateos wrote:
> > > On Sat, Apr 26, 2025 at 2:41 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > UBLK_F_SUPPORT_ZERO_COPY requires ublk server to issue explicit buffer
> > > > register/unregister uring_cmd for each IO, this way is not only inefficient,
> > > > but also introduce dependency between buffer consumer and buffer register/
> > > > unregister uring_cmd, please see tools/testing/selftests/ublk/stripe.c
> > > > in which backing file IO has to be issued one by one by IOSQE_IO_LINK.
> > >
> > > This is a great idea!
> > >
> > > >
> > > > Add feature UBLK_F_AUTO_ZERO_COPY for addressing the existed zero copy
> > > > limit:
> > >
> > > nit: "existed" -> "existing", "limit" -> "limitation"
> > >
> > > >
> > > > - register request buffer automatically before delivering io command to
> > > > ublk server
> > > >
> > > > - unregister request buffer automatically when completing the request
> > > >
> > > > - io_uring will unregister the buffer automatically when uring is
> > > > exiting, so we needn't worry about accident exit
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > > drivers/block/ublk_drv.c | 87 +++++++++++++++++++++++++++--------
> > > > include/uapi/linux/ublk_cmd.h | 20 ++++++++
> > > > 2 files changed, 89 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 347790b3a633..453767f91431 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -64,7 +64,8 @@
> > > > | UBLK_F_CMD_IOCTL_ENCODE \
> > > > | UBLK_F_USER_COPY \
> > > > | UBLK_F_ZONED \
> > > > - | UBLK_F_USER_RECOVERY_FAIL_IO)
> > > > + | UBLK_F_USER_RECOVERY_FAIL_IO \
> > > > + | UBLK_F_AUTO_ZERO_COPY)
> > > >
> > > > #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> > > > | UBLK_F_USER_RECOVERY_REISSUE \
> > > > @@ -131,6 +132,14 @@ struct ublk_uring_cmd_pdu {
> > > > */
> > > > #define UBLK_IO_FLAG_NEED_GET_DATA 0x08
> > > >
> > > > +/*
> > > > + * request buffer is registered automatically, so we have to unregister it
> > > > + * before completing this request.
> > > > + *
> > > > + * io_uring will unregister buffer automatically for us during exiting.
> > > > + */
> > > > +#define UBLK_IO_FLAG_UNREG_BUF 0x10
> > > > +
> > > > /* atomic RW with ubq->cancel_lock */
> > > > #define UBLK_IO_FLAG_CANCELED 0x80000000
> > > >
> > > > @@ -207,7 +216,8 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
> > > > int tag);
> > > > static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
> > > > {
> > > > - return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
> > > > + return ub->dev_info.flags & (UBLK_F_USER_COPY |
> > > > + UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_ZERO_COPY);
> > > > }
> > > >
> > > > static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
> > > > @@ -614,6 +624,11 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
> > > > return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
> > > > }
> > > >
> > > > +static inline bool ublk_support_auto_zc(const struct ublk_queue *ubq)
> > > > +{
> > > > + return ubq->flags & UBLK_F_AUTO_ZERO_COPY;
> > > > +}
> > > > +
> > > > static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
> > > > {
> > > > return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
> > > > @@ -621,7 +636,7 @@ static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
> > > >
> > > > static inline bool ublk_need_map_io(const struct ublk_queue *ubq)
> > > > {
> > > > - return !ublk_support_user_copy(ubq);
> > > > + return !ublk_support_user_copy(ubq) && !ublk_support_auto_zc(ubq);
> > > > }
> > > >
> > > > static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
> > > > @@ -629,17 +644,21 @@ static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
> > > > /*
> > > > * read()/write() is involved in user copy, so request reference
> > > > * has to be grabbed
> > > > + *
> > > > + * For auto zc, ublk server still may issue UBLK_IO_COMMIT_AND_FETCH_REQ
> > > > + * before one registered buffer is used up, so reference is
> > > > + * required too.
> > > > */
> > > > - return ublk_support_user_copy(ubq);
> > > > + return ublk_support_user_copy(ubq) || ublk_support_auto_zc(ubq);
> > >
> > > Since both places where ublk_support_user_copy() is used are being
> > > adjusted to also check ublk_support_auto_zc(), maybe
> > > ublk_support_user_copy() should just be changed to check
> > > UBLK_F_AUTO_ZERO_COPY too?
> >
> > I think that isn't good, we should have let each flag to cover its feature only.
> >
> > That said it was not good to add zero copy check in both ublk_support_user_copy()
> > and ublk_dev_is_user_copy().
>
> Fair point.
>
> >
> > If ublk server needs user copy, it should have enabled it explicitly.
> >
> > >
> > > > }
> > > >
> > > > static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
> > > > - struct request *req)
> > > > + struct request *req, int init_ref)
> > > > {
> > > > if (ublk_need_req_ref(ubq)) {
> > > > struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
> > > >
> > > > - kref_init(&data->ref);
> > > > + refcount_set(&data->ref.refcount, init_ref);
> > >
> > > It might be nicer not to mix kref and raw reference count operations.
> > > Maybe you could add a prep patch that switches from struct kref to
> > > refcount_t?
> >
> > That is fine, or probably kref need to provide one variant of __kref_init(nr).
> >
> > >
> > > > }
> > > > }
> > > >
> > > > @@ -667,6 +686,15 @@ static inline void ublk_put_req_ref(const struct ublk_queue *ubq,
> > > > }
> > > > }
> > > >
> > > > +/* for ublk zero copy */
> > > > +static void ublk_io_release(void *priv)
> > > > +{
> > > > + struct request *rq = priv;
> > > > + struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > > > +
> > > > + ublk_put_req_ref(ubq, rq);
> > > > +}
> > > > +
> > > > static inline bool ublk_need_get_data(const struct ublk_queue *ubq)
> > > > {
> > > > return ubq->flags & UBLK_F_NEED_GET_DATA;
> > > > @@ -1231,7 +1259,22 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
> > > > mapped_bytes >> 9;
> > > > }
> > > >
> > > > - ublk_init_req_ref(ubq, req);
> > > > + if (ublk_support_auto_zc(ubq) && ublk_rq_has_data(req)) {
> > > > + int ret;
> > > > +
> > > > + /* one extra reference is dropped by ublk_io_release */
> > > > + ublk_init_req_ref(ubq, req, 2);
> > > > + ret = io_buffer_register_bvec(io->cmd, req, ublk_io_release,
> > > > + tag, issue_flags);
> > >
> > > Using the ublk request tag as the registered buffer index may be too
> > > limiting. It would cause collisions if there are multiple ublk devices
> > > or queues handled on a single io_uring. It also requires offsetting
> > > any registered user buffers so their indices come after all the ublk
> > > ones, which may be difficult if ublk devices are added dynamically.
> > > Perhaps the UBLK_IO_FETCH_REQ operation could specify the registered
> > > buffer index to use for the request?
> > >
> > > This also requires the io_uring issuing the zero-copy I/Os to be the
> > > same as the one submitting the ublk fetch requests. That would also
> > > make it difficult for us to adopt for our ublk server, which uses
> > > separate io_urings. Not sure if there is a simple way the ublk server
> > > could specify what io_uring to register the ublk zero-copy buffers
> > > with.
> >
> > I think it can be done by passing `io_uring fd` and buffer 'index' via
> > uring_cmd_header, there is still one u64 (->addr) left for zero copy,
> > then the buffer's `io_uring fd/fixed_fd` and 'index' can be provided
> > to io_uring core for registering buffer, this way should be flexible
> > enough for covering any case.
> >
> > >
> > > > + if (ret) {
> > > > + blk_mq_end_request(req, BLK_STS_IOERR);
> > > > + return;
> > >
> > > Does this leak the ublk fetch request? Seems like it should be
> >
> > It won't leak ublk uring_cmd which is just for delivering ublk
> > io command to ublk server.
>
> But where does the uring_cmd complete? The early return means this
> will skip the call to ubq_complete_io_cmd(). Won't that leave the
> uring_cmd stuck until the io_uring exits?
It is like nothing coming from /dev/ublkbN, we always submitted
one uring_cmd beforehand, when no one knows there will be request
available now or future.
>
> >
> > > completed to the ublk server with an error code.
> >
> > It could be hard for ublk server to deal with, I'd suggest to
> > start with one simple implementation first.
> >
> > It is usually one bug, and user will get notified from client's
> > failure, then complain and ublk server gets fixed.
> >
> > >
> > > > + }
> > > > + io->flags |= UBLK_IO_FLAG_UNREG_BUF;
> > > > + } else {
> > > > + ublk_init_req_ref(ubq, req, 1);
> > > > + }
> > > > +
> > > > ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags);
> > > > }
> > > >
> > > > @@ -1593,7 +1636,8 @@ static int ublk_ch_mmap(struct file *filp, struct vm_area_struct *vma)
> > > > }
> > > >
> > > > static void ublk_commit_completion(struct ublk_device *ub,
> > > > - const struct ublksrv_io_cmd *ub_cmd)
> > > > + const struct ublksrv_io_cmd *ub_cmd,
> > > > + unsigned int issue_flags)
> > > > {
> > > > u32 qid = ub_cmd->q_id, tag = ub_cmd->tag;
> > > > struct ublk_queue *ubq = ublk_get_queue(ub, qid);
> > > > @@ -1604,6 +1648,15 @@ static void ublk_commit_completion(struct ublk_device *ub,
> > > > io->flags &= ~UBLK_IO_FLAG_OWNED_BY_SRV;
> > > > io->res = ub_cmd->result;
> > > >
> > > > + if (io->flags & UBLK_IO_FLAG_UNREG_BUF) {
> > > > + int ret;
> > > > +
> > > > + ret = io_buffer_unregister_bvec(io->cmd, tag, issue_flags);
> > > > + if (ret)
> > > > + io->res = ret;
> > >
> > > I think it would be confusing to report an error to the application
> > > submitting the I/O if unregistration fails. The only scenario where it
> > > seems possible for this to fail is if userspace issues an
> > > IORING_REGISTER_BUFFERS_UPDATE that overwrites the registered buffer
> > > slot while the ublk request is being handled by the ublk server. I
> > > would either ignore any error from io_buffer_unregister_bvec() or
> > > return it to the ublk server.
> >
> > Fair enough, maybe WARN_ON_ONCE() is enough.
>
> I think it's technically triggerable from userspace, but certainly
> very unexpected.
Yeah, I agree.
Thanks,
Ming
next prev parent reply other threads:[~2025-04-27 3:15 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-26 9:41 [PATCH 0/4] ublk: two fixes and support UBLK_F_AUTO_ZERO_COPY Ming Lei
2025-04-26 9:41 ` [PATCH 1/4] selftests: ublk: fix UBLK_F_NEED_GET_DATA Ming Lei
2025-04-26 20:15 ` Caleb Sander Mateos
2025-04-27 1:26 ` Ming Lei
2025-04-26 9:41 ` [PATCH 2/4] ublk: enhance check for register/unregister io buffer command Ming Lei
2025-04-26 20:38 ` Caleb Sander Mateos
2025-04-27 1:37 ` Ming Lei
2025-04-27 3:14 ` Caleb Sander Mateos
2025-04-27 3:49 ` Ming Lei
2025-04-26 9:41 ` [PATCH 3/4] ublk: add feature UBLK_F_AUTO_ZERO_COPY Ming Lei
2025-04-26 22:42 ` Caleb Sander Mateos
2025-04-27 2:06 ` Ming Lei
2025-04-27 3:09 ` Caleb Sander Mateos
2025-04-27 3:15 ` Ming Lei [this message]
2025-04-27 2:34 ` Keith Busch
2025-04-27 3:10 ` Ming Lei
2025-04-27 4:04 ` Keith Busch
2025-04-27 7:32 ` Ming Lei
2025-04-26 9:41 ` [PATCH 4/4] selftests: ublk: support UBLK_F_AUTO_ZERO_COPY Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aA2hT7t7eShRPFbh@fedora \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=csander@purestorage.com \
--cc=linux-block@vger.kernel.org \
--cc=ushankar@purestorage.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.