From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Uday Shankar <ushankar@purestorage.com>,
Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
Date: Tue, 26 Aug 2025 16:32:44 +0800 [thread overview]
Message-ID: <aK1xLM9TLUBxEmmC@fedora> (raw)
In-Reply-To: <CADUfDZp1-Tg+Gp+9kaudFNYyz3mLjNn+v=H3wKLEoKDftcX1wg@mail.gmail.com>
On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > When running test_stress_04.sh, the following warning is triggered:
> >
> > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> >
> > This happens when the daemon is abruptly killed:
> >
> > - some references may still be held, because registering IO buffer
> > doesn't grab ublk char device reference
>
> Ah, good observation. That's definitely a problem.
>
> >
> > OR
> >
> > - io->task_registered_buffers won't be cleared because io buffer is
> > released from non-daemon context
>
> I don't think the task_registered_buffers optimization is really
> involved here; that's just a different way of tracking the reference
> count. Regardless of what task the buffer is registered or
> unregistered on, the buffer still counts as 1 reference on the
> ublk_io. Summing up the reference counts and making sure they are both
> reset to 0 seems like a good approach to me.
The warning in ublk_queue_reinit() is triggered because
- the reference is not dropped
OR
- the io buf unregister is done from another task context(kernel wq), so
both io->ref and io->task_registered_buffers are not zero, which is started
from task_registered_buffers optimization
>
> >
> > For zero-copy and auto buffer register modes, I/O reference crosses
> > syscalls, so IO reference may not be dropped naturally when ublk server is
> > killed abruptly. However, when releasing io_uring context, it is guaranteed
> > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > io_ring_ctx_free().
> >
> > Fix this by adding ublk_drain_io_references() that:
> > - Waits for active I/O references dropped in async way by scheduling
> > work function, for avoiding ublk dev and io_uring file's release
> > dependency
> > - Reinitializes io->ref and io->task_registered_buffers to clean state
> >
> > This ensures the reference count state is clean when ublk_queue_reinit()
> > is called, preventing the warning and potential use-after-free.
>
> One scenario I worry about is if the ublk server has already issued
> UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> has registered buffer(s). It's possible the ublk server hasn't
> finished performing I/O to/from the registered buffers and so the I/O
> isn't really complete yet. But when io_uring automatically releases
> the registered buffers, the reference count will hit 0 and the ublk
> I/O will be completed successfully. There seems to be some data
> corruption risk in this scenario.
The final io buffer unregister is from freeing io_uring conext in
io_uring_release(), any user of this uring context has been done,
so it isn't possible that ublk server is performing io with the
un-registered buffer.
> But maybe it doesn't make sense for
> a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> before knowing whether the zero-copy I/Os succeeded?
>
> >
> > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 92 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 99abd67b708b..f608c7efdc05 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -239,6 +239,7 @@ struct ublk_device {
> > struct mutex cancel_mutex;
> > bool canceling;
> > pid_t ublksrv_tgid;
> > + struct delayed_work exit_work;
> > };
> >
> > /* header of ublk_params */
> > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > ublk_get_queue(ub, i)->canceling = canceling;
> > }
> >
> > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > {
> > - struct ublk_device *ub = filp->private_data;
> > + int i, j;
> > +
> > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > + UBLK_F_AUTO_BUF_REG)))
> > + return;
> > +
> > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > + for (j = 0; j < ubq->q_depth; j++) {
> > + struct ublk_io *io = &ubq->ios[j];
> > + /*
> > + * Reinitialize reference counting fields after
> > + * draining. This ensures clean state for queue
> > + * reinitialization.
> > + */
> > + refcount_set(&io->ref, 0);
> > + io->task_registered_buffers = 0;
> > + }
> > + }
> > +}
> > +
> > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > +{
> > + int i, j;
> > +
> > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > + UBLK_F_AUTO_BUF_REG)))
> > + return false;
> > +
> > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > + for (j = 0; j < ubq->q_depth; j++) {
> > + struct ublk_io *io = &ubq->ios[j];
> > + unsigned int refs = refcount_read(&io->ref) +
> > + io->task_registered_buffers;
> > +
> > + /*
> > + * UBLK_REFCOUNT_INIT or zero means no active
> > + * reference
> > + */
> > + if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > + return true;
>
> It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> UBLK_REFCOUNT_INIT active references. It would be safer to check
> UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> count needs to reach 0.
It is actually one invariant that the two's sum is zero or UBLK_REFCOUNT_INIT
any time if the io buffer isn't registered, so it is enough and
simpler.
>
> > + }
> > + }
> > + return false;
> > +}
> > +
> > +static void ublk_ch_release_work_fn(struct work_struct *work)
> > +{
> > + struct ublk_device *ub =
> > + container_of(work, struct ublk_device, exit_work.work);
> > struct gendisk *disk;
> > int i;
> >
> > + /*
> > + * For zero-copy and auto buffer register modes, I/O references
> > + * might not be dropped naturally when the daemon is killed, but
> > + * io_uring guarantees that registered bvec kernel buffers are
> > + * unregistered finally when freeing io_uring context, then the
> > + * active references are dropped.
> > + *
> > + * Wait until active references are dropped for avoiding use-after-free
> > + *
> > + * registered buffer may be unregistered in io_ring's release hander,
> > + * so have to wait by scheduling work function for avoiding the two
> > + * file release dependency.
> > + */
> > + if (ublk_has_io_with_active_ref(ub)) {
> > + schedule_delayed_work(&ub->exit_work, 1);
> > + return;
> > + }
> > +
> > + ublk_reset_io_ref(ub);
>
> Why the 2 separate loops over nr_hw_queues and q_depth? Could they be
> combined into a single nested loop that waits for each ublk_io's
> references to be released and then resets its reference counts to 0?
> Looks like the ub->dev_info.flags checks could also be consolidated.
This way looks more readable, otherwise ublk_has_io_with_active_ref()
has to check and reset. Not a problem, I can move it to one single helper.
>
> > +
> > /*
> > * disk isn't attached yet, either device isn't live, or it has
> > * been removed already, so we needn't to do anything
> > @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
> > ublk_reset_ch_dev(ub);
> > out:
> > clear_bit(UB_STATE_OPEN, &ub->state);
> > +
> > + /* put the reference grabbed in ublk_ch_release() */
> > + ublk_put_device(ub);
> > +}
> > +
> > +static int ublk_ch_release(struct inode *inode, struct file *filp)
> > +{
> > + struct ublk_device *ub = filp->private_data;
> > +
> > + /*
> > + * Grab ublk device reference, so it won't be gone until we are
> > + * really released from work function.
> > + */
> > + ub = ublk_get_device(ub);
>
> Can taking a reference fail here? If so, the NULL return value would
> need to be handled. If not, the "ub =" could be dropped.
Of course it has to get a reference, otherwise it isn't safe to use `ub`
is the release handler, I will drop "ub =".
Thanks,
Ming
next prev parent reply other threads:[~2025-08-26 8:33 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-25 12:48 [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed Ming Lei
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
2025-08-26 5:49 ` Caleb Sander Mateos
2025-08-26 8:32 ` Ming Lei [this message]
2025-08-29 15:37 ` Caleb Sander Mateos
2025-09-01 8:44 ` Ming Lei
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
2025-08-26 4:53 ` Caleb Sander Mateos
2025-08-26 8:36 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aK1xLM9TLUBxEmmC@fedora \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=csander@purestorage.com \
--cc=kbusch@kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=ushankar@purestorage.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.