From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Uday Shankar <ushankar@purestorage.com>,
Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
Date: Tue, 26 Aug 2025 16:32:44 +0800 [thread overview]
Message-ID: <aK1xLM9TLUBxEmmC@fedora> (raw)
In-Reply-To: <CADUfDZp1-Tg+Gp+9kaudFNYyz3mLjNn+v=H3wKLEoKDftcX1wg@mail.gmail.com>
On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > When running test_stress_04.sh, the following warning is triggered:
> >
> > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> >
> > This happens when the daemon is abruptly killed:
> >
> > - some references may still be held, because registering IO buffer
> > doesn't grab ublk char device reference
>
> Ah, good observation. That's definitely a problem.
>
> >
> > OR
> >
> > - io->task_registered_buffers won't be cleared because io buffer is
> > released from non-daemon context
>
> I don't think the task_registered_buffers optimization is really
> involved here; that's just a different way of tracking the reference
> count. Regardless of what task the buffer is registered or
> unregistered on, the buffer still counts as 1 reference on the
> ublk_io. Summing up the reference counts and making sure they are both
> reset to 0 seems like a good approach to me.
The warning in ublk_queue_reinit() is triggered because
- the reference is not dropped
OR
- the io buf unregister is done from another task context(kernel wq), so
both io->ref and io->task_registered_buffers are not zero, which is started
from task_registered_buffers optimization
>
> >
> > For zero-copy and auto buffer register modes, I/O reference crosses
> > syscalls, so IO reference may not be dropped naturally when ublk server is
> > killed abruptly. However, when releasing io_uring context, it is guaranteed
> > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > io_ring_ctx_free().
> >
> > Fix this by adding ublk_drain_io_references() that:
> > - Waits for active I/O references dropped in async way by scheduling
> > work function, for avoiding ublk dev and io_uring file's release
> > dependency
> > - Reinitializes io->ref and io->task_registered_buffers to clean state
> >
> > This ensures the reference count state is clean when ublk_queue_reinit()
> > is called, preventing the warning and potential use-after-free.
>
> One scenario I worry about is if the ublk server has already issued
> UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> has registered buffer(s). It's possible the ublk server hasn't
> finished performing I/O to/from the registered buffers and so the I/O
> isn't really complete yet. But when io_uring automatically releases
> the registered buffers, the reference count will hit 0 and the ublk
> I/O will be completed successfully. There seems to be some data
> corruption risk in this scenario.
The final io buffer unregister is from freeing io_uring conext in
io_uring_release(), any user of this uring context has been done,
so it isn't possible that ublk server is performing io with the
un-registered buffer.
> But maybe it doesn't make sense for
> a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> before knowing whether the zero-copy I/Os succeeded?
>
> >
> > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 92 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 99abd67b708b..f608c7efdc05 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -239,6 +239,7 @@ struct ublk_device {
> > struct mutex cancel_mutex;
> > bool canceling;
> > pid_t ublksrv_tgid;
> > + struct delayed_work exit_work;
> > };
> >
> > /* header of ublk_params */
> > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > ublk_get_queue(ub, i)->canceling = canceling;
> > }
> >
> > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > {
> > - struct ublk_device *ub = filp->private_data;
> > + int i, j;
> > +
> > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > + UBLK_F_AUTO_BUF_REG)))
> > + return;
> > +
> > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > + for (j = 0; j < ubq->q_depth; j++) {
> > + struct ublk_io *io = &ubq->ios[j];
> > + /*
> > + * Reinitialize reference counting fields after
> > + * draining. This ensures clean state for queue
> > + * reinitialization.
> > + */
> > + refcount_set(&io->ref, 0);
> > + io->task_registered_buffers = 0;
> > + }
> > + }
> > +}
> > +
> > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > +{
> > + int i, j;
> > +
> > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > + UBLK_F_AUTO_BUF_REG)))
> > + return false;
> > +
> > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > + for (j = 0; j < ubq->q_depth; j++) {
> > + struct ublk_io *io = &ubq->ios[j];
> > + unsigned int refs = refcount_read(&io->ref) +
> > + io->task_registered_buffers;
> > +
> > + /*
> > + * UBLK_REFCOUNT_INIT or zero means no active
> > + * reference
> > + */
> > + if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > + return true;
>
> It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> UBLK_REFCOUNT_INIT active references. It would be safer to check
> UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> count needs to reach 0.
It is actually one invariant that the two's sum is zero or UBLK_REFCOUNT_INIT
any time if the io buffer isn't registered, so it is enough and
simpler.
>
> > + }
> > + }
> > + return false;
> > +}
> > +
> > +static void ublk_ch_release_work_fn(struct work_struct *work)
> > +{
> > + struct ublk_device *ub =
> > + container_of(work, struct ublk_device, exit_work.work);
> > struct gendisk *disk;
> > int i;
> >
> > + /*
> > + * For zero-copy and auto buffer register modes, I/O references
> > + * might not be dropped naturally when the daemon is killed, but
> > + * io_uring guarantees that registered bvec kernel buffers are
> > + * unregistered finally when freeing io_uring context, then the
> > + * active references are dropped.
> > + *
> > + * Wait until active references are dropped for avoiding use-after-free
> > + *
> > + * registered buffer may be unregistered in io_ring's release hander,
> > + * so have to wait by scheduling work function for avoiding the two
> > + * file release dependency.
> > + */
> > + if (ublk_has_io_with_active_ref(ub)) {
> > + schedule_delayed_work(&ub->exit_work, 1);
> > + return;
> > + }
> > +
> > + ublk_reset_io_ref(ub);
>
> Why the 2 separate loops over nr_hw_queues and q_depth? Could they be
> combined into a single nested loop that waits for each ublk_io's
> references to be released and then resets its reference counts to 0?
> Looks like the ub->dev_info.flags checks could also be consolidated.
This way looks more readable, otherwise ublk_has_io_with_active_ref()
has to check and reset. Not a problem, I can move it to one single helper.
>
> > +
> > /*
> > * disk isn't attached yet, either device isn't live, or it has
> > * been removed already, so we needn't to do anything
> > @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
> > ublk_reset_ch_dev(ub);
> > out:
> > clear_bit(UB_STATE_OPEN, &ub->state);
> > +
> > + /* put the reference grabbed in ublk_ch_release() */
> > + ublk_put_device(ub);
> > +}
> > +
> > +static int ublk_ch_release(struct inode *inode, struct file *filp)
> > +{
> > + struct ublk_device *ub = filp->private_data;
> > +
> > + /*
> > + * Grab ublk device reference, so it won't be gone until we are
> > + * really released from work function.
> > + */
> > + ub = ublk_get_device(ub);
>
> Can taking a reference fail here? If so, the NULL return value would
> need to be handled. If not, the "ub =" could be dropped.
Of course it has to get a reference, otherwise it isn't safe to use `ub`
is the release handler, I will drop "ub =".
Thanks,
Ming
next prev parent reply other threads:[~2025-08-26 8:33 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-25 12:48 [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed Ming Lei
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
2025-08-26 5:49 ` Caleb Sander Mateos
2025-08-26 8:32 ` Ming Lei [this message]
2025-08-29 15:37 ` Caleb Sander Mateos
2025-09-01 8:44 ` Ming Lei
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
2025-08-26 4:53 ` Caleb Sander Mateos
2025-08-26 8:36 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aK1xLM9TLUBxEmmC@fedora \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=csander@purestorage.com \
--cc=kbusch@kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=ushankar@purestorage.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox