All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	Uday Shankar <ushankar@purestorage.com>,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
Date: Mon, 1 Sep 2025 16:44:55 +0800	[thread overview]
Message-ID: <aLVdB4q_Bk5Ypzcr@fedora> (raw)
In-Reply-To: <CADUfDZqJ8PN8FT50gg0grJ+Kj+gOaosRctS_dimWAaRC-myfRw@mail.gmail.com>

On Fri, Aug 29, 2025 at 08:37:28AM -0700, Caleb Sander Mateos wrote:
> On Tue, Aug 26, 2025 at 1:33 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > When running test_stress_04.sh, the following warning is triggered:
> > > >
> > > > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> > > >
> > > > This happens when the daemon is abruptly killed:
> > > >
> > > > - some references may still be held, because registering IO buffer
> > > > doesn't grab ublk char device reference
> > >
> > > Ah, good observation. That's definitely a problem.
> > >
> > > >
> > > > OR
> > > >
> > > > - io->task_registered_buffers won't be cleared because io buffer is
> > > > released from non-daemon context
> > >
> > > I don't think the task_registered_buffers optimization is really
> > > involved here; that's just a different way of tracking the reference
> > > count. Regardless of what task the buffer is registered or
> > > unregistered on, the buffer still counts as 1 reference on the
> > > ublk_io. Summing up the reference counts and making sure they are both
> > > reset to 0 seems like a good approach to me.
> >
> > The warning in ublk_queue_reinit() is triggered because
> >
> > - the reference is not dropped
> >
> > OR
> >
> > - the io buf unregister is done from another task context(kernel wq), so
> > both io->ref and io->task_registered_buffers are not zero, which is started
> > from task_registered_buffers optimization
> 
> Right, but I would consider that to be an outstanding reference. I
> think we're on the same page.
> 
> >
> > >
> > > >
> > > > For zero-copy and auto buffer register modes, I/O reference crosses
> > > > syscalls, so IO reference may not be dropped naturally when ublk server is
> > > > killed abruptly. However, when releasing io_uring context, it is guaranteed
> > > > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > > > io_ring_ctx_free().
> > > >
> > > > Fix this by adding ublk_drain_io_references() that:
> > > > - Waits for active I/O references dropped in async way by scheduling
> > > >   work function, for avoiding ublk dev and io_uring file's release
> > > >   dependency
> > > > - Reinitializes io->ref and io->task_registered_buffers to clean state
> > > >
> > > > This ensures the reference count state is clean when ublk_queue_reinit()
> > > > is called, preventing the warning and potential use-after-free.
> > >
> > > One scenario I worry about is if the ublk server has already issued
> > > UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> > > has registered buffer(s). It's possible the ublk server hasn't
> > > finished performing I/O to/from the registered buffers and so the I/O
> > > isn't really complete yet. But when io_uring automatically releases
> > > the registered buffers, the reference count will hit 0 and the ublk
> > > I/O will be completed successfully. There seems to be some data
> > > corruption risk in this scenario.
> >
> > The final io buffer unregister is from freeing io_uring conext in
> > io_uring_release(), any user of this uring context has been done,
> > so it isn't possible that ublk server is performing io with the
> > un-registered buffer.
> 
> Well, the ublk server could have been killed before it could submit an
> I/O using the registered buffer. But it's an existing concern,
> certainly not introduced by your patch. I think you can make a
> reasonable argument that a ublk server shouldn't submit a
> UBLK_IO_COMMIT_AND_FETCH_REQ until it knows the I/O completed
> successfully or with an error, otherwise it won't be able to change
> the result code if the I/O using the registered buffer unexpectedly
> fails.

Yes, it was one trouble because ublk server can do whatever.

However, since your task_registered_buffers optimization, ublk request
won't be completed from freeing uring_ctx context any more, and it is always
aborted from ublk_abort_queue(), so it is failed in case of killed ublk
server.

The situation can be improved by failing request explicitly by setting io->res
as -EIO, then -EIO can be taken first in case of killed ublk server.

> 
> >
> > > But maybe it doesn't make sense for
> > > a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> > > before knowing whether the zero-copy I/Os succeeded?
> > >
> > > >
> > > > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > > > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > > > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 92 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 99abd67b708b..f608c7efdc05 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -239,6 +239,7 @@ struct ublk_device {
> > > >         struct mutex cancel_mutex;
> > > >         bool canceling;
> > > >         pid_t   ublksrv_tgid;
> > > > +       struct delayed_work     exit_work;
> > > >  };
> > > >
> > > >  /* header of ublk_params */
> > > > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > > >                 ublk_get_queue(ub, i)->canceling = canceling;
> > > >  }
> > > >
> > > > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > > >  {
> > > > -       struct ublk_device *ub = filp->private_data;
> > > > +       int i, j;
> > > > +
> > > > +       if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > +                                       UBLK_F_AUTO_BUF_REG)))
> > > > +               return;
> > > > +
> > > > +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > +               for (j = 0; j < ubq->q_depth; j++) {
> > > > +                       struct ublk_io *io = &ubq->ios[j];
> > > > +                       /*
> > > > +                        * Reinitialize reference counting fields after
> > > > +                        * draining. This ensures clean state for queue
> > > > +                        * reinitialization.
> > > > +                        */
> > > > +                       refcount_set(&io->ref, 0);
> > > > +                       io->task_registered_buffers = 0;
> > > > +               }
> > > > +       }
> > > > +}
> > > > +
> > > > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > > > +{
> > > > +       int i, j;
> > > > +
> > > > +       if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > +                                       UBLK_F_AUTO_BUF_REG)))
> > > > +               return false;
> > > > +
> > > > +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > +               for (j = 0; j < ubq->q_depth; j++) {
> > > > +                       struct ublk_io *io = &ubq->ios[j];
> > > > +                       unsigned int refs = refcount_read(&io->ref) +
> > > > +                               io->task_registered_buffers;
> > > > +
> > > > +                       /*
> > > > +                        * UBLK_REFCOUNT_INIT or zero means no active
> > > > +                        * reference
> > > > +                        */
> > > > +                       if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > > > +                               return true;
> > >
> > > It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> > > UBLK_REFCOUNT_INIT active references. It would be safer to check
> > > UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> > > needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> > > count needs to reach 0.
> >
> > It is actually one invariant that the two's sum is zero or UBLK_REFCOUNT_INIT
> > any time if the io buffer isn't registered, so it is enough and
> > simpler.
> 
> I think it's theoretically possible to make the reference count
> arbitrarily high with ublk registered buffers, which can cause it to
> actually have UBLK_REFCOUNT_INIT references (not because all
> references are released) or the reference count to overflow back to 0.
> One way to do this would be to register the same I/O's buffer in many
> slots of many io_urings at the same time.

The reference is initialized as UBLK_REFCOUNT_INIT, the only way for
triggering it is to make the counter overflow first.


> Another possibility would be
> to repeatedly register the buffer, use it in an I/O that never
> completes (e.g. recv from an empty socket), and then unregister the
> buffer. But this reference count overflow issue is an existing issue,
> not introduced by the patch.

Yes, overflow can be detected & warned by refcount checking, also it is only
allowed for privileged user, so looks this way is just fine.


Thanks, 
Ming


  reply	other threads:[~2025-09-01  8:45 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-25 12:48 [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed Ming Lei
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
2025-08-26  5:49   ` Caleb Sander Mateos
2025-08-26  8:32     ` Ming Lei
2025-08-29 15:37       ` Caleb Sander Mateos
2025-09-01  8:44         ` Ming Lei [this message]
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
2025-08-26  4:53   ` Caleb Sander Mateos
2025-08-26  8:36     ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aLVdB4q_Bk5Ypzcr@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=csander@purestorage.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=ushankar@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.