linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Uday Shankar <ushankar@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	Caleb Sander Mateos <csander@purestorage.com>
Subject: Re: [PATCH 6/9] ublk: improve detection and handling of ublk server exit
Date: Tue, 15 Apr 2025 09:54:13 +0800	[thread overview]
Message-ID: <Z_28RbNHq1Knm6Ef@fedora> (raw)
In-Reply-To: <Z/1x19wmebu/RnPK@dev-ushankar.dev.purestorage.com>

On Mon, Apr 14, 2025 at 02:36:39PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:47PM +0800, Ming Lei wrote:
> > From: Uday Shankar <ushankar@purestorage.com>
> > 
> > There are currently two ways in which ublk server exit is detected by
> > ublk_drv:
> > 
> > 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
> >    have not been completed to the ublk server when it exits, io_uring
> >    calls the uring_cmd callback with a special cancellation flag as the
> >    issuing task is exiting.
> > 2. I/O timeout. This is needed in addition to the above to handle the
> >    "saturated queue" case, when all I/Os for a given queue are in the
> >    ublk server, and therefore there are no outstanding uring_cmds to
> >    cancel when the ublk server exits.
> > 
> > There are a couple of issues with this approach:
> > 
> > - It is complex and inelegant to have two methods to detect the same
> >   condition
> > - The second method detects ublk server exit only after a long delay
> >   (~30s, the default timeout assigned by the block layer). This delays
> >   the nosrv behavior from kicking in and potential subsequent recovery
> >   of the device.
> > 
> > The second issue is brought to light with the new test_generic_06 which
> > will be added in following patch. It fails before this fix:
> > 
> > selftests: ublk: test_generic_06.sh
> > dev id is 0
> > dd: error writing '/dev/ublkb0': Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes copied, 30.0611 s, 0.0 kB/s
> > DEAD
> > dd took 31 seconds to exit (>= 5s tolerance)!
> > generic_06 : [FAIL]
> > 
> > Fix this by instead detecting and handling ublk server exit in the
> > character file release callback. This has several advantages:
> > 
> > - This one place can handle both saturated and unsaturated queues. Thus,
> >   it replaces both preexisting methods of detecting ublk server exit.
> > - It runs quickly on ublk server exit - there is no 30s delay.
> > - It starts the process of removing task references in ublk_drv. This is
> >   needed if we want to relax restrictions in the driver like letting
> >   only one thread serve each queue
> > 
> > There is also the disadvantage that the character file release callback
> > can also be triggered by intentional close of the file, which is a
> > significant behavior change. Preexisting ublk servers (libublksrv) are
> > dependent on the ability to open/close the file multiple times. To
> > address this, only transition to a nosrv state if the file is released
> > while the ublk device is live. This allows for programs to open/close
> > the file multiple times during setup. It is still a behavior change if a
> > ublk server decides to close/reopen the file while the device is LIVE
> > (i.e. while it is responsible for serving I/O), but that would be highly
> > unusual. This behavior is in line with what is done by FUSE, which is
> > very similar to ublk in that a userspace daemon is providing services
> > traditionally provided by the kernel.
> > 
> > With this change in, the new test (and all other selftests, and all
> > ublksrv tests) pass:
> > 
> > selftests: ublk: test_generic_06.sh
> > dev id is 0
> > dd: error writing '/dev/ublkb0': Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes copied, 0.0376731 s, 0.0 kB/s
> > DEAD
> > generic_04 : [PASS]
> > 
> > Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 231 ++++++++++++++++++++++-----------------
> >  1 file changed, 131 insertions(+), 100 deletions(-)
> > 
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index b68bd4172fa8..e02f205f6da4 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -199,8 +199,6 @@ struct ublk_device {
> >  	struct completion	completion;
> >  	unsigned int		nr_queues_ready;
> >  	unsigned int		nr_privileged_daemon;
> > -
> > -	struct work_struct	nosrv_work;
> >  };
> >  
> >  /* header of ublk_params */
> > @@ -209,8 +207,9 @@ struct ublk_params_header {
> >  	__u32	types;
> >  };
> >  
> > -static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
> > -
> > +static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> > +static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > +static void __ublk_quiesce_dev(struct ublk_device *ub);
> >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> >  		struct ublk_queue *ubq, int tag, size_t offset);
> >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > @@ -1336,8 +1335,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
> >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  {
> >  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > -	unsigned int nr_inflight = 0;
> > -	int i;
> >  
> >  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> >  		if (!ubq->timeout) {
> > @@ -1348,26 +1345,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  		return BLK_EH_DONE;
> >  	}
> >  
> > -	if (!ubq_daemon_is_dying(ubq))
> > -		return BLK_EH_RESET_TIMER;
> > -
> > -	for (i = 0; i < ubq->q_depth; i++) {
> > -		struct ublk_io *io = &ubq->ios[i];
> > -
> > -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> > -			nr_inflight++;
> > -	}
> > -
> > -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> > -	if (nr_inflight == ubq->q_depth) {
> > -		struct ublk_device *ub = ubq->dev;
> > -
> > -		if (ublk_abort_requests(ub, ubq)) {
> > -			schedule_work(&ub->nosrv_work);
> > -		}
> > -		return BLK_EH_DONE;
> > -	}
> > -
> >  	return BLK_EH_RESET_TIMER;
> >  }
> >  
> > @@ -1525,13 +1502,112 @@ static void ublk_reset_ch_dev(struct ublk_device *ub)
> >  	ub->nr_privileged_daemon = 0;
> >  }
> >  
> > +static struct gendisk *ublk_get_disk(struct ublk_device *ub)
> > +{
> > +	struct gendisk *disk;
> > +
> > +	spin_lock(&ub->lock);
> > +	disk = ub->ub_disk;
> > +	if (disk)
> > +		get_device(disk_to_dev(disk));
> > +	spin_unlock(&ub->lock);
> > +
> > +	return disk;
> > +}
> > +
> > +static void ublk_put_disk(struct gendisk *disk)
> > +{
> > +	if (disk)
> > +		put_device(disk_to_dev(disk));
> > +}
> > +
> >  static int ublk_ch_release(struct inode *inode, struct file *filp)
> >  {
> >  	struct ublk_device *ub = filp->private_data;
> > +	struct gendisk *disk;
> > +	int i;
> > +
> > +	/*
> > +	 * disk isn't attached yet, either device isn't live, or it has
> > +	 * been removed already, so we needn't to do anything
> > +	 */
> > +	disk = ublk_get_disk(ub);
> > +	if (!disk)
> > +		goto out;
> > +
> > +	/*
> > +	 * All uring_cmd are done now, so abort any request outstanding to
> > +	 * the ublk server
> > +	 *
> > +	 * This can be done in lockless way because ublk server has been
> > +	 * gone
> > +	 *
> > +	 * More importantly, we have to provide forward progress guarantee
> > +	 * without holding ub->mutex, otherwise control task grabbing
> > +	 * ub->mutex triggers deadlock
> > +	 *
> > +	 * All requests may be inflight, so ->canceling may not be set, set
> > +	 * it now.
> > +	 */
> > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > +		ubq->canceling = true;
> > +		ublk_abort_queue(ub, ubq);
> > +	}
> > +	blk_mq_kick_requeue_list(disk->queue);
> > +
> > +	/*
> > +	 * All infligh requests have been completed or requeued and any new
> > +	 * request will be failed or requeued via `->canceling` now, so it is
> > +	 * fine to grab ub->mutex now.
> > +	 */
> > +	mutex_lock(&ub->mutex);
> > +
> > +	/* double check after grabbing lock */
> > +	if (!ub->ub_disk)
> > +		goto unlock;
> > +
> > +	/*
> > +	 * Transition the device to the nosrv state. What exactly this
> > +	 * means depends on the recovery flags
> > +	 */
> > +	blk_mq_quiesce_queue(disk->queue);
> > +	if (ublk_nosrv_should_stop_dev(ub)) {
> > +		/*
> > +		 * Allow any pending/future I/O to pass through quickly
> > +		 * with an error. This is needed because del_gendisk
> > +		 * waits for all pending I/O to complete
> > +		 */
> > +		for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> > +			ublk_get_queue(ub, i)->force_abort = true;
> > +		blk_mq_unquiesce_queue(disk->queue);
> > +
> > +		ublk_stop_dev_unlocked(ub);
> > +	} else {
> > +		if (ublk_nosrv_dev_should_queue_io(ub)) {
> > +			/*
> > +			 * keep request queue as quiesced for queuing new IO
> > +			 * during QUIESCED state
> > +			 *
> > +			 * request queue will be unquiesced in ending
> > +			 * recovery, see ublk_ctrl_end_recovery
> > +			 */
> 
> Does this comment need an update after
> 
> [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
> 
> If I read that one right, actually the request queue is not quiesced
> anymore during the UBLK_S_DEV_QUIESCED state

The comment is removed actually in the following patch, but it shouldn't have
been added here, will do it in V2.


Thanks, 
Ming


  reply	other threads:[~2025-04-15  1:54 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
2025-04-14 11:25 ` [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL Ming Lei
2025-04-14 19:44   ` Uday Shankar
2025-04-15  1:32     ` Ming Lei
2025-04-14 11:25 ` [PATCH 2/9] ublk: properly serialize all FETCH_REQs Ming Lei
2025-04-14 19:58   ` Uday Shankar
2025-04-14 20:39     ` Jens Axboe
2025-04-14 20:52       ` Uday Shankar
2025-04-14 21:00         ` Jens Axboe
2025-04-15  1:40         ` Ming Lei
2025-04-16  1:13       ` Ming Lei
2025-04-16  1:17         ` Jens Axboe
2025-04-16  2:04           ` Ming Lei
2025-04-16  1:04     ` Uday Shankar
2025-04-14 11:25 ` [PATCH 3/9] ublk: add ublk_force_abort_dev() Ming Lei
2025-04-14 20:06   ` Uday Shankar
2025-04-14 11:25 ` [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io Ming Lei
2025-04-14 20:15   ` Uday Shankar
2025-04-15  1:48     ` Ming Lei
2025-04-14 11:25 ` [PATCH 5/9] ublk: move device reset into ublk_ch_release() Ming Lei
2025-04-14 20:29   ` Uday Shankar
2025-04-15  1:50     ` Ming Lei
2025-04-14 11:25 ` [PATCH 6/9] ublk: improve detection and handling of ublk server exit Ming Lei
2025-04-14 20:36   ` Uday Shankar
2025-04-15  1:54     ` Ming Lei [this message]
2025-04-14 11:25 ` [PATCH 7/9] ublk: remove __ublk_quiesce_dev() Ming Lei
2025-04-14 20:37   ` Uday Shankar
2025-04-14 11:25 ` [PATCH 8/9] ublk: simplify aborting ublk request Ming Lei
2025-04-14 20:42   ` Uday Shankar
2025-04-14 11:25 ` [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject Ming Lei
2025-04-14 20:44   ` Uday Shankar
2025-04-15  1:57     ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z_28RbNHq1Knm6Ef@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=csander@purestorage.com \
    --cc=linux-block@vger.kernel.org \
    --cc=ushankar@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).