Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	joseph.qi@linux.alibaba.com
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism
Date: Mon, 29 Aug 2022 16:11:13 +0800	[thread overview]
Message-ID: <Ywx0oUcqvE9JcpnT@T590> (raw)
In-Reply-To: <35dd7300-12ea-62ba-393e-145eae318944@linux.alibaba.com>

On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 13:40, Ming Lei wrote:
> > On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
> >> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash
> >> this rq is actually handled by an io_uring fallback wq. We have to
> >> end(abort) this rq since this fallback wq is a task other than the
> >> crashed task. However, current code does not call io_uring_cmd_done()
> >> at the same time but do it in ublk_cancel_queue(). With current design,
> >> this does work because ublk_cancel_queue() is called AFTER del_gendisk(),
> >> which waits for the rq ended(aborted) in fallback wq. This implies that
> >> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done()
> >> on the corresponding ioucmd in ublk_cancel_queue().
> > 
> > Right.
> > 
> >>
> >> However, while considering recovery feature, we cannot rely on
> >> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
> >> rqs because we may not want any aborted rq. Besides, io_uring does not
> >> provide "flush fallback" machenism so we cannot trace this ioucmd.
> > 
> > Why not?
> > 
> > If user recovery is enabled, del_gendisk() can be replaced with
> > blk_mq_quiesce_queue(), then let abort work function do:
> > 
> > - cancel all in-flight requests by holding them into requeue list
> >   instead of finishing them as before, and this way is safe because
> >   abort worker does know the ubq daemon is dying
> > - cancel pending commands as before, because the situation is same
> >   with disk deleted or queue frozen
> 
> The problem is: we cannot control when fallback wq is scheduled.
> So we are unsafe to call io_uring_cmd_done() in another process.

What is the other process?

It can't be fallback wq since any ublk request is aborted at the beginning
of __ublk_rq_task_work().

It shouldn't be the process calling ublk_cancel_dev(), since it is
safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0.

Or others?

> Otherwise, there is a UAF, just as
> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work).

As I mentioned, del_gendisk() can be replaced with
blk_mq_quiesce_queue() in case of user recovery, then no any new
request can be queued after blk_mq_quiesce_queue() returns.

> 
> Yeah I know the answer is very simple: flush the fallback wq.
> But here are two more questions:

I don't see why we need to flush fallback wq, care to provide some
details?

> 
> (1) Should ublk_drv rely on the fallback wq machenism?
>     IMO, ublk_drv should not know detail of io_uring_cmd_complete_in_task()
>     because its implementation may change in the future.
>     BTW, I think current ublk_rq_task_work_cb() is not correct because
>     it does not always call io_uring_cmd_done() before returning.
>     nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd
>     no matter the rq succeeds or fails.
> 
> (2) Suppose io_uring does export the symbol 'flush_fallback_work', should we call
>     it before starting a new process(recovery)?
>     What if fallback wq is not scheduled immediately if there are many processes
>     running and the system overhead is heavy. In this case the recovery process
>     may wait for too long. Really we should not depend on fallback wq and please
>     let the fallback wq complete the ioucmd itself.
> 
> > 
> > With this way, the current abort logic won't be changed much.
> > 
> > And user recovery should only be started _after_ ublk device is found
> > as aborted.
> 
> START_RECOVERY will check if all ubq_daemons(the process) are PF_EXITING.

That is different. If START_RECOVERY is only run on aborted device, the
recovery handler could be simplified.

> 
> > 
> >>
> >> The recovery machenism needs to complete all ioucmds of a dying ubq
> >> to avoid leaking io_uring ctx. But as talked above, we are unsafe
> >> to call io_uring_cmd_done() in the recovery task if fallback wq happens
> >> to run simultaneously. This is a UAF case because io_uring ctx may be
> >> freed. Actually a similar case happens in
> >> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> >> re-issued request aborted previously to ioucmd's task_work).
> > 
> > If you take the above approach, I guess there isn't such problem because
> > abort can handle the case well as before.
> 
> Ming, we did think this approach(quiesce, requeue rq/complete ioucmd)
> at the very beginning. But we decided to drop it because we don not want
> rely on 'flush fallback wq' machenism, which
> makes ublk_drv rely on io_uring's internal implementation.

Then the focus is 'flush fallback wq', please see my above question.


Thanks,
Ming

next prev parent reply	other threads:[~2022-08-29  8:11 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-24  5:47 [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support ZiyangZhang
2022-08-24  5:47 ` [RFC PATCH 1/9] ublk_drv: check 'current' instead of 'ubq_daemon' ZiyangZhang
2022-08-29  2:13   ` Ming Lei
2022-08-24  5:47 ` [RFC PATCH 2/9] ublk_drv: refactor ublk_cancel_queue() ZiyangZhang
2022-08-29  3:01   ` Ming Lei
2022-08-29  4:50     ` Ziyang Zhang
2022-08-24  5:47 ` [RFC PATCH 3/9] ublk_drv: add a helper to get ioucmd from pdu ZiyangZhang
2022-08-29  3:06   ` Ming Lei
2022-08-29  4:59     ` Ziyang Zhang
2022-08-24  5:47 ` [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism ZiyangZhang
2022-08-29  5:40   ` Ming Lei
2022-08-29  6:13     ` Ziyang Zhang
2022-08-29  8:11       ` Ming Lei [this message]
2022-08-29  9:09         ` Ziyang Zhang
2022-08-29 10:02           ` Ming Lei
2022-08-24  5:47 ` [RFC PATCH 5/9] ublk_drv: refactor ublk_stop_dev() ZiyangZhang
2022-08-24  5:47 ` [RFC PATCH 6/9] ublk_drv: add pr_devel() to prepare for recovery feature ZiyangZhang
2022-08-24  5:47 ` [RFC PATCH 7/9] ublk_drv: define macros for recovery feature and check them ZiyangZhang
2022-08-24  5:47 ` [RFC PATCH 8/9] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support ZiyangZhang
2022-08-24  5:47 ` [RFC PATCH 9/9] ublk_drv: do not schedule monitor_work with recovery feature enabled ZiyangZhang
2022-08-29  2:08 ` [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support Ming Lei
2022-08-29  4:00   ` Ziyang Zhang
2022-08-29  5:56     ` Ming Lei
2022-08-29  7:29       ` Ziyang Zhang
2022-08-29  8:38         ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ywx0oUcqvE9JcpnT@T590 \
    --to=ming.lei@redhat.com \
    --cc=ZiyangZhang@linux.alibaba.com \
    --cc=axboe@kernel.dk \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox