From: Ming Lei <ming.lei@redhat.com>
To: Chengming Zhou <chengming.zhou@linux.dev>
Cc: axboe@kernel.dk, kch@nvidia.com, dhowells@redhat.com,
damien.lemoal@opensource.wdc.com, bvanassche@acm.org,
nj.shetty@samsung.com, kbusch@kernel.org,
zhouchengming@bytedance.com, akinobu.mita@gmail.com,
shinichiro.kawasaki@wdc.com, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, ming.lei@redhat.com
Subject: Re: [PATCH] null_blk: fix poll request timeout handling
Date: Thu, 17 Aug 2023 17:02:40 +0800 [thread overview]
Message-ID: <ZN3iMEKfM0Srd7L/@fedora> (raw)
In-Reply-To: <6726fde4-a766-b61f-dac8-5ec8e3536f0f@linux.dev>
On Tue, Aug 15, 2023 at 05:45:10PM +0800, Chengming Zhou wrote:
> On 2023/8/15 16:31, Ming Lei wrote:
> > On Tue, Aug 15, 2023 at 02:04:42PM +0800, chengming.zhou@linux.dev wrote:
> >> From: Chengming Zhou <zhouchengming@bytedance.com>
> >>
> >> When doing io_uring benchmark on /dev/nullb0, it's easy to crash the
> >> kernel if poll requests timeout triggered, as reported by David. [1]
> >
> > Just be curious, how is the timeout triggered when running
> > "./fio/t/io_uring -r20 /dev/nullb0"?
>
> I tried "./fio/t/io_uring -r20 /dev/nullb0" multiple times, sometimes
> program exit ok, sometimes it dump many timeout messages and kernel BUG.
>
> I just used David's "./fio/t/io_uring -n4 /dev/nullb0", run a while
> then ctrl-C, the program will always dump many timeout messages.
>
> It seems that program exit is not clean, leave some requests in driver
> haven't been polled & completed? I don't know.
Just run a trace on this issue, looks it is caused by delayed __fput():
1) it is pretty easy to cause __fput(io_uring) too late in case of passing -n4
2) then io_ring_exit_work() is scheduled(for polling in-completed IOs) too late
since it is started in io_uring_release()
3) null_blk sets timeout as 5sec, so the issue can be triggered easier
But not see delayed __fput() triggered on nvme yet, still a bit
confused.
> >
> > David mentioned that the issue is triggered in 6.5-rc1, maybe one
> > regression?
> >
>
> I just tested using v6.4.9, found the same timeout and kernel BUG
> using "./fio/t/io_uring -n4 /dev/nullb0".
>
> >>
> >> BUG: kernel NULL pointer dereference, address: 0000000000000008
> >> Workqueue: kblockd blk_mq_timeout_work
> >> RIP: 0010:null_timeout_rq+0x4e/0x91
> >> Call Trace:
> >> ? __die_body+0x1a/0x5c
> >> ? page_fault_oops+0x6f/0x9c
> >> ? kernelmode_fixup_or_oops+0xc6/0xd6
> >> ? __bad_area_nosemaphore+0x44/0x1eb
> >> ? exc_page_fault+0xe2/0xf4
> >> ? asm_exc_page_fault+0x22/0x30
> >> ? null_timeout_rq+0x4e/0x91
> >> blk_mq_handle_expired+0x31/0x4b
> >> bt_iter+0x68/0x84
> >> ? bt_tags_iter+0x81/0x81
> >> __sbitmap_for_each_set.constprop.0+0xb0/0xf2
> >> ? __blk_mq_complete_request_remote+0xf/0xf
> >> bt_for_each+0x46/0x64
> >> ? __blk_mq_complete_request_remote+0xf/0xf
> >> ? percpu_ref_get_many+0xc/0x2a
> >> blk_mq_queue_tag_busy_iter+0x14d/0x18e
> >> blk_mq_timeout_work+0x95/0x127
> >> process_one_work+0x185/0x263
> >> worker_thread+0x1b5/0x227
> >> ? rescuer_thread+0x287/0x287
> >> kthread+0xfa/0x102
> >> ? kthread_complete_and_exit+0x1b/0x1b
> >> ret_from_fork+0x22/0x30
> >>
> >> This is indeed a race problem between null_timeout_rq() and null_poll().
> >>
> >> null_poll() null_timeout_rq()
> >> spin_lock(&nq->poll_lock)
> >> list_splice_init(&nq->poll_list, &list)
> >> spin_unlock(&nq->poll_lock)
> >>
> >> while (!list_empty(&list))
> >> req = list_first_entry()
> >> list_del_init()
> >> ...
> >> blk_mq_add_to_batch()
> >> // req->rq_next = NULL
> >> spin_lock(&nq->poll_lock)
> >>
> >> // rq->queuelist->next == NULL
> >> list_del_init(&rq->queuelist)
> >>
> >> spin_unlock(&nq->poll_lock)
> >>
> >> What's worse is that we don't call blk_mq_complete_request_remote()
> >> before blk_mq_add_to_batch(), so these completed requests have wrong
> >> rq->state == MQ_RQ_IN_FLIGHT. We can easily check this using bpftrace:
> >>
> >> ```
> >> bpftrace -e 'kretfunc:null_blk:null_poll {
> >> $iob=(struct io_comp_batch *)args->iob;
> >> @[$iob->req_list->state]=count();
> >> }'
> >>
> >> @[1]: 51708
> >> ```
> >>
> >> Fix these problems by setting requests state to MQ_RQ_COMPLETE under
> >> nq->poll_lock protection, in which null_timeout_rq() can safely detect
> >> this race and early return.
> >>
> >> [1] https://lore.kernel.org/all/3893581.1691785261@warthog.procyon.org.uk/
> >>
> >> Fixes: 0a593fbbc245 ("null_blk: poll queue support")
> >> Reported-by: David Howells <dhowells@redhat.com>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> ---
> >> drivers/block/null_blk/main.c | 12 ++++++++++--
> >> 1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> >> index 864013019d6b..968090935eb2 100644
> >> --- a/drivers/block/null_blk/main.c
> >> +++ b/drivers/block/null_blk/main.c
> >> @@ -1643,9 +1643,12 @@ static int null_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
> >> struct nullb_queue *nq = hctx->driver_data;
> >> LIST_HEAD(list);
> >> int nr = 0;
> >> + struct request *rq;
> >>
> >> spin_lock(&nq->poll_lock);
> >> list_splice_init(&nq->poll_list, &list);
> >> + list_for_each_entry(rq, &list, queuelist)
> >> + blk_mq_set_request_complete(rq);
> >> spin_unlock(&nq->poll_lock);
> >>
> >> while (!list_empty(&list)) {
> >> @@ -1671,16 +1674,21 @@ static enum blk_eh_timer_return null_timeout_rq(struct request *rq)
> >> struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> >> struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);
> >>
> >> - pr_info("rq %p timed out\n", rq);
> >> -
> >> if (hctx->type == HCTX_TYPE_POLL) {
> >> struct nullb_queue *nq = hctx->driver_data;
> >>
> >> spin_lock(&nq->poll_lock);
> >> + /* The request may have completed meanwhile. */
> >> + if (blk_mq_request_completed(rq)) {
> >> + spin_unlock(&nq->poll_lock);
> >> + return BLK_EH_DONE;
> >> + }
> >> list_del_init(&rq->queuelist);
> >> spin_unlock(&nq->poll_lock);
> >> }
> >
> > I think null_process_cmd() is needed for un-completed request.
> >
>
> The end of function will set BLK_STS_TIMEOUT error and complete request
> using blk_mq_complete_request(), not sure if null_process_cmd() is
> needed in this error case?
The null_blk fix looks fine:
Reviewed-by: Ming Lei <ming.lei@redhat.com>
thanks,
Ming
prev parent reply other threads:[~2023-08-17 9:04 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-15 6:04 [PATCH] null_blk: fix poll request timeout handling chengming.zhou
2023-08-15 8:19 ` David Howells
2023-08-16 9:15 ` Chengming Zhou
2023-08-15 8:31 ` Ming Lei
2023-08-15 9:45 ` Chengming Zhou
2023-08-15 12:18 ` Ming Lei
2023-08-17 9:02 ` Ming Lei [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZN3iMEKfM0Srd7L/@fedora \
--to=ming.lei@redhat.com \
--cc=akinobu.mita@gmail.com \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=chengming.zhou@linux.dev \
--cc=damien.lemoal@opensource.wdc.com \
--cc=dhowells@redhat.com \
--cc=kbusch@kernel.org \
--cc=kch@nvidia.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=nj.shetty@samsung.com \
--cc=shinichiro.kawasaki@wdc.com \
--cc=zhouchengming@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.