From: Chaitanya Kulkarni <chaitanyak@nvidia.com>
To: Bart Van Assche <bvanassche@acm.org>,
Chaitanya Kulkarni <chaitanyak@nvidia.com>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "axboe@kernel.dk" <axboe@kernel.dk>,
"damien.lemoal@opensource.wdc.com"
<damien.lemoal@opensource.wdc.com>,
"johannes.thumshirn@wdc.com" <johannes.thumshirn@wdc.com>,
"ming.lei@redhat.com" <ming.lei@redhat.com>,
"shinichiro.kawasaki@wdc.com" <shinichiro.kawasaki@wdc.com>,
"vincent.fu@samsung.com" <vincent.fu@samsung.com>,
"yukuai3@huawei.com" <yukuai3@huawei.com>
Subject: Re: [PATCH] null_blk: allow teardown on request timeout
Date: Wed, 19 Oct 2022 04:19:50 +0000
Message-ID: <d3e05b4e-466b-844c-b815-79233856e527@nvidia.com>
In-Reply-To: <f2baa3b4-81c9-a6d8-0c26-3e695dad5d10@acm.org>
On 10/17/22 07:21, Bart Van Assche wrote:
> On 10/15/22 22:20, Chaitanya Kulkarni wrote:
>> In the current timeout implementation, null_blk just completes the
>> request with error=BLK_STS_TIMEOUT without doing any cleanup. Hence the
>> device cleanup code, including handling of inflight requests on timeout
>> and teardown, is never exercised.
>
> Hi Chaitanya,
>
> How about removing that code instead of adding a mechanism for
> triggering it?
>
Can you please elaborate on this? Which code needs to be removed?
>> Add a module parameter rq_abort_limit to allow null_blk to perform
>> device cleanup when a timeout occurs. A non-zero value of this
>> parameter lets the user set the number of timeouts that must occur
>> before cleanup/teardown work is triggered.
>
> As Ming Lei wrote, there are no other block drivers that destroy
> themselves if a certain number of timeouts occur. It seems weird to me
> to trigger self-removal from inside a timeout handler.
>
Ming thought I was proposing device removal as the first line of action
in the timeout callback, without first looking into whether the request
can be aborted and the device made functional again. That is not what I
am proposing: the new module parameter lets the user set how many
requests must time out before the teardown sequence is engaged.
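To make the intended semantics concrete, here is a rough userspace sketch
of "count timeouts, request teardown once the limit is reached". It is
not the code from the patch: only rq_abort_limit comes from the patch
description, while struct fake_dev and the fake_dev_*() helpers are
hypothetical names, and I am assuming that teardown is requested on the
timeout that reaches the limit and that it is requested only once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; only rq_abort_limit is from the patch. */
struct fake_dev {
	unsigned int rq_abort_limit;	/* 0 = never tear down on timeout */
	atomic_uint timeouts;		/* timeouts seen so far */
	atomic_bool teardown_queued;	/* set once teardown was requested */
};

static void fake_dev_init(struct fake_dev *dev, unsigned int limit)
{
	assert(dev);
	dev->rq_abort_limit = limit;
	atomic_init(&dev->timeouts, 0);
	atomic_init(&dev->teardown_queued, false);
}

/*
 * Called once per timed-out request. Returns true only for the single
 * call that reaches the limit, i.e. the caller that should schedule the
 * teardown work. All other callers just complete the request with an
 * error, as today.
 */
static bool fake_dev_note_timeout(struct fake_dev *dev)
{
	unsigned int seen;

	/* Not opted in: keep the existing behaviour. */
	if (dev->rq_abort_limit == 0)
		return false;

	seen = atomic_fetch_add(&dev->timeouts, 1) + 1;
	if (seen < dev->rq_abort_limit)
		return false;

	/* atomic_exchange() returns the old value: first caller sees false. */
	return !atomic_exchange(&dev->teardown_queued, true);
}
```

With a limit of 3, the first two timeouts return false, the third
returns true, and any later timeout returns false again because the
teardown has already been requested. With a limit of 0 nothing ever
triggers teardown.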
The nvme-rdma host (and I guess the nvme-tcp host) does have similar
behavior: it removes the device from the err_work issued from the
request timeout callback:
From nvme/host/rdma.c:
nvme_rdma_timeout()
  nvme_rdma_error_recovery()
    nvme_err_work() -> nvme_reset_wq
      nvme_rdma_error_recovery_work()
        ...
        nvme_rdma_tear_down_io_queues()
          nvme_start_freeze()
            blk_freeze_queue_start()
          nvme_stop_queues()
            nvme_stop_ns_queue()
              blk_mq_quiesce_queue() or blk_mq_wait_quiesce_done()
          nvme_sync_io_queues()
            blk_sync_queue()
        nvme_start_queues()
          nvme_start_ns_queue()
            blk_mq_unquiesce_queue()
        nvme_rdma_reconnect_or_remove()
Also, I've listed the problems that I've seen first hand from keeping a
device in the system that is non-responsive due to request timeouts. In
that case we should let the user decide whether to remove or keep the
device, instead of forcing the user to keep a device that can bring
down the whole system. These problems are really hard to debug, even
with Teledyne LeCroy [1]. This patch follows the same philosophy: the
user can opt in to removal with a module parameter. Once opted in, the
user knows what he is getting into.
-ck
[1]
https://teledynelecroy.com/protocolanalyzer/pci-express/interposers-and-probes