From: Damien Le Moal <dlemoal@kernel.org>
To: Bart Van Assche <bvanassche@acm.org>, Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
Christoph Hellwig <hch@lst.de>, Jaegeuk Kim <jaegeuk@kernel.org>
Subject: Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
Date: Mon, 25 Nov 2024 13:19:41 +0900 [thread overview]
Message-ID: <ec7b6eef-48e6-42f3-8e6f-54dfd3724393@kernel.org> (raw)
In-Reply-To: <2e55151f-2001-46b6-8f82-cf961bbdf7cf@kernel.org>
On 11/25/24 1:00 PM, Damien Le Moal wrote:
> On 11/22/24 2:51 AM, Bart Van Assche wrote:
>>
>> On 11/20/24 7:32 PM, Damien Le Moal wrote:
>>> On 11/20/24 06:04, Bart Van Assche wrote:
>>>> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>>>>> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>>>>>> If the queue is filled with unaligned writes then the following
>>>>>> deadlock occurs:
>>>>>>
>>>>>> Call Trace:
>>>>>> <TASK>
>>>>>> __schedule+0x8cc/0x2190
>>>>>> schedule+0xdd/0x2b0
>>>>>> blk_queue_enter+0x2ce/0x4f0
>>>>>> blk_mq_alloc_request+0x303/0x810
>>>>>> scsi_execute_cmd+0x3f4/0x7b0
>>>>>> sd_zbc_do_report_zones+0x19e/0x4c0
>>>>>> sd_zbc_report_zones+0x304/0x920
>>>>>> disk_zone_wplug_handle_error+0x237/0x920
>>>>>> disk_zone_wplugs_work+0x17e/0x430
>>>>>> process_one_work+0xdd0/0x1490
>>>>>> worker_thread+0x5eb/0x1010
>>>>>> kthread+0x2e5/0x3b0
>>>>>> ret_from_fork+0x3a/0x80
>>>>>> </TASK>
>>>>>>
>>>>>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>>>>>> deriving the write pointer information from successfully completed zoned
>>>>>> writes.
>>>>>>
>>>>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>>>>
>>>>> Doesn't this need a Fixes tag and CC stable, and come earlier in the series
>>>>> (or
>>>>> sent separately) ?
>>>>
>>>> I will add Fixes: and Cc: stable tags.
>>>>
>>>> I'm not sure how to move this patch earlier since it depends on the
>>>> previous patch in this series ("blk-zoned: Only handle errors after
>>>> pending zoned writes have completed"). Without that patch, it is not
>>>> safe to use zwplug->wp_offset_compl in the error handler.
>>>>
>>>>> Overall, this patch seems wrong anyway as zone reset and zone finish may be
>>>>> done between 2 writes, failing the next one but the recovery done here will
>>>>> use
>>>>> the previous succeful write end position as the wp, which is NOT correct as
>>>>> reset or finish changed that...
>>>>
>>>> I will add support for the zone reset and zone finish commands in this
>>>> patch.
>>>>
>>>>> And we also have the possibility of torn writes
>>>>> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
>>>>> doing a report zone to recover errors.
>>>>
>>>> Thanks for having brought this up. This is something I was not aware of.
>>>>
>>>> disk_zone_wplug_handle_error() submits a new request to retrieve zone
>>>> information while handling an error triggered by other requests. This
>>>> can easily lead to a deadlock as the above call trace shows. How about
>>>> introducing a queue flag for the "report zones" approach in
>>>> disk_zone_wplug_handle_error() such that the "report zones" approach is
>>>> only used for SAS SMR drives?
>>>
>>> Sure, but how would that solve the potential deadlock problem ? ALso, I am not
>>> entirely clear on how the deadlock can happen given that zone write plugs are
>>> queueing/blocking BIOs, not requests. So even assuming you have a large number
>>> of BIOs plugged in a zone write plug, the error handler work should still be
>>> able to issue a request to do a report zones, no ? On which resource can the
>>> deadlock happen ? Plugged BIOs do not yet use a tag, right ?
>>>
>>> What am I missing here ? Or is it maybe something that can happen with your
>>> modifications because you changed the zone write plug behavior to allow for more
>>> than one BIO at a time being unplugged and issued to the device ?
>>>
>>> Note that if you do have a test case for this triggering the deadlock, we
>>> definitely need to solve this and ideally have a blktest case checking it.
>>
>> Hi Damien,
>>
>> The call trace mentioned above comes from the kernel log and was
>> encountered while I was testing this patch series. A reproducer has
>> already been shared - see also
>> https://lore.kernel.org/linux-block/e840b66a-79f0-4169-9ab1-
>> c475d9608e4d@acm.org/. The lockup happened after the queue was filled up
>> with requests and hence sd_zbc_report_zones() failed to allocate an
>> additional request for the zone report.
>>
>> I'm wondering whether this lockup can also happen with the current
>> upstream kernel by submitting multiple unaligned writes simultaneously
>> where each write affects another zone and where the number of writes
>> matches the queue depth.
>
> Let me check. This indeed may be a possibility.
Thinking more about this, the answer is no, this cannot happen. The reason is
that zone write plugs are blocking BIOs, not requests. So the number of BIOs
queued waiting for execution does not matter as we are not holding tags. A
report zone for recovering from a write error in a zone may have to wait for a
tag for some time before executing, when many zones are being read/written at
the same time. But as long as requests complete and tags are freed, error
recovery report zones will eventually proceed and execute. No deadlock.
Even if the drive stops responding, we will eventually get a timeout and drive
reset which will abort all commands, so even then, we will have forward
progress with error recovery.
This is for the current upstream code.
Now, in your case, with several write requests issued per zone and said write
requests potentially waiting in the requeue or dispatch lists, then a report
zone for error recovery may indeed block forever... So with your changes, a
solution for this is needed I think.
--
Damien Le Moal
Western Digital Research
next prev parent reply other threads:[~2024-11-25 4:19 UTC|newest]
Thread overview: 73+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 01/26] blk-zoned: Fix a reference count leak Bart Van Assche
2024-11-19 2:23 ` Damien Le Moal
2024-11-19 20:21 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 02/26] blk-zoned: Split disk_zone_wplugs_work() Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 03/26] blk-zoned: Split queue_zone_wplugs_show() Bart Van Assche
2024-11-19 2:25 ` Damien Le Moal
2024-11-19 0:27 ` [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed Bart Van Assche
2024-11-19 2:50 ` Damien Le Moal
2024-11-19 20:51 ` Bart Van Assche
2024-11-21 3:23 ` Damien Le Moal
2024-11-21 17:43 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes Bart Van Assche
2024-11-19 2:57 ` Damien Le Moal
2024-11-19 21:04 ` Bart Van Assche
2024-11-21 3:32 ` Damien Le Moal
2024-11-21 17:51 ` Bart Van Assche
2024-11-25 4:00 ` Damien Le Moal
2024-11-25 4:19 ` Damien Le Moal [this message]
2025-01-09 19:11 ` Bart Van Assche
2025-01-10 5:07 ` Damien Le Moal
2025-01-10 18:17 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes Bart Van Assche
2024-11-19 3:00 ` Damien Le Moal
2024-11-19 21:06 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests Bart Van Assche
2024-11-19 7:37 ` Damien Le Moal
2024-11-19 21:08 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 08/26] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 09/26] mq-deadline: Remove a local variable Bart Van Assche
2024-11-19 7:38 ` Damien Le Moal
2024-11-19 21:11 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work() Bart Van Assche
2024-11-19 7:39 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario Bart Van Assche
2024-11-19 7:40 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
2024-11-19 7:44 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 13/26] block: Support allocating from a specific software queue Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
2024-11-19 7:52 ` Damien Le Moal
2024-11-19 21:16 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 15/26] blk-zoned: Document the locking order Bart Van Assche
2024-11-19 7:52 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 16/26] blk-zoned: Document locking assumptions Bart Van Assche
2024-11-19 7:53 ` Damien Le Moal
2024-11-19 21:18 ` Bart Van Assche
2024-11-21 3:34 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path Bart Van Assche
2024-11-19 7:55 ` Damien Le Moal
2024-11-19 21:20 ` Bart Van Assche
2024-11-21 3:36 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust Bart Van Assche
2024-11-19 7:58 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 19/26] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 20/26] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 21/26] scsi: core: Retry unaligned " Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 22/26] scsi: sd: Increase retry count for " Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 23/26] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 24/26] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 25/26] scsi: scsi_debug: Skip host/bus reset settle delay Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
[not found] ` <37f95f44-ab1d-20db-e0c7-94946cb9d4eb@quicinc.com>
2024-11-22 18:20 ` Bart Van Assche
2024-11-23 0:34 ` Can Guo
2024-11-19 8:01 ` [PATCH v16 00/26] Improve write performance for zoned UFS devices Damien Le Moal
2024-11-19 19:08 ` Bart Van Assche
2024-11-21 3:20 ` Damien Le Moal
2024-11-21 18:00 ` Bart Van Assche
2024-11-25 3:59 ` Damien Le Moal
2025-01-09 19:02 ` Bart Van Assche
2025-01-10 5:10 ` Damien Le Moal
2024-11-19 12:25 ` Christoph Hellwig
2024-11-19 18:52 ` Bart Van Assche
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ec7b6eef-48e6-42f3-8e6f-54dfd3724393@kernel.org \
--to=dlemoal@kernel.org \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=hch@lst.de \
--cc=jaegeuk@kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox