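From: okaya@codeaurora.org
To: "Alex G." <mr.nuke.me@gmail.com>
Cc: Alex_Gagniuc@dellteam.com, linux-pci@vger.kernel.org,
    shyam_iyer@dell.com, linux-nvme@lists.infradead.org,
    Keith Busch <keith.busch@intel.com>,
    austin_bolen@dell.com, linux-pci-owner@vger.kernel.org
Subject: Re: AER: Malformed TLP recovery deadlock with NVMe drives
Date: Mon, 07 May 2018 23:45:57 +0100
Message-ID: <16bdb0febb842ad0980db9214c8076c5@codeaurora.org>
In-Reply-To: <1125ddf8-f342-3f8f-90ee-0aa94287360c@gmail.com>
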
On 2018-05-07 21:58, Alex G. wrote:
> On 05/07/2018 03:30 PM, okaya@codeaurora.org wrote:
>> On 2018-05-07 21:16, Alex G. wrote:
>>> On 05/07/2018 01:46 PM, okaya@codeaurora.org wrote:
>>>> On 2018-05-07 19:36, Alex G. wrote:
>>>>> Hi! Me again!
>>>>>
>>>>> I'm seeing what appears to be a deadlock in the AER recovery path. It
>>>>> seems that the device_lock() call in report_slot_reset() never returns.
>>>>> How we get there is interesting:
>>>>
>>>> Can you give this patch a try?
>>>>
>>> Oh! Patches so soon? Yay!
>>>
>>>> https://patchwork.kernel.org/patch/10351515/
>>>
>>> Thank you! I tried a few runs. There was one run where we didn't lock
>>> up, but the other runs all went like before.
>>>
>>> For comparison, the run that didn't deadlock looked like [2].
>>>
>>
>>
>> Sounds like there are multiple problems.
>
> If it were easy, somebody would have patched it by now ;)
Can you file a bugzilla, CC me, Keith and Bjorn, and attach all of your
logs? Let's debug this there.
>
>> With this patch, you shouldn't see link down and up interrupts during
>> reset, but I do see them in the log.
>
> You will see the messages from the link up/down events regardless of
> whether any action is actually taken.
>
>> Can you also share a fail case log with this patch and a diff of your
>> hacks, so that we know where the prints are coming from?
>
> Of course. An example of a failing case is in [3]; it is identical to the
> fail log without any patches. Although the prints have the function name,
> the diff is in [4].
>
> Alex
>
> [3] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1509.log
> [4] http://gtech.myftp.org/~mrnuke/nvme_logs/print_hacks.patch
>
>
>>> [2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log
>>>>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme