* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-20 10:47 UTC (permalink / raw)
Hello NVMe developers,
The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
Here are photos of the console debug info, with the NVMe driver in the backtrace:
http://znu.io/dual8168hang.tar
Here is another instance of the hang, again with NVMe in the backtrace:
http://znu.io/IMG_0362.jpg
Here is the 'lspci -v' output:
45:00.0 Non-Volatile memory controller: Toshiba America Info Systems NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Toshiba America Info Systems Device 0001
Flags: bus master, fast devsel, latency 0, IRQ 38, NUMA node 0
Memory at 94200000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
Capabilities: [178] #19
Capabilities: [198] Latency Tolerance Reporting
Capabilities: [1a0] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
How can I help debug this?
Dave
________________________________________________________________________
Cross reference:
https://bugzilla.redhat.com/show_bug.cgi?id=1536480
* Hang when running LLVM+clang test suite
From: Keith Busch @ 2018-01-21 2:50 UTC (permalink / raw)
On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
> Hello NVMe developers,
>
> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>
> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>
> http://znu.io/dual8168hang.tar
>
> Here is another instance of the hang, again with NVMe in the backtrace:
>
> http://znu.io/IMG_0362.jpg
It looks like the scheduler is stuck or a task struct is corrupt. Off the top
of my head, I can't think of what nvme would have to do with that, though. It
just invokes the callback associated with a command and doesn't directly
manipulate any scheduler structs.
* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-21 13:49 UTC (permalink / raw)
> On Jan 20, 2018, at 21:50, Keith Busch <keith.busch@intel.com> wrote:
>
> On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
>> Hello NVMe developers,
>>
>> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>>
>> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>>
>> http://znu.io/dual8168hang.tar
>>
>> Here is another instance of the hang, again with NVMe in the backtrace:
>>
>> http://znu.io/IMG_0362.jpg
>
> It looks like the scheduler is stuck or a task struct is corrupt. Off the top
> of my head, I can't think of what nvme would have to do with that, though. It
> just invokes the callback associated with a command and doesn't directly
> manipulate any scheduler structs.
Hi Keith,
Thanks for looking at the backtraces. What other subsystems should I be looking at then?
Given that the LLVM+clang test suite is reliable when built/run in tmpfs, that implies that most of the kernel is reliable. I've also run the test suite reliably on an ext4 filesystem on a SATA SSD.
I've tried both xfs and ext4 on NVMe and they both crash, which implies that individual filesystems aren't the problem. Please note that the NVMe setup is simple: one partition and no LVM, RAID, bcache, etc.
What's left at this point? What other combinations or debug parameters should I test?
Thanks for any help you can give,
Dave
* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-21 14:13 UTC (permalink / raw)
> On Jan 21, 2018, at 08:49, David Zarzycki <dave@znu.io> wrote:
>
>
>
>> On Jan 20, 2018, at 21:50, Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
>>> Hello NVMe developers,
>>>
>>> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>>>
>>> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>>>
>>> http://znu.io/dual8168hang.tar
>>>
>>> Here is another instance of the hang, again with NVMe in the backtrace:
>>>
>>> http://znu.io/IMG_0362.jpg
>>
>> It looks like the scheduler is stuck or a task struct is corrupt. Off the top
>> of my head, I can't think of what nvme would have to do with that, though. It
>> just invokes the callback associated with a command and doesn't directly
>> manipulate any scheduler structs.
>
> Hi Keith,
>
> Thanks for looking at the backtraces. What other subsystems should I be looking at then?
>
> Given that the LLVM+clang test suite is reliable when built/run in tmpfs, that implies that most of the kernel is reliable. I've also run the test suite reliably on an ext4 filesystem on a SATA SSD.
>
> I've tried both xfs and ext4 on NVMe and they both crash, which implies that individual filesystems aren't the problem. Please note that the NVMe setup is simple: one partition and no LVM, RAID, bcache, etc.
>
> What's left at this point? What other combinations or debug parameters should I test?
To my surprise, I think I've narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable? (A rough command sketch follows the list.)
1) Create a file in /tmp (tmpfs with plenty of RAM and 2x the size needed by the test suite)
2) 'losetup' the file
3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
4) Create an ext4 filesystem on /dev/loop0 and mount it
5) Run stress test in the loopback filesystem
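For reference, a minimal sketch of those steps (the backing-file path, its size, the mount point, and the loop0 name below are illustrative placeholders, not necessarily the exact values from my setup):

    # 1) backing file on tmpfs; 64G is a placeholder size
    truncate -s 64G /tmp/llvm-test.img
    # 2) attach it to a loop device (prints the device name, e.g. /dev/loop0)
    losetup -f --show /tmp/llvm-test.img
    # 3) confirm which blk-mq scheduler the loop device is using
    cat /sys/block/loop0/queue/scheduler
    # 4) make an ext4 filesystem on the loop device and mount it
    mkfs.ext4 /dev/loop0
    mount /dev/loop0 /mnt/looptest
    # 5) build and run the LLVM+clang test suite under /mnt/looptest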
Thanks,
Dave
* Hang when running LLVM+clang test suite
From: Keith Busch @ 2018-01-22 18:37 UTC (permalink / raw)
On Sun, Jan 21, 2018 at 09:13:27AM -0500, David Zarzycki wrote:
> To my surprise, I think I've narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable?
>
> 1) Create a file in /tmp (tmpfs with plenty of RAM and 2x the size needed by the test suite)
> 2) 'losetup' the file
> 3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
> 4) Create an ext4 filesystem on /dev/loop0 and mount it
> 5) Run stress test in the loopback filesystem
The test looks fair to run. Does your error only happen if using
mq-deadline, or are there problems when using "none"?
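In case it's useful, switching the scheduler at run time should just be a sysfs write (as root; assuming the loop device is loop0):

    echo none > /sys/block/loop0/queue/scheduler
    echo mq-deadline > /sys/block/loop0/queue/scheduler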
* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-22 23:33 UTC (permalink / raw)
> On Jan 22, 2018, at 13:37, Keith Busch <keith.busch@intel.com> wrote:
>
> On Sun, Jan 21, 2018 at 09:13:27AM -0500, David Zarzycki wrote:
>> To my surprise, I think I've narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable?
>>
>> 1) Create a file in /tmp (tmpfs with plenty of RAM and 2x the size needed by the test suite)
>> 2) 'losetup' the file
>> 3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
>> 4) Create an ext4 filesystem on /dev/loop0 and mount it
>> 5) Run stress test in the loopback filesystem
>
> The test looks fair to run. Does your error only happen if using
> mq-deadline, or are there problems when using "none"?
Hi Keith,
Thanks, and yes it does. FWIW, I've eliminated a variable from the setup (nohz_full), and now it locks up hard. I'm going to pursue this further on the linux-block at vger.kernel.org mailing list.
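For reference, nohz_full here means the kernel boot parameter, e.g.:

    nohz_full=1-15

(The CPU list above is only an example, not necessarily the one I had been booting with.)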
Dave