* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-20 10:47 UTC (permalink / raw)
Hello NVMe developers,
The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
Here are photos of the console debug info, with the NVMe driver in the backtrace:
http://znu.io/dual8168hang.tar
Here is another instance of the hang, again with NVMe in the backtrace:
http://znu.io/IMG_0362.jpg
Here is the 'lspci -v' output:
45:00.0 Non-Volatile memory controller: Toshiba America Info Systems NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Toshiba America Info Systems Device 0001
Flags: bus master, fast devsel, latency 0, IRQ 38, NUMA node 0
Memory at 94200000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
Capabilities: [178] #19
Capabilities: [198] Latency Tolerance Reporting
Capabilities: [1a0] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
How can I help debug this?
Dave
________________________________________________________________________
Cross reference:
https://bugzilla.redhat.com/show_bug.cgi?id=1536480
* Hang when running LLVM+clang test suite
From: Keith Busch @ 2018-01-21 2:50 UTC (permalink / raw)
On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
> Hello NVMe developers,
>
> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>
> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>
> http://znu.io/dual8168hang.tar
>
> Here is another instance of the hang, again with NVMe in the backtrace:
>
> http://znu.io/IMG_0362.jpg
It looks like the scheduler is stuck or a task struct is corrupt. Off the top
of my head, I can't think of what nvme would have to do with that, though. It
just invokes the callback associated with a command and doesn't directly
manipulate any scheduler structs.
* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-21 13:49 UTC (permalink / raw)
> On Jan 20, 2018, at 21:50, Keith Busch <keith.busch@intel.com> wrote:
>
> On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
>> Hello NVMe developers,
>>
>> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>>
>> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>>
>> http://znu.io/dual8168hang.tar
>>
>> Here is another instance of the hang, again with NVMe in the backtrace:
>>
>> http://znu.io/IMG_0362.jpg
>
> It looks like the scheduler is stuck or a task struct is corrupt. Off the top
> of my head, I can't think of what nvme would have to do with that, though. It
> just invokes the callback associated with a command and doesn't directly
> manipulate any scheduler structs.
Hi Keith,
Thanks for looking at the backtraces. What other subsystems should I be looking at then?
Given that the LLVM+clang test suite is reliable when built/run in tmpfs, that implies that most of the kernel is reliable. I've also run the test suite reliably on an ext4 filesystem on a SATA SSD.
I've tried both xfs and ext4 on NVMe and they both crash, which implies that individual filesystems aren't the problem. Please note that the NVMe setup is simple: one partition and no LVM, RAID, bcache, etc.
What's left at this point? What other combinations or debug parameters should I test?
Thanks for any help you can give,
Dave
* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-21 14:13 UTC (permalink / raw)
> On Jan 21, 2018, at 08:49, David Zarzycki <dave@znu.io> wrote:
>
>
>
>> On Jan 20, 2018, at 21:50, Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
>>> Hello NVMe developers,
>>>
>>> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don't see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>>>
>>> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>>>
>>> http://znu.io/dual8168hang.tar
>>>
>>> Here is another instance of the hang, again with NVMe in the backtrace:
>>>
>>> http://znu.io/IMG_0362.jpg
>>
>> It looks like the scheduler is stuck or a task struct is corrupt. Off the top
>> of my head, I can't think of what nvme would have to do with that, though. It
>> just invokes the callback associated with a command and doesn't directly
>> manipulate any scheduler structs.
>
> Hi Keith,
>
> Thanks for looking at the backtraces. What other subsystems should I be looking at then?
>
> Given that the LLVM+clang test suite is reliable when built/run in tmpfs, that implies that most of the kernel is reliable. I've also run the test suite reliably on an ext4 filesystem on a SATA SSD.
>
> I've tried both xfs and ext4 on NVMe and they both crash, which implies that individual filesystems aren't the problem. Please note that the NVMe setup is simple: one partition and no LVM, RAID, bcache, etc.
>
> What's left at this point? What other combinations or debug parameters should I test?
To my surprise, I think I've narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable? (A rough command sketch follows the list.)
1) Create a file in /tmp (tmpfs with plenty of RAM and 2x the size needed by the test suite)
2) 'losetup' the file
3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
4) Create an ext4 filesystem on /dev/loop0 and mount it
5) Run stress test in the loopback filesystem
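For reference, a minimal sketch of those steps (the backing-file path, its size, the mount point, and the loop0 name below are illustrative placeholders, not necessarily the exact values from my setup):

    # 1) backing file on tmpfs; 64G is a placeholder size
    truncate -s 64G /tmp/llvm-test.img
    # 2) attach it to a loop device (prints the device name, e.g. /dev/loop0)
    losetup -f --show /tmp/llvm-test.img
    # 3) confirm which blk-mq scheduler the loop device is using
    cat /sys/block/loop0/queue/scheduler
    # 4) make an ext4 filesystem on the loop device and mount it
    mkfs.ext4 /dev/loop0
    mount /dev/loop0 /mnt/looptest
    # 5) build and run the LLVM+clang test suite under /mnt/looptest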
Thanks,
Dave
* Hang when running LLVM+clang test suite
From: Keith Busch @ 2018-01-22 18:37 UTC (permalink / raw)
On Sun, Jan 21, 2018 at 09:13:27AM -0500, David Zarzycki wrote:
> To my surprise, I think I've narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable?
>
> 1) Create a file in /tmp (tmpfs with plenty of RAM and 2x the size needed by the test suite)
> 2) 'losetup' the file
> 3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
> 4) Create an ext4 filesystem on /dev/loop0 and mount it
> 5) Run stress test in the loopback filesystem
The test looks fair to run. Does your error only happen if using
mq-deadline, or are there problems when using "none"?
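In case it's useful, switching the scheduler at run time should just be a sysfs write (as root; assuming the loop device is loop0):

    echo none > /sys/block/loop0/queue/scheduler
    echo mq-deadline > /sys/block/loop0/queue/scheduler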
* Hang when running LLVM+clang test suite
From: David Zarzycki @ 2018-01-22 23:33 UTC (permalink / raw)
> On Jan 22, 2018, at 13:37, Keith Busch <keith.busch@intel.com> wrote:
>
> On Sun, Jan 21, 2018 at 09:13:27AM -0500, David Zarzycki wrote:
>> To my surprise, I think I've narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable?
>>
>> 1) Create a file in /tmp (tmpfs with plenty of RAM and 2x the size needed by the test suite)
>> 2) 'losetup' the file
>> 3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
>> 4) Create an ext4 filesystem on /dev/loop0 and mount it
>> 5) Run stress test in the loopback filesystem
>
> The test looks fair to run. Does your error only happen if using
> mq-deadline, or are there problems when using "none"?
Hi Keith,
Thanks, and yes it does. FWIW, I've eliminated a variable from the setup (nohz_full), and now it locks up hard. I'm going to pursue this further on the linux-block at vger.kernel.org mailing list.
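For reference, nohz_full here means the kernel boot parameter, e.g.:

    nohz_full=1-15

(The CPU list above is only an example, not necessarily the one I had been booting with.)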
Dave