public inbox for linux-block@vger.kernel.org
From: Jens Axboe <axboe@kernel.dk>
To: David Zarzycki <dave@znu.io>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: Hard LOCKUP on 4.15-rc9 + 'blkmq/for-next' branch
Date: Mon, 22 Jan 2018 18:20:30 -0700	[thread overview]
Message-ID: <6025644f-e147-903a-e836-a2f53d3a61e9@kernel.dk> (raw)
In-Reply-To: <7E93BF17-8476-4A93-BD27-D2E22E5DAB00@znu.io>

On 1/22/18 6:05 PM, David Zarzycki wrote:
> 
> 
>> On Jan 22, 2018, at 18:34, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/22/18 4:31 PM, David Zarzycki wrote:
>>> Hello,
>>>
>>> I previously reported a hang when building LLVM+clang on a block
>>> multi-queue device (NVMe _or_ loopback onto tmpfs with the ’none’
>>> scheduler).
>>>
>>> I’ve since updated the kernel to 4.15-rc9, merged the
>>> ‘blkmq/for-next’ branch, disabled the nohz_full parameter (which was
>>> used for testing), and tried again. Both NVMe and loopback now lock
>>> up hard (ext4, if it matters). Here are the backtraces:
>>>
>>> NVMe:      http://znu.io/IMG_0366.jpg
>>> Loopback:  http://znu.io/IMG_0367.jpg
>>
>> I tried to reproduce this today using the exact recipe that you provide,
>> but it ran fine for hours. Similar setup, nvme on a dual socket box
>> with 48 threads.
> 
> Hi Jens,
> 
> Thanks for the quick reply and thanks for trying to reproduce this.
> I’m not sure if this makes a difference, but this dual Skylake machine
> has 96 threads, not 48 threads. Also, just to be clear, NVMe doesn’t
> seem to matter. I hit this bug with a tmpfs loopback device set up
> like so:
>
> dd if=/dev/zero bs=1024k count=10000 of=/tmp/loopdisk
> losetup /dev/loop0 /tmp/loopdisk
> echo none > /sys/block/loop0/queue/scheduler
> mkfs -t ext4 -L loopy /dev/loop0
> mount /dev/loop0 /l
> ### build LLVM+clang in /l
> ### run 'ninja check-all' in a loop in /l
> 
> (No swap is setup because the machine has 192 GiB of RAM.)

The 48 vs 96 threads is probably not that significant. Just to be
clear, what you can reproduce on tmpfs loopback is something else; the
two don't look related apart from the fact that they are both lockups
off the IO completion path.

>>> What should I try next to help debug this?
>>
>> This one looks different than the other one. Are you sure your hw is
>> sane?
>
> I can build LLVM+clang in /tmp (tmpfs) reliably, which suggests that
> the fundamental hardware is sane. It’s only when the software
> multi-queue layer gets involved that I see quick crashes/hangs.
> 
> As for the different backtraces, that's probably because I removed
> nohz_full from the kernel boot parameters.

Hardware issues can manifest themselves in mysterious ways. It might
very well be a software bug, but it'd be the first one of its kind that
I've seen reported, which does make me a little skeptical; it might
just be the canary in this case.

>> I'd probably try and enable lockdep debugging etc and see if you
>> catch anything.
> 
> Thanks. I turned on lockdep plus other lock debugging. Here is the
> resulting backtrace:
> 
> http://znu.io/IMG_0368.jpg
> 
> Here is the resulting backtrace with transparent huge pages disabled:
> 
> http://znu.io/IMG_0369.jpg
> 
> Here is the resulting backtrace with transparent huge pages disabled
> AND with systemd-coredumps disabled too:
> 
> http://znu.io/IMG_0370.jpg

All of these are off the blk-wbt completion path. I suggested earlier
that you try disabling CONFIG_BLK_WBT to see if the lockups go away, or
at least whether the pattern changes.
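For reference, one way to check for and rebuild without writeback
throttling (a sketch; assumes a kernel source tree and the in-tree
scripts/config helper, and that the Kconfig symbols carry the names
they had in this era):

```shell
# Check whether the running kernel has writeback throttling built in.
grep CONFIG_BLK_WBT /boot/config-"$(uname -r)"

# From the kernel source tree: turn the option off and rebuild.
# scripts/config edits .config in place; "make olddefconfig" then
# resolves any options that depended on the ones we disabled.
scripts/config --disable BLK_WBT
scripts/config --disable BLK_WBT_MQ
make olddefconfig
```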

Lockdep didn't catch anything. Maybe try some of the other debugging
features, like page poisoning, memory allocation debugging, or SLUB
debugging on by default.
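Those debug features can be enabled at build time and, for some of
them, switched on from the kernel command line (a sketch; option and
parameter names as documented for kernels of this vintage):

```shell
# Build-time options (kernel .config):
#   CONFIG_PAGE_POISONING=y   - poison freed pages to catch use-after-free
#   CONFIG_DEBUG_PAGEALLOC=y  - unmap freed pages so stray accesses fault
#   CONFIG_SLUB_DEBUG=y       - compile in SLUB consistency checks
#
# Boot parameters to turn the checks on without a default-on build:
#   page_poison=1     enable page poisoning
#   slub_debug=FZPU   full SLUB debugging for all caches (sanity checks,
#                     red zoning, object poisoning, user tracking)
```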

> I’m open to trying anything at this point. Thanks for helping,

I'd try other types of stress testing. Has the machine otherwise been
stable, or is it a new box?

-- 
Jens Axboe


Thread overview: 9+ messages
2018-01-22 23:31 Hard LOCKUP on 4.15-rc9 + 'blkmq/for-next' branch David Zarzycki
2018-01-22 23:34 ` Jens Axboe
2018-01-23  1:05   ` David Zarzycki
2018-01-23  1:20     ` Jens Axboe [this message]
2018-01-23 13:48       ` David Zarzycki
2018-01-23 15:34         ` Jens Axboe
2018-01-23 17:54           ` David Zarzycki
2018-01-23 18:00             ` Jens Axboe
2018-01-23 18:12               ` David Zarzycki
