* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
@ 2016-04-14  5:13 Sam McLeod
  2016-04-14 13:21 ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Sam McLeod @ 2016-04-14  5:13 UTC (permalink / raw)


We have 6 Supermicro servers, all of the same (or very similar) spec.

Since Kernel 4.4 / 4.5 we've had NVMe devices randomly dropping.
It does not relate to a particular server, disk, controller etc., and downgrading to kernel 4.1 avoids it.

With kernel 4.4 the servers would boot and a disk would randomly disappear.
With 4.5 the server boots with one of the disks missing every time.


```
[   66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
[   66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
[   66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5
```
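
For anyone checking their other machines for the same signature, here is a minimal sketch that scans saved kernel log text for timeouts on QID 0 (the admin queue). The regular expression is an assumption based only on the three lines quoted above; other kernel versions may word the message differently, and `find_admin_timeouts` is a hypothetical helper name.

```python
import re

# Matches the timeout line as it appears in the log quoted above.
# The exact wording is an assumption taken from that log only.
TIMEOUT_RE = re.compile(
    r"nvme (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout"
)


def find_admin_timeouts(log_text):
    """Return (pci_address, tag) pairs for timeouts reported on QID 0."""
    hits = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m and int(m.group("qid")) == 0:
            hits.append((m.group("bdf"), int(m.group("tag"))))
    return hits


sample = """\
[   66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
[   66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
[   66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5
"""

print(find_admin_timeouts(sample))
```

Feeding it the output of `dmesg` from each server should show which PCI addresses hit the admin-queue timeout during probe.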

We have tried:

- Swapping the disk
- Swapping the NVMe cables
- Swapping the NVMe controller (motherboard)
- Swapping the backplane
- Downgrading from Kernel 4.5.0 to 4.4.2 given recent changes to the storage subsystem
- Upgrading disk and motherboard firmwares
- Swapping the motherboard

So it's essentially a whole new server except that we haven't done a reinstall - why? Because I want to understand the problem and if reinstalling fixes it we'll never know why it's happening on this machine and not on our other 5.

- No SMART or nvme-cli errors are reported on the drive when it is functioning.
- If the drive is swapped into another bay it works fine, and whatever drive is then placed into the original bay eventually times out / fails.


- CentOS 7 (Latest patches installed)
- Kernel 4.5.0
- 2x Intel DC3600 NVMe (2.5" FF)
- Intel Corporation C610/X99 series chipset
- Full `lspci -tvv` output: https://gist.github.com/sammcj/8839c536b2cf6d4def8d2572eb1b4e8a
- Full kernel config: https://gist.github.com/sammcj/7d1e79775bf984424b92679d16c015c6


* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
  2016-04-14  5:13 I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts Sam McLeod
@ 2016-04-14 13:21 ` Keith Busch
  2016-04-14 14:08   ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2016-04-14 13:21 UTC (permalink / raw)


On Thu, Apr 14, 2016 at 03:13:22PM +1000, Sam McLeod wrote:
> We have 6 Supermicro servers all of the same (or very similar spec),
> 
> Since Kernel 4.4 / 4.5 we've had NVMe devices randomly dropping.
> It does not relate to a particular server, disk, controller etc... and downgrading to kernel 4.1.
> 
> With kernel 4.4 the servers would load and the disk randomly disappear.
> With 4.5 the server loads with one of the disks missing every time.
> 
> 
> ```
> [   66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
> [   66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
> [   66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5
> ```

Looks like more fallout from reducing the scope of admin queue completion
polling...

Jens:

Could we please apply the MSI-x fix commit to 4.6 instead of 4.7 so 4.6
isn't equally broken? Currently staged in for-next here:

  http://git.kernel.dk/?p=linux-block.git;a=commitdiff;h=788e15abbb9408c9399d7e3445ac9afb3b2fd7d6;hp=e0489487ec9cd79ee1fa0dc5d3789c08b0e51a2c

I'd also like to submit an appropriate port to stable if there are no objections.

Thanks,
Keith


* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
  2016-04-14 13:21 ` Keith Busch
@ 2016-04-14 14:08   ` Jens Axboe
  2016-04-14 15:23     ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2016-04-14 14:08 UTC (permalink / raw)


On 04/14/2016 07:21 AM, Keith Busch wrote:
> On Thu, Apr 14, 2016 at 03:13:22PM +1000, Sam McLeod wrote:
>> We have 6 Supermicro servers all of the same (or very similar spec),
>>
>> Since Kernel 4.4 / 4.5 we've had NVMe devices randomly dropping.
>> It does not relate to a particular server, disk, controller etc... and downgrading to kernel 4.1.
>>
>> With kernel 4.4 the servers would load and the disk randomly disappear.
>> With 4.5 the server loads with one of the disks missing every time.
>>
>>
>> ```
>> [   66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
>> [   66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
>> [   66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5
>> ```
>
> Looks like more fallout from reducing the scope of admin queue completion
> polling...
>
> Jens:
>
> Could we please apply the MSI-x fix commit to 4.6 instead of 4.7 so 4.6
> isn't equally broken? Currently staged in for-next here:
>
>    http://git.kernel.dk/?p=linux-block.git;a=commitdiff;h=788e15abbb9408c9399d7e3445ac9afb3b2fd7d6;hp=e0489487ec9cd79ee1fa0dc5d3789c08b0e51a2c
>
> I'd also like to submit an appropriate port to stable if no objections.

It feels awfully risky for the current series. Yes, we know this patch 
fixes the reported cases, but I'm worried that there are other 
controllers that will now fail because we don't probe with legacy 
interrupts. But the alternative is polling, which isn't great either and 
would (once again) cause the current and next series to diverge in weird 
and interesting ways.

Hmm

-- 
Jens Axboe


* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
  2016-04-14 14:08   ` Jens Axboe
@ 2016-04-14 15:23     ` Christoph Hellwig
  2016-04-14 20:03       ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2016-04-14 15:23 UTC (permalink / raw)


The code in the 4.7 tree still falls back to legacy IRQs; it just tries
MSI-X first, then MSI, then legacy.  That seems to be a common pattern in
drivers that use MSI-X, as I've just started a tree-wide audit of the
usage of the MSI/MSI-X APIs.
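
The ordering described above can be illustrated with a toy userspace model. This is not driver code and makes no claim about how the nvme driver implements it; `pick_interrupt_mode` and the `supported` sets are hypothetical names used only to show the "try the best mode first, then fall back" order.

```python
# Preference order as described in the message above: MSI-X, then MSI, then legacy.
PREFERENCE = ("MSI-X", "MSI", "legacy")


def pick_interrupt_mode(supported, order=PREFERENCE):
    """Return the first mode in `order` that the (hypothetical) device supports."""
    for mode in order:
        if mode in supported:
            return mode
    raise RuntimeError("no usable interrupt mode")


# A device that offers everything gets MSI-X.
print(pick_interrupt_mode({"MSI-X", "MSI", "legacy"}))
# A device without MSI-X falls back to MSI.
print(pick_interrupt_mode({"MSI", "legacy"}))
# A device with neither still works on legacy interrupts.
print(pick_interrupt_mode({"legacy"}))
```

The point of the model is that a controller which only works with legacy interrupts still ends up with a usable mode, which is the concern raised earlier in the thread.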


* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
  2016-04-14 15:23     ` Christoph Hellwig
@ 2016-04-14 20:03       ` Jens Axboe
  2016-04-14 23:38         ` Sam McLeod
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2016-04-14 20:03 UTC (permalink / raw)


On 04/14/2016 09:23 AM, Christoph Hellwig wrote:
> The code in the 4.7 tree still falls back to legacy irqs, it just tries
> MSI-X first, then MSI, then legacy.  Which seems a common pattern in
> MSI-X using drivers, as I've just started a tree wide audit for the
> usage of the MSI/MSI-X APIs.

Yeah, that seems sane. OK, let's queue it up for 4.6; it sucks to have
known broken behavior like the current out there.

-- 
Jens Axboe


* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
  2016-04-14 20:03       ` Jens Axboe
@ 2016-04-14 23:38         ` Sam McLeod
  2016-04-15 16:22           ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Sam McLeod @ 2016-04-14 23:38 UTC (permalink / raw)


Thank you to all involved. Is there anything we can set in the meantime to prevent this, or do you suggest rolling back to a much older kernel?

> On 15 Apr 2016, at 6:03 AM, Jens Axboe <axboe@fb.com> wrote:
> 
>> On 04/14/2016 09:23 AM, Christoph Hellwig wrote:
>> The code in the 4.7 tree still falls back to legacy irqs, it just tries
>> MSI-X first, then MSI, then legacy.  Which seems a common pattern in
>> MSI-X using drivers, as I've just started a tree wide audit for the
>> usage of the MSI/MSI-X APIs.
> 
> Yeah, that seems sane. OK, let's queue it up for 4.6; it sucks to have known broken behavior like the current out there.
> 
> -- 
> Jens Axboe
> 


* I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts
  2016-04-14 23:38         ` Sam McLeod
@ 2016-04-15 16:22           ` Keith Busch
  0 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2016-04-15 16:22 UTC (permalink / raw)


On Fri, Apr 15, 2016 at 09:38:45AM +1000, Sam McLeod wrote:
> Thank you all involved, is there anything we can set in the meantime to prevent this or do you suggest rolling back to a much older kernel?

Hi Sam,

Until we get the fix submitted to stable, you'll unfortunately have
to use kernels older than 4.4 or use 4.6 when the fix is applied to
mainline. There's no viable work-around without changing the driver.

If you want to apply a kernel patch to your own tree in the meantime,
a possible work-around patch is provided here:

  http://lists.infradead.org/pipermail/linux-nvme/2016-March/004277.html
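
To apply the version advice above across a fleet, a rough check along these lines may help. It is a sketch under two loudly stated assumptions: that 4.6 ships with the fix as planned in this thread, and that no stable or distro backport has been applied. It only compares the major.minor part of a release string, and `in_affected_range` is a hypothetical helper name.

```python
import re


def parse_version(release):
    """Extract (major, minor) from a release string such as '4.5.0-1.el7.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(m.group(1)), int(m.group(2))


def in_affected_range(release):
    """True for 4.4.x and 4.5.x.

    Assumes the fix lands in 4.6 and ignores any backports to stable
    or distribution kernels, so treat the result as a hint only.
    """
    return (4, 4) <= parse_version(release) < (4, 6)


# On a live system the release string could come from platform.release().
for rel in ("4.1.12", "4.4.2", "4.5.0", "4.6.0"):
    print(rel, in_affected_range(rel))
```

A `True` result only means the kernel falls in the range discussed here; whether a given machine actually hits the timeout still depends on the hardware.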
