public inbox for linux-nvme@lists.infradead.org
* nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-15  1:38 J. Hart
  2022-12-15  8:23 ` Christoph Hellwig
  2022-12-16 23:16 ` Keith Busch
  0 siblings, 2 replies; 27+ messages in thread
From: J. Hart @ 2022-12-15  1:38 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi

I am attempting to populate an NVMe device (nvme0n1) for use as the 
main system drive, using the following command:

rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu 
--exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu 
--exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu 
--exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu 
/mnt/root_new 2>&1 | tee root.log

The total transfer would be approximately 50 GB.  This is being done at 
run level 1, and only the kernel threads and the root shell are observed 
to be active.

The following log messages appear after a minute or so, and rsync hangs. 
The NVMe drive cannot then be unmounted without a reboot.

dmesg reports the following:

[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.031803] nvme nvme0: Abort status: 0x371
[Dec14 19:30] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000019] nvme nvme0: Removing after probe failure status: -19

I have also observed file system corruption on the source drive of the 
transfer.  I would not normally think this to be related, except that 
after the first time I observed it, I corrected the file content before 
each additional attempt, yet have seen the corruption again after every 
attempt.  The modification dates and file sizes did not change, but the 
file content on the source drive did.  I confirmed this using the "diff" 
utility, and again using an rsync dry run with the checksum test enabled.
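The check described above can be sketched as follows.  This is a 
self-contained illustration, not the original procedure: a temp file 
stands in for the real source files, and the "corruption" is simulated. 
The point is that a change of this kind (content altered, size and 
mtime unchanged) is invisible to a timestamp/size comparison and only 
shows up under a content hash or rsync's checksum mode:

```shell
# Detect a silent content change where size and mtime are unchanged.
# A temp file stands in for the real data; the change is simulated.
tmp=$(mktemp -d)
printf 'AAAA' > "$tmp/file"
before=$(sha256sum "$tmp/file" | cut -d' ' -f1)
mtime=$(stat -c %Y "$tmp/file")
# Simulate corruption: same length, then restore the original mtime.
printf 'AAAB' > "$tmp/file"
touch -d "@$mtime" "$tmp/file"
after=$(sha256sum "$tmp/file" | cut -d' ' -f1)
if [ "$before" != "$after" ]; then
    echo "content changed despite unchanged size/mtime"
fi
rm -rf "$tmp"
```

Against a real destination, something like "rsync -axHcn SRC DEST" 
performs the equivalent check: -c forces checksum comparison instead of 
the size/mtime quick check, and -n makes it a dry run.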

kernel/distro:

Linux DellXPS 6.1.0 #1 SMP Tue Dec 13 21:48:51 JST 2022 x86_64 GNU/Linux
custom distribution built entirely from source

nvme controller:

MZHOU M.2 NVME SSD-PCIe 4.0 X4 adaptor
Key-M NGFF PCI-E 3.0, 2.0 or 1.0 controller expansion cards
(2230 2242 2260 2280 22110 M.2 SSD)

02:00.0 Non-Volatile memory controller: Kingston Technologies Device 
500f (rev 03) (prog-if 02)
         Subsystem: Kingston Technologies Device 500f
         Flags: bus master, fast devsel, latency 0, IRQ 16
         Memory at ef9fc000 (64-bit, non-prefetchable) [size=16K]
         Capabilities: [40] Power Management version 3
         Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
         Capabilities: [70] Express Endpoint, MSI 00
         Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
         Kernel driver in use: nvme

nvme drive:

Model Number:                       KINGSTON SNVSE500G
Serial Number:                      50026B7685D8EE42
Firmware Version:                   S8542105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 685d8ee425
Local Time is:                      Tue Nov 29 20:31:21 2022 JST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero 
Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

CPU (quad core, cpu 0 shown, others the same):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
stepping	: 7
microcode	: 0x705
cpu MHz		: 1999.839
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni 
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 
lahf_lm pti tpr_shadow vnmi flexpriority vpid dtherm
vmx flags	: vnmi flexpriority tsc_offset vtpr vapic
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds 
swapgs itlb_multihit mmio_unknown
bogomips	: 5666.43
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:




^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 12:07 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 12:07 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

I have replaced the Kingston NV1-E 500 GB NVMe SSD with a Samsung 970 
EVO Plus 500 GB NVMe SSD.  I retained the same PCIe controller (Mzhou 
M.2 NVME SSD-PCIe 4.0 X4 adapter).  I then attempted the same rsync 
transfer at single user run level as I had done before with the 
Kingston drive.  The transfer apparently completed successfully and 
without incident.  No unusual log messages or corruption were observed.

J. Hart


* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 15:07 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 15:07 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

I have run fsck on the /dev/nvme0n1p3 file system after the rsync 
invocation referenced earlier.  The first run found errors, which fsck 
should have repaired if I understand correctly.  Repeating the fsck 
invocation immediately afterwards, I found errors again each time.
This was done using the replacement Samsung nvme ssd (Samsung 970 EVO 
Plus 500G).

Here is the output :

-bash-3.2# fsck -f -C /dev/nvme0n1p3
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity 

Pass 4: Checking reference counts
Pass 5: Checking group summary information 

Block bitmap differences:  -87163220 -87163384 -87187960 -122079572 
-122079736
Fix<y>? yes
 

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 699563/30523392 files (0.1% non-contiguous), 
13841458/122079744 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p3
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity 

Pass 4: Checking reference counts 

Pass 5: Checking group summary information 

Block bitmap differences:  -87163220 -87163384 -122079736 

Fix<y>? yes
 

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 699563/30523392 files (0.1% non-contiguous), 
13841458/122079744 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p3
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity 

Pass 4: Checking reference counts
Pass 5: Checking group summary information 

Block bitmap differences:  -87163220 -87163384 -122079572 -122079736 

Fix<y>? yes
 

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 699563/30523392 files (0.1% non-contiguous), 
13841458/122079744 blocks


I also tried loading a smaller partition on the Samsung card at 
/dev/nvme0n1p3.  The copy stopped with a "no space left on device" 
error, which should not have been possible as the source device is a 
32MB partition, and the destination partition on the nvme ssd is a 64MB 
partition.  The two files to be transferred were very small and could 
not have accounted for this as they totaled less than 5MB. I found file 
system damage on the nvme destination partition in this case as well. 
It also occurred repeatedly. I am still investigating this last case.
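One conventional thing to rule out when ENOSPC appears despite 
apparently sufficient space is inode exhaustion, since df's default 
block view can show free space while the inode table is full.  A 
minimal sketch; the mount point here is a hypothetical stand-in (/tmp) 
rather than the actual NVMe destination:

```shell
# ENOSPC can mean exhausted inodes rather than exhausted blocks.
# The path below is a stand-in for the real destination mount point.
mnt=/tmp
df -h "$mnt" | awk 'NR==2 {print "blocks used: " $5}'
df -i "$mnt" | awk 'NR==2 {print "inodes used: " $5}'
```

If the inode percentage is at 100% while blocks are mostly free, the 
"no space left on device" error is explained without any drive fault.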

In no instance did I note any otherwise unusual log messages or errors 
from the nvme driver.

I do not yet know if there has been any damage to any other filesystems, 
but I will check.

J. Hart


* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 16:14 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 16:14 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch


> I also tried loading a smaller partition on the Samsung card at /dev/nvme0n1p3.  The copy stopped with a "no space left on device" error, which should not have been possible as the source device is a 32MB partition, and the destination partition on the nvme ssd is a 64MB partition.  The two files to be transferred were very small and could not have accounted for this as they totaled less than 5MB. I found file system damage on the nvme destination partition in this case as well. It also occurred repeatedly. I am still investigating this last case.
> 
> In no instance did I note any otherwise unusual log messages or errors from the nvme driver.
> 
> I do not yet know if there has been any damage to any other filesystems, but I will check.

A correction is in order above:
That smaller partition was at /dev/nvme0n1p2, not /dev/nvme0n1p3.  The 
former is a 64MB partition, the latter is much larger.

I have checked for damage in all the filesystems on all the non-NVME 
block devices on the system, and have found none since installing the 
Samsung ssd device.

I am presently unable to safely use an NVMe SSD as the behavior appears 
to be unstable.  I can still do testing with the Samsung drive if 
needed, but the Kingston has been removed and will be returned on 
Monday local time (Japan Standard Time, as I'm in Kobe).

J. Hart


* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 21:57 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 21:57 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch


An additional note if I may:

Memory tests run overnight using memtest86plus-6.00 found no issues.
Please let me know if there is anything else you need from me.

With Thanks,

J. Hart


* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-18  6:20 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-18  6:20 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

The following may be interesting to note:

This afternoon I also ran fsck on the smaller partition /dev/nvme0n1p2 
(the 64MB one).  I made sure it was not mounted and repeatedly ran fsck 
on it with nothing else run between passes.  The results alternated 
between reporting no damage and reporting damage present.  Both 
outcomes occurred frequently.
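Alternating results across back-to-back runs with no intervening writes 
would be consistent with unstable reads rather than fixed on-disk 
damage.  One way to separate the two is to read the same region twice 
and compare hashes.  A sketch, using a scratch file as a stand-in for 
the block device so that it is self-contained:

```shell
# Read the same region twice and compare hashes.  On real hardware the
# input would be the unmounted block device (e.g. /dev/nvme0n1p2);
# a scratch file stands in here.
img=$(mktemp)
dd if=/dev/urandom of="$img" bs=1024 count=64 2>/dev/null
h1=$(dd if="$img" bs=1024 count=64 2>/dev/null | sha256sum | cut -d' ' -f1)
h2=$(dd if="$img" bs=1024 count=64 2>/dev/null | sha256sum | cut -d' ' -f1)
if [ "$h1" = "$h2" ]; then echo "reads stable"; else echo "reads differ"; fi
rm -f "$img"
```

For the fsck side of such repeated checks, "e2fsck -f -n" opens the 
filesystem read-only and answers no to all prompts, so repeated passes 
cannot themselves modify the filesystem between runs.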

J. Hart

> I have done an fsck check on the /dev/nvme0n1p3 file system after the rsync invocation referenced earlier.  In the first run I found errors which fsck should have repaired if I understand correctly.  In repeating the fsck invocation immediately afterwards, I found errors again each time.
> This was done using the replacement Samsung nvme ssd (Samsung 970 EVO Plus 500G).



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-18 12:08 J. Hart
  2022-12-19 14:45 ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-18 12:08 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

Here are some consecutive fsck runs done on /dev/nvme0n1p2:

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -5716
Fix<y>? yes
 

/dev/nvme0n1p2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -5716
Fix<y>? yes

/dev/nvme0n1p2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 


========================================================
There is a very similar partition (same contents) on a SATA drive on the 
system, and this does not happen there when I test it.

J. Hart




end of thread, other threads: [~2023-01-18 10:27 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-15  1:38 nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed J. Hart
2022-12-15  8:23 ` Christoph Hellwig
2022-12-15  9:07   ` J. Hart
2022-12-15  9:09     ` Christoph Hellwig
2022-12-15  9:15       ` J. Hart
2022-12-15 13:33       ` J. Hart
2022-12-15 17:34         ` Keith Busch
2022-12-15 22:30           ` J. Hart
2022-12-16  6:39             ` Christoph Hellwig
2022-12-16 19:08               ` Keith Busch
2023-01-18 10:27             ` Mark Ruijter
2022-12-16 23:16 ` Keith Busch
2022-12-17  1:28   ` J. Hart
2022-12-19 14:41     ` Keith Busch
2022-12-20  1:10       ` J. Hart
2022-12-20 16:56         ` Keith Busch
2022-12-21  7:50           ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2022-12-17 12:07 J. Hart
2022-12-17 15:07 J. Hart
2022-12-17 16:14 J. Hart
2022-12-17 21:57 J. Hart
2022-12-18  6:20 J. Hart
2022-12-18 12:08 J. Hart
2022-12-19 14:45 ` Keith Busch
2022-12-19 23:40   ` J. Hart
2022-12-20 18:10     ` Keith Busch
2022-12-20 14:04   ` J. Hart

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox