linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.com>
To: RQM <rqm@protonmail.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Troubleshooting "Buffer I/O error" on reading md device
Date: Tue, 02 Jan 2018 15:28:42 +1100	[thread overview]
Message-ID: <87373og9z9.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <1z_MZ4Xqld_IRMUbGJE66v2VUhXkBhlHnWJEfLASWNcv5s3Wo3A1YeuQBJBuksxJtFPpmsPbg1_F8PC3Sj4HrzL6Go3aIanVihzcC-4ZHEQ=@protonmail.com>

[-- Attachment #1: Type: text/plain, Size: 3374 bytes --]

On Mon, Jan 01 2018, RQM wrote:

> Hello everyone,
>
> I hope this list is the right place to ask the following:
>
> I've got a 5-disk RAID-5 array that's been built by a QNAP NAS device, which has recently failed (I suspect a faulty SATA controller or backplane).
> I migrated the disks to a desktop computer that runs Debian stretch (kernel 4.9.65-3+deb9u1 amd64) and mdadm version 3.4. Although the array can be assembled, I encountered the following error in my dmesg output ([1], recorded directly after a recent reboot and fsck attempt) when running fsck:
>
> Buffer I/O error on dev md0, logical block 1598030208, async page read
>
> I can reliably reproduce that error by trying to read from the md0 device. It's always the same block, also across reboots.
>
> I have suspected that possibly, one of the drives involved is faulty. Although smart errors have been logged [2], the errors are not recent enough to correlate with the fsck run. Also, I had sha1sum complete without error on every one of the individual disk devices /dev/sd[b-f], so reading from the drives does not provoke an error.
>
> Finally, I tried scrubbing the array by writing repair to md/sync_action. The process completed without any output to dmesg or signs of trouble in /proc/mdstat. However, reading from the array still fails at the same block as above, 1598030208.
>
> Here's the output of mdadm --detail /dev/md0: [3]
>
> I assume the md driver would know what exactly the problem is, but I don't know where to look to find that information. How can I proceed troubleshooting this issue?
>
> FYI, I had posted this on serverfault [4] previously, but unfortunately didn't arrive at a conclusion.
>
> Thank you very much in advance!
>
> [1] https://paste.ubuntu.com/26303735/
> [2] https://paste.ubuntu.com/26303737/
> [3] https://paste.ubuntu.com/26303754/
> [4] https://serverfault.com/questions/889687/troubleshooting-buffer-i-o-error-on-software-raid-md-device

This is truly weird.  I'd even go so far as to say that it cannot
possibly happen (but I've been wrong before).

Step one is confirm that it is easy to reproduce.
Does
  dd if=/dev/md0 bs=4K skip=1598030208 count=1 of=/dev/null

trigger the message reliably?
To check that "4K" is the correct blocksize, run
  blockdev --getbsz /dev/md0

use whatever number if gives as 'bs='.

If you cannot reproduce like that, try a larger count and then a smaller
skip with a large count.

Once you can reproduce with minimal IO, do
  echo file:raid5.c +p > /sys/kernel/debug/dynamic_debug/control
  # repeat experiment
  echo file:raid5.c -p > /sys/kernel/debug/dynamic_debug/control

and report the messages that appear in 'dmesg'.
Also report "mdadm -E" of each member device, and kernel version (though
I see that is in the serverfault report :  4.9.30-2+deb9u5).

Then run
  blktrace /dev/md0 /dev/sd[acdef]
in one window while reproducing the error again in another window.
Then interrupt the blktrace.  This will produce several blocktrace*
files.  create a tar.gz of these and put them somewhere that I can get
them - hopefully they won't be too big.

With all this information, I can poke around and will hopefully be able
to explain if fine detail exactly why this cannot possible happen
(unless it turns out that I'm wrong again).

Thanks,
NeilBrown

  

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

  parent reply	other threads:[~2018-01-02  4:28 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-02  2:46 Troubleshooting "Buffer I/O error" on reading md device RQM
2018-01-02  3:13 ` Reindl Harald
2018-01-02  4:28 ` NeilBrown [this message]
2018-01-02 10:40   ` RQM
2018-01-02 21:27     ` NeilBrown
2018-01-02 22:30       ` Roger Heflin
2018-01-04 14:45       ` RQM
2018-01-05  1:05         ` NeilBrown
2018-01-05 12:55           ` RQM
2018-01-13 12:18             ` RQM
2018-02-02  1:55               ` NeilBrown
2022-11-01 23:49               ` Darshaka Pathirana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87373og9z9.fsf@notabene.neil.brown.name \
    --to=neilb@suse.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=rqm@protonmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).