From: NeilBrown <neilb@suse.de>
To: Hans-Peter Jansen <hpj@urpla.net>
Cc: Linux RAID <linux-raid@vger.kernel.org>,
	Sebastian Riemer <sebastian.riemer@profitbricks.com>
Subject: Re: Persistent failures with simple md setup
Date: Tue, 5 Feb 2013 14:44:48 +1100
Message-ID: <20130205144448.2f40b306@notabene.brown>
In-Reply-To: <2286786.BnthJ2WIKW@xrated>

On Mon, 04 Feb 2013 21:43:29 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

> On Wednesday, 30 January 2013, 18:12:46, Hans-Peter Jansen wrote:
> > 
> > Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should
> > be in place. It might be an unfortunate combination of this issue and the
> > asynchronously applied updates, compounded by the *switching* behavior.
> > 
> > I started with regenerating the initrds now, and a first reboot succeeded so
> > far. Good.
> > 
> > Will ask my friend to reboot the system a dozen times tonight.
> 
> After a few reboots, the issue reappeared. I really believe now that, after
> running the md in degraded mode for some time and with the switching behavior,
> just re-adding the devices resulted in unsynced raid1 devices.
> 
> Next, my friend managed to create a near data disaster: I had explained to
> him how to re-add a device himself. He did so on Sunday with his home
> partition, and since no progress bar appeared in /proc/mdstat, he
> immediately repeated the command.
> 
> Neil, is it conceivable (due to a race or the like) that repeating the add
> (re-add) of a device could corrupt data? That home fs (xfs) went mad a few
> minutes later: firefox crashed and couldn't be restarted, kmail crashed, and
> so on (all the processes that write to ~). He decided to reboot, and that
> landed him in the emergency recovery console, because /home couldn't be
> mounted anymore.

There was a bug prior to 2.6.37 (fixed by commit 1a855a0606653d2) which
sounds vaguely related, but you seem to be running a 3.0.x kernel(?) so
shouldn't be affected by that.

Without logs of precisely what happened, it is very hard to guess.



> Today, I hammered the raid1 partitions with "check". During one run, this 
> appeared in syslog:
> 
> Feb  4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:26 zaphkiel kernel: [11165.652513]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
> Feb  4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
> Feb  4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
> Feb  4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
> Feb  4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Feb  4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
> Feb  4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
> Feb  4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:28 zaphkiel kernel: [11167.949275]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
> Feb  4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
> Feb  4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
> Feb  4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:30 zaphkiel kernel: [11169.898222]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
> Feb  4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
> Feb  4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
> Feb  4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:32 zaphkiel kernel: [11171.905226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
> Feb  4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
> Feb  4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
> Feb  4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:34 zaphkiel kernel: [11173.912226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
> Feb  4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
> Feb  4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
> Feb  4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:36 zaphkiel kernel: [11175.861186]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
> Feb  4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
> Feb  4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
> Feb  4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Feb  4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
> Feb  4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
> Feb  4 11:18:36 zaphkiel kernel: [11175.874130]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
> Feb  4 11:18:36 zaphkiel kernel: [11175.874145]         0c c2 ef 3f 
> Feb  4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
> Feb  4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
> Feb  4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
> Feb  4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
> Feb  4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.
> 
> This is a classic URE, isn't it? Interestingly, the raid1 check run
> nonetheless succeeded! (Not so good, is it?)

What?  Success not good?  :-)

md didn't report any errors.  Maybe it didn't see any.  Where is sector
214101823?
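
One way to approach that question, as a minimal sketch rather than a
definitive procedure: the log already carries enough to cross-check the
failing LBA. The Read(10) CDB gives the start sector and length of the
request, and the last four bytes of the sense dump give the failing sector.
The partition layout in the example is invented purely for illustration; the
real start and size of each partition can be read from
/sys/block/sdb/sdbN/start and /sys/block/sdb/sdbN/size (both in 512-byte
sectors). The helper names are hypothetical.

```python
def be_int(hex_bytes):
    """Interpret space-separated hex bytes as one big-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "big")


def parse_read10(cdb_hex):
    """Return (start_lba, sector_count) from a SCSI Read(10) CDB dump."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: LBA
    sector_count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: length
    return start_lba, sector_count


def find_partition(sector, partitions):
    """Return (name, offset_in_partition) for the partition holding sector.

    partitions is an iterable of (name, start_sector, size_in_sectors).
    Returns None if the sector lies outside every listed partition.
    """
    for name, start, size in partitions:
        if start <= sector < start + size:
            return name, sector - start
    return None


if __name__ == "__main__":
    # Values copied from the syslog excerpt quoted above.
    start, count = parse_read10("28 00 0c c2 ef 12 00 00 80 00")
    failing = be_int("0c c2 ef 3f")
    print("request:", start, "+", count, "sectors")
    print("failing sector:", failing, "offset in request:", failing - start)

    # ASSUMPTION: made-up layout, NOT the reporter's real partition table.
    example_layout = [
        ("sdb1", 2048, 1048576),
        ("sdb2", 1050624, 419430400),
    ]
    print(find_partition(failing, example_layout))
```

With the logged values, the request covers sectors 214101778 through
214101905 and the sense data points at 214101823, which agrees with the
end_request line. Mapping the offset within the partition to a sector of the
md array additionally depends on the array's data offset, which this sketch
does not attempt to account for.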



> Last question: since I had to massage the system anyway, I've updated mdadm
> from 3.2.2 to 3.2.6. I read that it can be dangerous to do so; what do I
> risk here?

Where did you read that?
If you find you need to re-create an array with "--create", a different mdadm
might give a different result, which wouldn't be what you want.  Otherwise it
should be safe.

NeilBrown

