Re: want-replacement got stuck?


From: NeilBrown <neilb@suse.de>
To: George Spelvin <linux@horizon.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: want-replacement got stuck?
Date: Thu, 22 Nov 2012 13:15:45 +1100
Message-ID: <20121122131545.5f31143f@notabene.brown>
In-Reply-To: <20121121163300.30697.qmail@science.horizon.com>

On 21 Nov 2012 11:33:00 -0500 "George Spelvin" <linux@horizon.com> wrote:

> Just to follow up to that earlier complaint, ext4 is now noticing some errors:
> 
> Nov 21 06:21:53 science kernel: EXT4-fs error (device md5): ext4_find_entry:1234: inode #5881516: comm rsync: checksumming directory block 0
> Nov 21 07:57:03 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 4206: bad block bitmap checksum
> Nov 21 08:41:37 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 3960: bad block bitmap checksum
> Nov 21 08:45:18 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 4737: bad block bitmap checksum
> Nov 21 08:50:16 science kernel: EXT4-fs error (device md5): ext4_mb_generate_buddy:741: group 4206, 5621 clusters in bitmap, 6888 in gd
> Nov 21 08:50:16 science kernel: JBD2: Spotted dirty metadata buffer (dev = md5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> Nov 21 15:50:29 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm python: bg 4138: bad block bitmap checksum
> Nov 21 16:21:00 science kernel: UDP: bad checksum. From 187.194.52.187:65535 to 71.41.210.146:6881 ulen 70
> 
> I also experienced transient corruption of the last few K of my incoming mailbox.  (I.e. the last
> couple of messages were overwritten with some other text file.  This morning, it's fine.)
> 
> Something is definitely wonky here...  I'm leaving it in the "stuck" state for a while
> in case there's useful debugging info to be extracted, but I'm getting very alarmed by these
> messages and want to reboot soon.

Yes.... this is a real worry.  Fortunately I know what is causing it.

The code for writing to a RAID10 naively assumes that if the 'main' device in
a slot is faulty, then there isn't any replacement device to write to either.

This is normally the case, as a faulty device will be promptly removed - or it
should be at least.  As you've already discovered, sometimes it isn't prompt.

But even if it were, there could be races so that the main device fails just
as we look at it, and then the replacement couldn't possibly have been moved
down yet.
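
To make the failure mode concrete, here is a minimal toy model of the
write-target choice described above. It is not the md driver's actual code:
the Dev and Slot types, both function names and the device names "main" and
"repl" are hypothetical, chosen only for illustration. The first function
mimics the naive assumption (a faulty main device means the whole slot is
skipped); the second checks the main device and the replacement independently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dev:
    """Hypothetical stand-in for a member device."""
    name: str
    faulty: bool = False


@dataclass
class Slot:
    """One RAID10 slot: a main device plus an optional replacement."""
    main: Optional[Dev]
    replacement: Optional[Dev] = None


def write_targets_naive(slot: Slot) -> List[str]:
    """Mimic the assumption: if the main device is faulty,
    there is no replacement to write to either."""
    if slot.main is None or slot.main.faulty:
        return []  # the replacement is never even looked at
    targets = [slot.main.name]
    if slot.replacement is not None and not slot.replacement.faulty:
        targets.append(slot.replacement.name)
    return targets


def write_targets_checked(slot: Slot) -> List[str]:
    """Consider the main device and the replacement independently."""
    return [dev.name
            for dev in (slot.main, slot.replacement)
            if dev is not None and not dev.faulty]


# Main device marked faulty but not yet removed; replacement is healthy.
slot3 = Slot(main=Dev("main", faulty=True), replacement=Dev("repl"))

print(write_targets_naive(slot3))    # []
print(write_targets_checked(slot3))  # ['repl']
```

With the naive choice, a slot in this state receives no writes at all, which
matches the stale data described for slot 3.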

Meanwhile you have a corrupted filesystem.  Sorry.
The nature of the corruption is that, since the replacement finished, no writes
have gone to slot-3 at all.  So if md ever decides to read from slot 3 it
will get stale data.

I suggest you fail sdd2, reboot, make sure only sda2, sdb2 and sde2 are in the
array, run fsck, and then if it seems happy enough, add sdc2 and/or sdd2 back
in so they rebuild completely.
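
The steps above could be written down as a dry-run plan along these lines.
This is only a sketch: the email gives no exact commands, so the array path
/dev/md5 (inferred from "device md5" in the quoted log), the /dev/ prefixes,
the choice of e2fsck and the recovery_plan helper are all assumptions. The
script only prints commands and runs nothing; whether a plain --add results
in a complete rebuild on a given setup is not confirmed here.

```python
import shlex
from typing import List


def recovery_plan(array: str, fail: str, readd: List[str]) -> List[str]:
    """Return the suggested steps as command strings (nothing is executed)."""
    steps = [
        # Mark the device as faulty in the running array.
        shlex.join(["mdadm", array, "--fail", fail]),
        # Restart the machine.
        "reboot",
        # Inspect membership by hand: only the devices that should
        # remain are expected to be listed as active.
        shlex.join(["mdadm", "--detail", array]),
        # Forced filesystem check; assumes the filesystem is unmounted.
        shlex.join(["e2fsck", "-f", array]),
    ]
    # Only if fsck seems happy enough: add the devices back so they rebuild.
    steps += [shlex.join(["mdadm", array, "--add", dev]) for dev in readd]
    return steps


if __name__ == "__main__":
    for line in recovery_plan("/dev/md5", "/dev/sdd2",
                              ["/dev/sdc2", "/dev/sdd2"]):
        print(line)
```

Review each printed line before running it by hand; the fsck result decides
whether the final --add steps should happen at all.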

Thanks for helping to make md better by risking your data :-)

NeilBrown
