Re: filesystem corruption

From: Neil Brown <neilb@suse.de>
To: "Patrick H." <linux-raid@feystorm.net>
Cc: linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
Date: Mon, 3 Jan 2011 15:56:30 +1100
Message-ID: <20110103155630.565341d0@notabene.brown>
In-Reply-To: <4D214B5C.3010103@feystorm.net>

On Sun, 02 Jan 2011 21:06:52 -0700 "Patrick H." <linux-raid@feystorm.net>
wrote:

> That makes sense assuming that MD acknowledges the write once the data is
> written to the data disks but not necessarily the parity disk, which is
> what I gather you were saying is what happens. Is there any option that
> can change the behavior so that md won't ack the write until it's been
> committed to all disks (I'm guessing no since you didn't mention it)?
> Also does raid6 suffer this problem? Is it smart enough to use both
> parity disks when calculating replacement, or will it just use one?
> 

md/raid5 doesn't acknowledge the write until both the data and the parity
have been written.  But that doesn't make any difference.
If you schedule a number of interdependent writes (data and parity) and then
allow some to complete but not all, then you have inconsistency.
Recovery from losing a single device requires consistency of parity and data.
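
To illustrate the point above, here is a minimal sketch (not md code) of a
three-data-disk stripe with XOR parity, where blocks are single bytes for
brevity. The function names and byte values are made up for the example. A
data write lands, the matching parity write does not, and a different disk
is then lost:

```python
from functools import reduce


def xor_parity(blocks):
    """XOR parity over the blocks of one stripe (one byte per block here)."""
    return reduce(lambda a, b: a ^ b, blocks)


def simulate_write_hole():
    """Return (original_d0, rebuilt_d0) after an interrupted stripe update."""
    data = [0x11, 0x22, 0x33]
    parity = xor_parity(data)      # stripe is consistent at this point

    original_d0 = data[0]

    # Update block 1. The data write completes, but the crash happens
    # before the parity write, so `parity` is now stale.
    data[1] = 0x99

    # Disk 0 is lost afterwards. Rebuild it from the survivors and parity.
    rebuilt_d0 = parity ^ data[1] ^ data[2]
    return original_d0, rebuilt_d0


if __name__ == "__main__":
    original, rebuilt = simulate_write_hole()
    print(f"original d0 = {original:#04x}, rebuilt d0 = {rebuilt:#04x}")
```

Note that block 0 was never written during the interrupted update, yet its
reconstruction comes out wrong because the parity no longer matches the
surviving data.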

RAID6 suffers equally from this problem.  Even if it used both parity disks
to recover (which it doesn't) how would that help?  It would then have two
possible values for the data and no way to know which was correct, and every
possibility that both are incorrect.  This would happen if a single data
block was successfully written, but neither parity block was.
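
A rough sketch of that hypothetical case, again with one byte per block.
The P/Q construction below (XOR for P, a GF(2^8) sum with polynomial 0x11d
and generator 2 for Q) is an assumption chosen to resemble common RAID6
schemes; the function names and values are invented for the example and
this is not how md performs recovery:

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def q_parity(data):
    """Q = sum of g^i * D_i over GF(2^8), with generator g = 2."""
    q, coeff = 0, 1
    for block in data:
        q ^= gf_mul(coeff, block)
        coeff = gf_mul(coeff, 2)
    return q


def recover_d0_both_ways():
    """Return (true_d0, d0_from_p, d0_from_q) after an interrupted update."""
    d0, d1_old, d2 = 0x11, 0x22, 0x33
    p = d0 ^ d1_old ^ d2
    q = q_parity([d0, d1_old, d2])

    # Block 1 is rewritten, but neither P nor Q is updated before the crash.
    d1_new = 0x99

    # Disk 0 is then lost. Solve for D0 from each parity independently.
    from_p = p ^ d1_new ^ d2
    # D0 has coefficient g^0 = 1 in Q, so no division is needed.
    from_q = q ^ gf_mul(2, d1_new) ^ gf_mul(4, d2)
    return d0, from_p, from_q


if __name__ == "__main__":
    true_d0, from_p, from_q = recover_d0_both_ways()
    print(f"true={true_d0:#04x} via P={from_p:#04x} via Q={from_q:#04x}")
```

The two candidates disagree with each other and with the true value, and
nothing in the stripe says which parity, if either, to trust.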

The only way you can avoid this 'write hole' is by journalling in multiples
of whole stripes.  No current filesystems that I know of can do this as they
journal in blocks, and the maximum block size is less than the minimum stripe
size.  So you would need journalling integrated with md/raid, or you would
need a filesystem which was designed to understand this problem and write
whole stripes at a time, always to an area of the device which did not
contain live data.
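
The second option can be sketched as a toy model. Everything here is
hypothetical (the StripeStore class, its methods, the in-memory "disks")
and it assumes the final pointer update is atomic; it only shows why
writing whole stripes to an unused area leaves live data recoverable
after an interrupted write:

```python
from functools import reduce


class StripeStore:
    """Toy store: whole stripes only, never overwriting a live stripe."""

    def __init__(self, n_stripes):
        self.stripes = [None] * n_stripes  # physical stripes: (data, parity)
        self.live = {}                     # logical id -> physical index

    def _free_slot(self):
        used = set(self.live.values())
        for index in range(len(self.stripes)):
            if index not in used:
                return index
        raise RuntimeError("no free stripe")

    def write(self, logical_id, data, crash_before_commit=False):
        slot = self._free_slot()
        if crash_before_commit:
            # Torn write: data landed, parity did not. The slot is not
            # live, so the damage is confined to unused space.
            self.stripes[slot] = (list(data), None)
            return
        parity = reduce(lambda a, b: a ^ b, data)
        self.stripes[slot] = (list(data), parity)
        self.live[logical_id] = slot       # assumed-atomic commit step

    def read(self, logical_id, lost_disk=None):
        data, parity = self.stripes[self.live[logical_id]]
        result = list(data)
        if lost_disk is not None:
            survivors = [b for i, b in enumerate(data) if i != lost_disk]
            result[lost_disk] = reduce(lambda a, b: a ^ b, survivors, parity)
        return result


if __name__ == "__main__":
    store = StripeStore(n_stripes=4)
    store.write(0, [1, 2, 3])
    store.write(0, [4, 5, 6], crash_before_commit=True)
    print(store.read(0, lost_disk=1))
```

After the interrupted second write, the logical stripe still points at the
old, fully consistent physical stripe, so a degraded read reconstructs the
old contents correctly.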

NeilBrown
