linux-fsdevel.vger.kernel.org archive mirror
From: Jamie Lokier <jamie@shareable.org>
To: Bryan Henderson <hbryan@us.ibm.com>
Cc: linux-fsdevel@vger.kernel.org, 谢纲 <xiegang112@gmail.com>
Subject: Re: ordered I/O with multipath
Date: Thu, 9 Apr 2009 21:00:15 +0100
Message-ID: <20090409200015.GB20334@shareable.org>
In-Reply-To: <OF52211E98.D8FE376C-ON88257593.0062B7D2-88257593.0065E033@us.ibm.com>

Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted.  In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
> 
> I've seen several RAID-5 systems, and they all went to great lengths to 
> ensure that interrupting a write to Sector A can't destroy Sector B.  It 
> isn't easy; it involves journalling.  But I've always taken it as an 
> absolute requirement.

How do you do a second layer of journalling (in addition to the
filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about something like where Sectors 1-5 are covered 
> by a single parity sector and the RAID system restarts between when it has 
> written Sector 1 and when it has written the new parity.  Now if you lose 
> Sector 2, you'll recover incorrect contents for it.
> 
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're 
> saying it does have this problem.

No, I'm assuming it has this problem because every description of
RAID-5 I've seen does not mention journalling or anything equivalent.
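
For illustration, the scenario quoted above can be sketched in a few
lines of Python.  This is a toy model under stated assumptions, not
Linux md code: the "sectors" are 4 bytes long, parity is a plain XOR
over five data sectors, and the helper name xor_blocks is made up for
this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Five data sectors covered by one parity sector (tiny 4-byte "sectors").
data = [bytes([i] * 4) for i in range(1, 6)]
parity = xor_blocks(data)

# Sector 2 was hardened long ago; remember what it should contain.
sector2_expected = data[1]

# Interrupted update: the new Sector 1 reaches its disk, but the restart
# happens before the matching parity write, so parity is now stale.
data[0] = b"\xff" * 4

# Later the disk holding Sector 2 is lost.  Rebuild it from the surviving
# data sectors plus the stale parity.
survivors = [d for i, d in enumerate(data) if i != 1]
sector2_rebuilt = xor_blocks(survivors + [parity])

write_hole = sector2_rebuilt != sector2_expected
print("Sector 2 rebuilt incorrectly:", write_hole)  # prints True
```

Sector 2 was never written, yet its reconstructed contents are wrong.
That is the sense in which the damage is wider than the sector being
written and not contiguous with it.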

> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling.  If individual sector
> > writes are atomic, this isn't an issue.  Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
> 
> It would make a lot more sense to make the RAID block device driver 
> present a block device that can't corrupt data upon something as simple as 
> a restart in the middle of a write to an unrelated sector than to make 
> filesystem drivers comprehend a block device that can.  Less work, more 
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do 
> some of the fancy storage things we'd like to do these days anyway, so 
> another approach would be to integrate the RAID-5 function in the 
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and BTRFS I guess.

This is why RAID ought to work better in the filesystem.
Two layers of journalling or equivalent does not sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it.  I'm making assumptions.

Other parts of Linux are a bit flaky on the issue of data integrity on
crashes though, and I/O barriers are not passed down through Linux
software RAID-5, so I'd be mighty surprised if it provides atomic writes.

> >If individual sector writes are atomic, this isn't an issue.
> 
> True, however: atomic is sufficient, but not necessary.  In the real 
> world, disk drive writes aren't atomic, and it's OK.  A journalling 
> filesystem can deal with a failed write wiping out the previous contents 
> of the subject sector.  It just can't deal with a failed write polluting 
> some unrelated previously hardened sector.

That's right.  But a failed write might corrupt previously
hardened sectors in these cases:

    - Disks with 4k sectors pretending to be 512 byte sectors.

    - RAIDs without journalling (or other equivalent) and no
      battery backup.

    - SSDs and other flash storage if their internal algorithms are stupid.
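
As a rough sketch of the first case, assume a drive with 4096-byte
physical sectors that exposes 512-byte logical sectors, and assume an
interrupted rewrite leaves the whole physical sector unreadable.  Both
the failure model and the function name interrupted_write are
assumptions made for this example; real drives may behave differently.

```python
LOGICAL = 512                        # sector size the drive advertises
PHYSICAL = 4096                      # sector size the media really uses
PER_PHYSICAL = PHYSICAL // LOGICAL   # 8 logical sectors per physical one


def interrupted_write(sectors, lba):
    """Model a failed write to logical sector `lba`.

    Assumed behaviour: the drive rewrites the whole physical sector, so an
    interruption makes every logical sector inside it unreadable (None).
    """
    first = (lba // PER_PHYSICAL) * PER_PHYSICAL
    for s in range(first, first + PER_PHYSICAL):
        sectors[s] = None


# Sixteen logical sectors, all previously hardened.
sectors = {lba: b"hardened" for lba in range(16)}
interrupted_write(sectors, lba=13)

# Sectors that were damaged although nobody was writing to them.
collateral = sorted(s for s, v in sectors.items() if v is None and s != 13)
print(collateral)  # [8, 9, 10, 11, 12, 14, 15]
```

Under this model a failed write to one 512-byte sector takes out seven
neighbours that the filesystem considered safely on disk.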

I've just noticed that a system crash is not the only way this type of
corruption can happen.

Does this argue for an additional parameter among the block device
hints, in addition to strip size and sector size:

   Radius of Corruption on Failed Write?

For hard disks, this is the sector size.  But for RAIDs and maybe some
flash storage, it might be larger.
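
A minimal sketch of how a filesystem might consume such a hint follows.
Everything here is hypothetical: BlockHints, corruption_radius,
blast_range and next_safe_offset are names invented for this example,
not an existing kernel or library interface.  It also assumes the
radius describes one contiguous, aligned region, which is a
simplification, since the RAID-5 case is not contiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockHints:
    """Hypothetical device hints; corruption_radius is the proposed addition."""
    sector_size: int        # bytes
    strip_size: int         # bytes, 0 if the device is not striped
    corruption_radius: int  # bytes a single failed write may damage


def blast_range(hints, offset, length):
    """Byte range [start, end) that a failed write at `offset` could damage."""
    r = hints.corruption_radius
    start = (offset // r) * r
    end = -(-(offset + length) // r) * r  # round up to a multiple of r
    return start, end


def next_safe_offset(hints, committed_end):
    """First offset where a new write cannot damage bytes before committed_end."""
    r = hints.corruption_radius
    return -(-committed_end // r) * r


plain_disk = BlockHints(sector_size=512, strip_size=0, corruption_radius=512)
emulated_4k = BlockHints(sector_size=512, strip_size=0, corruption_radius=4096)

# A 512-byte write at byte offset 1536:
print(blast_range(plain_disk, 1536, 512))   # (1536, 2048)
print(blast_range(emulated_4k, 1536, 512))  # (0, 4096)

# Journal data is hardened up to byte 2048; where may the next record start?
print(next_safe_offset(plain_disk, 2048))   # 2048
print(next_safe_offset(emulated_4k, 2048))  # 4096
```

With a radius equal to the sector size nothing changes for the
filesystem.  With a larger radius it would have to pad or align journal
records so that a new write never shares a radius-sized unit with data
it has already reported as committed.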

-- Jamie
