From: Jamie Lokier <jamie@shareable.org>
To: Bryan Henderson <hbryan@us.ibm.com>
Cc: linux-fsdevel@vger.kernel.org, 谢纲 <xiegang112@gmail.com>
Subject: Re: ordered I/O with multipath
Date: Thu, 9 Apr 2009 21:00:15 +0100
Message-ID: <20090409200015.GB20334@shareable.org>
In-Reply-To: <OF52211E98.D8FE376C-ON88257593.0062B7D2-88257593.0065E033@us.ibm.com>

Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted. In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
>
> I've seen several RAID-5 systems, and they all went to great lengths to
> ensure that interrupting a write to Sector A can't destroy Sector B. It
> isn't easy; it involves journalling. But I've always taken it as an
> absolute requirement.

How do you do a second layer of journalling (in addition to the
filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about something like where Sectors 1-5 are covered
> by a single parity sector and the RAID system restarts between when it has
> written Sector 1 and when it has written the new parity. Now if you lose
> Sector 2, you'll recover incorrect contents for it.
>
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're
> saying it does have this problem.

No, I'm assuming it has this problem because every description of
RAID-5 I've seen does not mention journalling or anything equivalent.

> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling. If individual sector
> > writes are atomic, this isn't an issue. Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
>
> It would make a lot more sense to make the RAID block device driver
> present a block device that can't corrupt data upon something as simple as
> a restart in the middle of write to an unrelated sector than to make
> filesystem drivers comprehend a block device that can. Less work, more
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do
> some of the fancy storage things we'd like to do these days anyway, so
> another approach would be to integrate the RAID-5 function in the
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and BTRFS I guess.

This is why RAID ought to work better in the filesystem.
Two layers of journalling or equivalent does not sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it.  I'm making assumptions.

Other parts of Linux are a bit flaky on the issue of data integrity on
crashes though, and I/O barriers are not passed down through Linux
software RAID-5, so I'd be mighty surprised if it provides atomic writes.

> > If individual sector writes are atomic, this isn't an issue.
>
> True, however: atomic is sufficient, but not necessary. In the real
> world, disk drive writes aren't atomic, and it's OK. A journalling
> filesystem can deal with a failed write wiping out the previous contents
> of the subject sector. It just can't deal with a failed write polluting
> some unrelated previously hardened sector.

That's right.  But a failed write might corrupt previously
hardened sectors in these cases:

  - Disks with 4k sectors pretending to be 512 byte sectors.
  - RAIDs without journalling (or other equivalent) and no
    battery backup.
  - SSDs and other flash storage if their internal algorithms are stupid.

I've just noticed that a system crash is not the only way this type of
corruption can happen.

Does this argue for an additional parameter in the block device
hints, in addition to strip sizes and sector size:
Radius of Corruption on Failed Write?

For hard disks, this is the sector size.  But for RAIDs and maybe some
flash storage, it might be larger.
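
As a rough illustration of how a filesystem might use such a hint, the
Python sketch below computes which logical sectors a failed write could
damage. It assumes the damage is confined to aligned, contiguous units
of "radius" bytes, which fits the 4k-disk-emulating-512 case but not
RAID-5, where the affected sectors need not be contiguous. The function
name and parameters are hypothetical; no such block device hint is
known to exist.

```python
def sectors_at_risk(first_sector, count, sector_size=512, radius=4096):
    """Return (first, last) logical sectors a failed write may damage.

    Assumes corruption is confined to aligned units of `radius` bytes,
    and that `radius` is a multiple of `sector_size`.
    """
    if radius % sector_size:
        raise ValueError("radius must be a multiple of sector_size")
    per_unit = radius // sector_size          # logical sectors per unit
    first_unit = first_sector // per_unit
    last_unit = (first_sector + count - 1) // per_unit
    return first_unit * per_unit, (last_unit + 1) * per_unit - 1

# One 512-byte write to sector 3 on a disk with 4k physical sectors:
print(sectors_at_risk(3, 1))              # (0, 7)

# The same write when the radius equals the sector size:
print(sectors_at_risk(3, 1, radius=512))  # (3, 3)
```

A journalling filesystem could treat every sector in the returned range
as suspect after a failed write, rather than only the sector it wrote.
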
-- Jamie

Thread overview: 6+ messages
2009-04-08 4:59 ordered I/O with multipath 谢纲
2009-04-08 14:30 ` Jamie Lokier
2009-04-08 14:53 ` 谢纲
2009-04-09 18:32 ` Bryan Henderson
2009-04-09 20:00 ` Jamie Lokier [this message]
2009-04-10 16:42 ` Bryan Henderson