From: Jamie Lokier <jamie@shareable.org>
To: Bryan Henderson <hbryan@us.ibm.com>
Cc: linux-fsdevel@vger.kernel.org, 谢纲 <xiegang112@gmail.com>
Subject: Re: ordered I/O with multipath
Date: Thu, 9 Apr 2009 21:00:15 +0100
Message-ID: <20090409200015.GB20334@shareable.org>
In-Reply-To: <OF52211E98.D8FE376C-ON88257593.0062B7D2-88257593.0065E033@us.ibm.com>

Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted. In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
>
> I've seen several RAID-5 systems, and they all went to great lengths to
> ensure that interrupting a write to Sector A can't destroy Sector B. It
> isn't easy; it involves journalling. But I've always taken it as an
> absolute requirement.

How do you do a second layer of journalling (in addition to the
filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about something like where Sectors 1-5 are covered
> by a single parity sector and the RAID system restarts between when it has
> written Sector 1 and when it has written the new parity. Now if you lose
> Sector 2, you'll recover incorrect contents for it.
>
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're
> saying it does have this problem.

No, I'm assuming it has this problem because every description of
RAID-5 I've seen does not mention journalling or anything equivalent.
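The scenario quoted above can be made concrete with a small model.
This is only a sketch under stated assumptions: plain XOR parity over
five 8-byte "sectors", no journalling, and a restart that lands between
the data write and the parity write. The helper name xor_blocks and the
sector contents are made up for illustration; it does not describe how
Linux md actually behaves.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Five data sectors covered by one parity sector, as in the quoted scenario.
data = [bytes([i]) * 8 for i in range(1, 6)]    # sectors 1..5
original_sector_2 = data[1]
parity = xor_blocks(data)

# A logical write to sector 1 needs two physical writes: data, then parity.
data[0] = b"\xaa" * 8       # the new sector 1 reaches its disk...
# ...and the system restarts here, so `parity` still describes the old data.

# Later the disk holding sector 2 is lost; rebuild it from the survivors.
survivors = [data[0]] + data[2:] + [parity]
rebuilt_sector_2 = xor_blocks(survivors)

# Sector 2 was never written to, yet its rebuilt contents are wrong.
print(rebuilt_sector_2 == original_sector_2)
```

In this model the damage only shows up once a second failure forces a
rebuild, which is why the affected sector is neither the one being
written nor adjacent to it.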
> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling. If individual sector
> > writes are atomic, this isn't an issue. Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
>
> It would make a lot more sense to make the RAID block device driver
> present a block device that can't corrupt data upon something as simple as
> a restart in the middle of a write to an unrelated sector than to make
> filesystem drivers comprehend a block device that can. Less work, more
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do
> some of the fancy storage things we'd like to do these days anyway, so
> another approach would be to integrate the RAID-5 function in the
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and Btrfs, I guess.
This is why RAID ought to work better in the filesystem.
Two layers of journalling or equivalent does not sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it. I'm making assumptions.
Other parts of Linux are a bit flaky on the issue of data integrity on
crashes though, and I/O barriers are not passed down through Linux
software RAID-5, so I'd be mighty surprised if it provides atomic writes.

> > If individual sector writes are atomic, this isn't an issue.
>
> True, however: atomic is sufficient, but not necessary. In the real
> world, disk drive writes aren't atomic, and it's OK. A journalling
> filesystem can deal with a failed write wiping out the previous contents
> of the subject sector. It just can't deal with a failed write polluting
> some unrelated previously hardened sector.

That's right. But a failed write might corrupt previously
hardened sectors in these cases:

  - Disks with 4k sectors pretending to be 512-byte sectors.
  - RAIDs without journalling (or an equivalent) and no
    battery backup.
  - SSDs and other flash storage if their internal algorithms are stupid.
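The first case in the list can be sketched as follows. The model is an
assumption, not a description of any particular drive: it supposes that
the disk services a 512-byte write by rewriting the whole enclosing
4096-byte physical sector, and that losing power mid-write leaves that
entire physical sector unreadable. The names LOGICAL, PHYSICAL and
interrupted_write are hypothetical.

```python
LOGICAL = 512                   # sector size the disk advertises
PHYSICAL = 4096                 # sector size assumed for the medium itself
PER_PHYSICAL = PHYSICAL // LOGICAL


def interrupted_write(disk, lba, payload):
    """Model a read-modify-write of one logical sector that loses power
    mid-write: every logical sector sharing the physical sector is lost."""
    assert len(payload) == LOGICAL
    first = (lba // PER_PHYSICAL) * PER_PHYSICAL
    for n in range(first, first + PER_PHYSICAL):
        disk[n] = None          # None marks an unreadable logical sector
    return disk


# Sixteen logical sectors that were all safely on disk ("hardened").
disk = [bytes([n]) * LOGICAL for n in range(16)]

interrupted_write(disk, 10, b"\xff" * LOGICAL)
damaged = [n for n, sector in enumerate(disk) if sector is None]
print(damaged)                  # logical sectors lost besides the target
```

Under these assumptions a failed write to one logical sector takes its
seven neighbours with it, even though none of them was being written.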
I've just noticed that a system crash is not the only way this type of
corruption can happen.
Does this argue for an additional parameter in the block device
hints, alongside strip size and sector size: the Radius of Corruption
on Failed Write?
For hard disks, this is the sector size. But for RAIDs and maybe some
flash storage, it might be larger.
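As a rough sketch of what such a hint might look like to a filesystem:
the BlockHints structure, its corruption_radius field and the
at_risk_range helper below are all hypothetical, not an existing kernel
interface, and the flash radius is an invented figure. The model also
assumes the damage is contiguous and aligned to the radius, which would
not cover the non-contiguous RAID-5 case discussed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockHints:
    """Hypothetical hints a block device could export to a filesystem."""
    sector_size: int            # bytes
    strip_size: int             # bytes; 0 here means "not striped"
    corruption_radius: int      # bytes at risk around a failed write


def at_risk_range(hints, offset, length):
    """Return the (start, end) byte range that a failed write of `length`
    bytes at `offset` could damage, assuming aligned, contiguous damage."""
    r = hints.corruption_radius
    start = (offset // r) * r               # round down to a multiple of r
    end = -(-(offset + length) // r) * r    # round up to a multiple of r
    return start, end


plain_disk = BlockHints(sector_size=512, strip_size=0, corruption_radius=512)
emulated_4k = BlockHints(sector_size=512, strip_size=0, corruption_radius=4096)
# 128 KiB is an invented figure for some flash device, not a measured one.
flash = BlockHints(sector_size=512, strip_size=0, corruption_radius=128 * 1024)

for hints in (plain_disk, emulated_4k, flash):
    print(hints.corruption_radius, at_risk_range(hints, 5120, 512))
```

A journalling filesystem could use a range like this to avoid placing a
new journal record in the same at-risk region as data it has already
committed.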
-- Jamie