From: John Robinson <john.robinson@anonymous.org.uk>
To: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: RFC: detection of silent corruption via ATA long sector reads
Date: Sun, 04 Jan 2009 12:31:17 +0000
Message-ID: <4960AC15.8030207@anonymous.org.uk>
In-Reply-To: <yq1eizj5xos.fsf@sermon.lab.mkp.net>

On 04/01/2009 07:37, Martin K. Petersen wrote:
>>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:
>
> John> Excuse me if I'm being dense - and indeed tell me! - but RAID
> John> 4/5/6 already suffer from having to do read-modify-write for
> John> small writes, so is there any chance this could be done at
> John> relatively little additional expense for these?
>
> You'd still need to store a checksum somewhere else, incurring
> additional seek cost. You could attempt to weasel out of that by adding
> the checksum sector after a limited number of blocks and hope that you'd
> be able to pull it in or write it out in one sweep.
>
> The downside: assume we do checksums on - say - 8KB chunks in the
> RAID5 case. We only need to store a few handfuls of bytes of checksum
> goo per block. But we can't address less than a 512 byte sector. So we
> either waste the bulk of 1 sector for every 16 to increase the
> likelihood of adjacent access, or we push the checksum sector
> further out to fill it completely. That wastes less space but has a
> higher chance of causing an extra seek. Pick your poison.
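
To put rough numbers on the two options in the quoted paragraph, here is a
minimal sketch. The 16-byte checksum size is an assumption (the quote only
says "a few handfuls of bytes"), and the function name is hypothetical.

```python
SECTOR_BYTES = 512  # smallest addressable unit, per the quoted text


def checksum_layout(block_bytes, checksum_bytes, blocks_per_checksum_sector):
    """Return (data sectors per checksum sector, space overhead, bytes wasted).

    blocks_per_checksum_sector is how many blocks share one checksum sector:
    1 keeps the checksum adjacent to its block, the maximum fills the sector.
    """
    capacity = SECTOR_BYTES // checksum_bytes
    if not 1 <= blocks_per_checksum_sector <= capacity:
        raise ValueError("checksums do not fit in one sector")
    data_sectors = blocks_per_checksum_sector * block_bytes // SECTOR_BYTES
    overhead = 1 / (data_sectors + 1)  # one extra sector per group
    wasted = SECTOR_BYTES - blocks_per_checksum_sector * checksum_bytes
    return data_sectors, overhead, wasted


# Option 1: one checksum sector after every 8KB block (16 data sectors).
adjacent = checksum_layout(8192, 16, 1)
# Option 2: fill the checksum sector completely (assumed 16-byte checksums).
packed = checksum_layout(8192, 16, 32)
```

Under these assumptions the adjacent layout costs one sector in 17 and leaves
496 bytes of the checksum sector unused, while the packed layout costs one
sector in 513 but puts the checksum up to 512 sectors away from its data.
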

Well, I was assuming that MD/DM operates in chunk-sized units (e.g. 32K,
or 64 sectors) anyway, so having a sector or two of checksums on disc
immediately following each chunk would be a pretty small cost. It would
lengthen each read or write cycle only marginally (e.g. to 65 sectors),
which shouldn't cause much drop in performance (I guess about 1/64th in
throughput and IOPS, if the discs themselves are the bottleneck).
Essentially DIF on 32K blocks instead of 512-byte ones. But perhaps this
is a bad assumption and MD/DM already optimises out whole-chunk reads
and writes where they're not required (for very short,
less-than-one-chunk transactions), and I've no idea how often that
happens.

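
The per-chunk layout suggested here can be sketched as a simple address
mapping. This is only an illustration of the idea, not how MD/DM lays out
data; the constants follow the 32K example, and the function names are
hypothetical.

```python
CHUNK_SECTORS = 64     # 32K chunk of 512-byte sectors, as in the example
CHECKSUM_SECTORS = 1   # assumed: one checksum sector trailing each chunk
STRIDE = CHUNK_SECTORS + CHECKSUM_SECTORS  # 65 sectors on disc per chunk


def physical_sector(logical_sector):
    """Map a logical data sector to its on-disc sector number."""
    chunk, offset = divmod(logical_sector, CHUNK_SECTORS)
    return chunk * STRIDE + offset


def checksum_sector(logical_sector):
    """On-disc sector holding the checksums for this sector's chunk."""
    chunk = logical_sector // CHUNK_SECTORS
    return chunk * STRIDE + CHUNK_SECTORS


def data_fraction():
    """Fraction of each on-disc stride that carries user data."""
    return CHUNK_SECTORS / STRIDE
```

With these numbers a full-chunk transfer moves 65 sectors to deliver 64, so
the loss for disc-bound whole-chunk I/O is 1/65, close to the 1/64th guessed
above. Sub-chunk requests would pay more, since they would still need the
whole chunk to verify the checksum.
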
> The reason I'm advocating checksumming on logical (filesystem) blocks is
> that the filesystems have a much better idea what's good and what's bad
> in a recovery situation. And the filesystems already have an
> infrastructure for storing metadata like checksums. The cost of
> accessing that metadata is inherent and inevitable.

Yes, I can see that. But the old premise that RAID tried to maintain was
that disc sectors don't go bad. You're quite reasonably dropping the
premise rather than trying to do more to maintain it. There might be
validity to both approaches.

> We also don't want to do checksumming at every layer. That's going to
> suck from a performance perspective. It's better to do checksumming
> high up in the stack and only do it once. As long as we give the upper
> layers the option of re-driving the I/O.
>
> That involves adding a cookie to each bio that gets filled out by DM/MD
> on completion. If the filesystem checksum fails we can resubmit the I/O
> and pass along the cookie indicating that we want a different copy than
> the one the cookie represents.

I'd like to understand this mechanism better; at first glance it's
either going to be too simplistic and not cover the various block layer
cases well, or it means you end up re-implementing RAID and LVM in the
filesystem.
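
For concreteness, the cookie re-drive idea in the quoted text can be modelled
as a toy two-way mirror. This is a sketch under stated assumptions, not the
kernel's bio API: the `Bio`, `Mirror` and `fs_read` names, the use of a copy
index as the cookie, and the `avoid` set are all hypothetical. Parity-based
reconstruction (RAID 4/5/6) is not modelled.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bio:
    sector: int
    data: bytes = b""
    cookie: Optional[int] = None    # filled in by the mirror on completion
    avoid: frozenset = frozenset()  # copies the filesystem has rejected


class Mirror:
    """Toy RAID1: each 'disk' is a dict mapping sector -> bytes."""

    def __init__(self, disks):
        self.disks = disks

    def submit_read(self, bio):
        # Serve the read from the first copy the caller has not rejected,
        # and record which copy that was in the cookie.
        for idx, disk in enumerate(self.disks):
            if idx not in bio.avoid:
                bio.data = disk[bio.sector]
                bio.cookie = idx
                return True
        return False  # no untried copy left


def fs_read(mirror, sector, expected_crc):
    """Filesystem side: verify the checksum, re-drive the I/O on mismatch."""
    bio = Bio(sector)
    while mirror.submit_read(bio):
        if zlib.crc32(bio.data) == expected_crc:
            return bio.data, bio.cookie
        # Resubmit, asking for a different copy than the cookie represents.
        bio.avoid = bio.avoid | {bio.cookie}
    raise OSError("all copies failed the checksum")


good = b"filesystem block"
disks = [{0: b"silently corrupt"}, {0: good}]
data, served_by = fs_read(Mirror(disks), 0, zlib.crc32(good))
```

Here the first read returns the corrupt copy, the checksum fails, and the
re-driven read is served from the second disk. The model covers only the
simple mirror case; stacked devices would each need their own notion of
"a different copy", which is where the concern above applies.
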

Just my €$£0.02 of course.

Cheers,
John.