* Re: RFC: log record CRC validation
[not found] <20070725092445.GT12413810@sgi.com>
@ 2007-07-25 10:14 ` Mark Goodwin
2007-07-26 5:55 ` David Chinner
2007-07-26 17:53 ` Michael Nishimoto
0 siblings, 2 replies; 18+ messages in thread
From: Mark Goodwin @ 2007-07-25 10:14 UTC (permalink / raw)
To: David Chinner; +Cc: xfs-dev, xfs-oss
David Chinner wrote:
>
> The next question is the hard one. What do we do when we detect
> a log record CRC error? Right now it just warns and sets a flag
> in the log. I think it should probably prevent log replay from
> replaying past this point (i.e. trim the head back to the last
> good log record) but I'm not sure what the best thing to do here.
>
> Comments?
1. perhaps use a new flag XLOG_CRC_MISMATCH instead of XLOG_CHKSUM_MISMATCH
2. is there (or could there be if we added it), correction for n-bit errors?
-- Mark
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation
2007-07-25 10:14 ` RFC: log record CRC validation Mark Goodwin
@ 2007-07-26 5:55 ` David Chinner
2007-07-26 23:01 ` Andi Kleen
2007-07-26 17:53 ` Michael Nishimoto
1 sibling, 1 reply; 18+ messages in thread
From: David Chinner @ 2007-07-26 5:55 UTC (permalink / raw)
To: Mark Goodwin; +Cc: xfs-dev, xfs-oss
On Wed, Jul 25, 2007 at 08:14:05PM +1000, Mark Goodwin wrote:
>
>
> David Chinner wrote:
> >
> >The next question is the hard one. What do we do when we detect
> >a log record CRC error? Right now it just warns and sets a flag
> >in the log. I think it should probably prevent log replay from
> >replaying past this point (i.e. trim the head back to the last
> >good log record) but I'm not sure what the best thing to do here.
> >
> >Comments?
>
> 1. perhaps use a new flag XLOG_CRC_MISMATCH instead of XLOG_CHKSUM_MISMATCH
Yeah, that's easy to do. What to do with the error is more important,
though.
> 2. is there (or could there be if we added it), correction for n-bit errors?
Nope. To do that, we'd need to implement some type of Reed-Solomon
coding and would need to use more bits on disk to store the ECC
data. That would have a much bigger impact on log throughput than a
table based CRC on a chunk of data that is hot in the CPU cache. And
we'd have to write the code as well. ;)
However, I'm not convinced that this sort of error correction is the
best thing to do at a high level as all the low level storage
already does Reed-Solomon based bit error correction. I'd much
prefer to use a different method of redundancy in the filesystem so
the error detection and correction schemes at different levels don't
have the same weaknesses.
That means the filesystem needs strong enough CRCs to detect bit
errors and sufficient structure validity checking to detect gross
errors. XFS already does pretty good structure checking; we don't
have bit error detection though....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-07-25 10:14 ` RFC: log record CRC validation Mark Goodwin
2007-07-26 5:55 ` David Chinner
@ 2007-07-26 17:53 ` Michael Nishimoto
2007-07-26 23:31 ` David Chinner
1 sibling, 1 reply; 18+ messages in thread
From: Michael Nishimoto @ 2007-07-26 17:53 UTC (permalink / raw)
To: markgw; +Cc: David Chinner, xfs-dev, xfs-oss
Mark Goodwin wrote:
>
> David Chinner wrote:
> >
> > The next question is the hard one. What do we do when we detect
> > a log record CRC error? Right now it just warns and sets a flag
> > in the log. I think it should probably prevent log replay from
> > replaying past this point (i.e. trim the head back to the last
> > good log record) but I'm not sure what the best thing to do here.
> >
> > Comments?
>
> 1. perhaps use a new flag XLOG_CRC_MISMATCH instead of
> XLOG_CHKSUM_MISMATCH
> 2. is there (or could there be if we added it), correction for n-bit
> errors?
>
> -- Mark
>
I don't see the original message in the archives. Is CRC checking being
added to xfs log data? If so, what data has been collected to show that
this needs to be added?
thanks,
Michael
* Re: RFC: log record CRC validation
2007-07-26 5:55 ` David Chinner
@ 2007-07-26 23:01 ` Andi Kleen
2007-07-26 23:50 ` David Chinner
0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2007-07-26 23:01 UTC (permalink / raw)
To: David Chinner; +Cc: Mark Goodwin, xfs-dev, xfs-oss
David Chinner <dgc@sgi.com> writes:
>
> Nope. To do that, we'd need to implement some type of Reed-Solomon
> coding and would need to use more bits on disk to store the ECC
> data. That would have a much bigger impact on log throughput than a
> table based CRC on a chunk of data that is hot in the CPU cache.
Processing or rewriting cache hot data shouldn't be significantly
different in cost (assuming the basic CPU usage of the algorithms
is not too different); just the cache lines need to be already exclusive
which is likely the case with logs.
> And we'd have to write the code as well. ;)
Modern kernels have R-S functions in lib/reed_solomon. They
are used in some of the flash file systems. I haven't checked
how their performance compares to standard CRC though.
>
> However, I'm not convinced that this sort of error correction is the
> best thing to do at a high level as all the low level storage
> already does Reed-Solomon based bit error correction. I'd much
> prefer to use a different method of redundancy in the filesystem so
> the error detection and correction schemes at different levels don't
> have the same weaknesses.
Agreed. On the file system level the best way to handle this is
likely data duplicated on different blocks.
> That means the filesystem needs strong enough CRCs to detect bit
> errors and sufficient structure validity checking to detect gross
> errors. XFS already does pretty good structure checking; we don't
The trouble is that it tends to take overly drastic measures (shutdown) if it
detects any inconsistency.
-Andi
* Re: RFC: log record CRC validation
2007-07-26 17:53 ` Michael Nishimoto
@ 2007-07-26 23:31 ` David Chinner
2007-07-27 1:24 ` Michael Nishimoto
2007-07-28 2:00 ` William J. Earl
0 siblings, 2 replies; 18+ messages in thread
From: David Chinner @ 2007-07-26 23:31 UTC (permalink / raw)
To: Michael Nishimoto; +Cc: markgw, David Chinner, xfs-dev, xfs-oss
On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote:
> Mark Goodwin wrote:
> >David Chinner wrote:
> >>
> >> The next question is the hard one. What do we do when we detect
> >> a log record CRC error? Right now it just warns and sets a flag
> >> in the log. I think it should probably prevent log replay from
> >> replaying past this point (i.e. trim the head back to the last
> >> good log record) but I'm not sure what the best thing to do here.
> >>
> >> Comments?
> >
> >1. perhaps use a new flag XLOG_CRC_MISMATCH instead of
> >XLOG_CHKSUM_MISMATCH
> >2. is there (or could there be if we added it), correction for n-bit
> >errors?
> >
> >-- Mark
> >
> I don't see the original message in the archives.
Strange. I received the original email to a different address, so it
definitely got through the list daemon. I've put it inline at the
bottom of the mail.
> Is CRC checking being added to xfs log data?
Yes. It's a little-used debug option right now, and I'm
planning on making it the default behaviour.
> If so, what data has been collected to show that this needs to be added?
The sizes of high-end filesystems are now of the same order of
magnitude as the bit error rates of the storage hardware. e.g. 1PB =
10^16 bits. The bit error rate of high end FC drives? 1 in 10^16
bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).
We've got filesystems capable of moving > 2 x 10^16 bits of data
*per day* and we see lots of instances of multi-TB arrays made out
of desktop SATA disks. Given the recent studies of long-term disk
reliability, these vendor figures are likely to be the best error
rates we can hope for.....
IOWs, we don't need evidence to justify this sort of error detection
because simple maths says there are going to be errors. We
have to deal with that, and hence we are going to be adding CRC
checking to on-disk metadata structures so we can detect bit errors
that would otherwise go undetected and result in filesystem
corruption.
This means that instead of getting shutdown reports for some strange
and unreproducible btree corruption, we'll get a shutdown for a CRC
failure on the btree block. It is very likely that this will occur
much earlier than a subtle btree corruption would otherwise be
detected and hence we'll be less likely to propagate errors around
the filesystem.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Date: Wed, 25 Jul 2007 19:24:45 +1000
From: David Chinner <dgc@sgi.com>
To: xfs-dev <xfs-dev@sgi.com>
Cc: xfs-oss <xfs@oss.sgi.com>
Subject: RFC: log record CRC validation
Folks,
I've just fixed up the never-used-debug log record checksumming
code with an eye to permanently enabling it for production
filesystems.
Firstly, I updated the simple 32 bit wide XOR checksum to use the
crc32c module. This places a new dependency on XFS - it will now
depend on CONFIG_LIBCRC32C. I'm also not sure which is the best
method to use - the little-endian or big-endian CRC algorithm -
so I just went for the default (crc32c()).
This then resulted in recovery failing to verify the checksums,
and it turns out that is because xfs_pack_data() gets passed a
padded buffer and size to checksum (padded to 512 bytes), whereas
the unpacking (recovery) only checksummed the unpadded record
length. Hence this code probably never worked reliably if anyone
ever enabled it.
This does bring up a question - probably for Tim - do we only get
rounded to BBs or do we get rounded to the log stripe unit when
packing the log records before writeout? It seems from a quick test
that it is only BBs, but confirmation would be good....
The next question is the hard one. What do we do when we detect
a log record CRC error? Right now it just warns and sets a flag
in the log. I think it should probably prevent log replay from
replaying past this point (i.e. trim the head back to the last
good log record) but I'm not sure what the best thing to do here.
Comments?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/Kconfig | 2 -
fs/xfs/linux-2.6/xfs_linux.h | 1
fs/xfs/xfs_log_priv.h | 2 -
fs/xfs/xfs_log_recover.c | 80 ++++++++++++++++++-------------------------
4 files changed, 37 insertions(+), 48 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_priv.h 2007-07-23 11:09:52.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h 2007-07-25 11:43:02.706927249 +1000
@@ -329,7 +329,7 @@ typedef struct xlog_rec_header {
int h_len; /* len in bytes; should be 64-bit aligned: 4 */
xfs_lsn_t h_lsn; /* lsn of this LR : 8 */
xfs_lsn_t h_tail_lsn; /* lsn of 1st LR w/ buffers not committed: 8 */
- uint h_chksum; /* may not be used; non-zero if used : 4 */
+ __be32 h_crc; /* log record crc value : 4 */
int h_prev_block; /* block number to previous LR : 4 */
int h_num_logops; /* number of log operations in this LR : 4 */
uint h_cycle_data[XLOG_HEADER_CYCLE_SIZE / BBSIZE];
Index: 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_recover.c 2007-07-25 10:42:54.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c 2007-07-25 19:10:12.185803348 +1000
@@ -3302,28 +3302,18 @@ xlog_recover_process_iunlinks(
}
-#ifdef DEBUG
+#define XLOG_CHKSUM_SEED 0x4c585343 /* 'XLCS' */
STATIC void
-xlog_pack_data_checksum(
+xlog_pack_data_crc(
xlog_t *log,
xlog_in_core_t *iclog,
int size)
{
- int i;
- uint *up;
- uint chksum = 0;
-
- up = (uint *)iclog->ic_datap;
- /* divide length by 4 to get # words */
- for (i = 0; i < (size >> 2); i++) {
- chksum ^= INT_GET(*up, ARCH_CONVERT);
- up++;
- }
- INT_SET(iclog->ic_header.h_chksum, ARCH_CONVERT, chksum);
+ u32 crc;
+
+ crc = crc32c(XLOG_CHKSUM_SEED, iclog->ic_datap, size);
+ iclog->ic_header.h_crc = cpu_to_be32(crc);
}
-#else
-#define xlog_pack_data_checksum(log, iclog, size)
-#endif
/*
* Stamp cycle number in every block
@@ -3340,7 +3330,7 @@ xlog_pack_data(
xfs_caddr_t dp;
xlog_in_core_2_t *xhdr;
- xlog_pack_data_checksum(log, iclog, size);
+ xlog_pack_data_crc(log, iclog, size);
cycle_lsn = CYCLE_LSN_DISK(iclog->ic_header.h_lsn);
@@ -3368,41 +3358,38 @@ xlog_pack_data(
}
}
-#if defined(DEBUG) && defined(XFS_LOUD_RECOVERY)
STATIC void
-xlog_unpack_data_checksum(
+xlog_unpack_data_crc(
xlog_rec_header_t *rhead,
xfs_caddr_t dp,
xlog_t *log)
{
- uint *up = (uint *)dp;
- uint chksum = 0;
- int i;
-
- /* divide length by 4 to get # words */
- for (i=0; i < INT_GET(rhead->h_len, ARCH_CONVERT) >> 2; i++) {
- chksum ^= INT_GET(*up, ARCH_CONVERT);
- up++;
- }
- if (chksum != INT_GET(rhead->h_chksum, ARCH_CONVERT)) {
- if (rhead->h_chksum ||
- ((log->l_flags & XLOG_CHKSUM_MISMATCH) == 0)) {
- cmn_err(CE_DEBUG,
- "XFS: LogR chksum mismatch: was (0x%x) is (0x%x)\n",
- INT_GET(rhead->h_chksum, ARCH_CONVERT), chksum);
- cmn_err(CE_DEBUG,
-"XFS: Disregard message if filesystem was created with non-DEBUG kernel");
- if (XFS_SB_VERSION_HASLOGV2(&log->l_mp->m_sb)) {
- cmn_err(CE_DEBUG,
- "XFS: LogR this is a LogV2 filesystem\n");
- }
- log->l_flags |= XLOG_CHKSUM_MISMATCH;
- }
+ u32 hdr_crc;
+ u32 crc;
+ int size;
+
+ /* XXX: XLOG_LSUNITTOB for v2 logs? Initial tests
+ * seem to indicate it's not needed... */
+ size = BBTOB(BTOBB(INT_GET(rhead->h_len, ARCH_CONVERT)));
+ hdr_crc = be32_to_cpu(rhead->h_crc);
+ crc = crc32c(XLOG_CHKSUM_SEED, dp, size);
+ if (hdr_crc != crc) {
+ cmn_err(CE_ALERT, "XFS Recovery: Log Record CRC error: "
+ "was (0x%x), computed (0x%x), size 0x%x.\n",
+ hdr_crc, crc, size);
+ print_hex_dump(KERN_ALERT, "record: ", 0, 32, 1, dp, 32, 0);
+
+ /*
+ * XXX: What should we do here? if we've detected a
+ * log record corruption, then we can't recovery past
+ * this point, right?
+ *
+ * Leave it to a higher layer to detect this flag and
+ * act appropriately?
+ */
+ log->l_flags |= XLOG_CHKSUM_MISMATCH;
}
}
-#else
-#define xlog_unpack_data_checksum(rhead, dp, log)
-#endif
STATIC void
xlog_unpack_data(
@@ -3412,6 +3399,7 @@ xlog_unpack_data(
{
int i, j, k;
xlog_in_core_2_t *xhdr;
+ xfs_caddr_t odp = dp;
for (i = 0; i < BTOBB(INT_GET(rhead->h_len, ARCH_CONVERT)) &&
i < (XLOG_HEADER_CYCLE_SIZE / BBSIZE); i++) {
@@ -3429,7 +3417,7 @@ xlog_unpack_data(
}
}
- xlog_unpack_data_checksum(rhead, dp, log);
+ xlog_unpack_data_crc(rhead, odp, log);
}
STATIC int
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_linux.h 2007-07-23 17:16:56.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h 2007-07-25 11:47:07.411080498 +1000
@@ -76,6 +76,7 @@
#include <linux/notifier.h>
#include <linux/delay.h>
#include <linux/log2.h>
+#include <linux/crc32c.h>
#include <asm/page.h>
#include <asm/div64.h>
Index: 2.6.x-xfs-new/fs/xfs/Kconfig
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/Kconfig 2007-01-16 10:54:14.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/Kconfig 2007-07-25 12:16:50.804060182 +1000
@@ -1,6 +1,6 @@
config XFS_FS
tristate "XFS filesystem support"
- depends on BLOCK
+ depends on BLOCK && LIBCRC32C
help
XFS is a high performance journaling filesystem which originated
on the SGI IRIX platform. It is completely multi-threaded, can
* Re: RFC: log record CRC validation
2007-07-26 23:01 ` Andi Kleen
@ 2007-07-26 23:50 ` David Chinner
0 siblings, 0 replies; 18+ messages in thread
From: David Chinner @ 2007-07-26 23:50 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Chinner, Mark Goodwin, xfs-dev, xfs-oss
On Fri, Jul 27, 2007 at 01:01:15AM +0200, Andi Kleen wrote:
> David Chinner <dgc@sgi.com> writes:
> >
> > Nope. To do that, we'd need to implement some type of Reed-Solomon
> > coding and would need to use more bits on disk to store the ECC
> > data. That would have a much bigger impact on log throughput than a
> > table based CRC on a chunk of data that is hot in the CPU cache.
>
> Processing or rewriting cache hot data shouldn't be significantly
> different in cost (assuming the basic CPU usage of the algorithms
> is not too different); just the cache lines need to be already exclusive
> which is likely the case with logs.
*nod*
> > And we'd have to write the code as well. ;)
>
> Modern kernels have R-S functions in lib/reed_solomon. They
> are used in some of the flash file systems. I haven't checked
> how their performance compares to standard CRC though.
Ah, I didn't know that. I'll have a look at it....
Admittedly I didn't look all that hard because:
> > However, I'm not convinced that this sort of error correction is the
> > best thing to do at a high level as all the low level storage
> > already does Reed-Solomon based bit error correction. I'd much
> > prefer to use a different method of redundancy in the filesystem so
> > the error detection and correction schemes at different levels don't
> > have the same weaknesses.
>
> Agreed. On the file system level the best way to handle this is
> likely data duplicated on different blocks.
Yes, something like that. I haven't looked into all the potential
ways of providing redundancy yet - I'm still focussing on making
error detection more effective.
> > That means the filesystem needs strong enough CRCs to detect bit
> > errors and sufficient structure validity checking to detect gross
> > errors. XFS already does pretty good structure checking; we don't
>
> The trouble is that it tends to take overly drastic measures (shutdown) if it
> detects any inconsistency.
IMO, that's not drastic - it's the only sane thing to do in the
absence of redundant metadata that you can use to recover from. To
continue operations on a known corrupted filesystem risks making it
far, far worse, esp. if the corruption is in something like a free
space btree.
However, solving this is a separable problem - reliable error
correction comes after robust error detection....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-07-26 23:31 ` David Chinner
@ 2007-07-27 1:24 ` Michael Nishimoto
2007-07-27 6:59 ` David Chinner
2007-07-28 2:00 ` William J. Earl
1 sibling, 1 reply; 18+ messages in thread
From: Michael Nishimoto @ 2007-07-27 1:24 UTC (permalink / raw)
To: David Chinner; +Cc: markgw, xfs-dev, xfs-oss
The log checksum code has not been used since the
development phase of xfs. It did work at one point because I
remember using it and then decided to disable it and use just
the current cycle stamping technique. The checksum code was
just advisory, so I could see if it ever occurred during
development.
When a CRC error is found, your suggestion is correct. Recovery
should back up and process only completely good log records. The code
backs up in this same fashion when it encounters a region of
missing sector updates because of the async nature of log
writes and disk caches.
At this point, I'm not convinced that xfs needs to do CRCs on
the xfs log because the size of an xfs log is relatively small.
Michael
> Date: Wed, 25 Jul 2007 19:24:45 +1000
> From: David Chinner <dgc@sgi.com>
> To: xfs-dev <xfs-dev@sgi.com>
> Cc: xfs-oss <xfs@oss.sgi.com>
> Subject: RFC: log record CRC validation
>
> Folks,
>
> I've just fixed up the never-used-debug log record checksumming
> code with an eye to permanently enabling it for production
> filesystems.
>
> Firstly, I updated the simple 32 bit wide XOR checksum to use the
> crc32c module. This places a new dependency on XFS - it will now
> depend on CONFIG_LIBCRC32C. I'm also not sure which is the best
> method to use - the little-endian or big-endian CRC algorithm -
> so I just went for the default (crc32c()).
>
> This then resulted in recovery failing to verify the checksums,
> and it turns out that is because xfs_pack_data() gets passed a
> padded buffer and size to checksum (padded to 512 bytes), whereas
> the unpacking (recovery) only checksummed the unpadded record
> length. Hence this code probably never worked reliably if anyone
> ever enabled it.
>
> This does bring up a question - probably for Tim - do we only get
> rounded to BBs or do we get rounded to the log stripe unit when
> packing the log records before writeout? It seems from a quick test
> that it is only BBs, but confirmation would be good....
>
> The next question is the hard one. What do we do when we detect
> a log record CRC error? Right now it just warns and sets a flag
> in the log. I think it should probably prevent log replay from
> replaying past this point (i.e. trim the head back to the last
> good log record) but I'm not sure what the best thing to do here.
>
> Comments?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
>
* Re: RFC: log record CRC validation
2007-07-27 1:24 ` Michael Nishimoto
@ 2007-07-27 6:59 ` David Chinner
2007-08-01 0:49 ` Michael Nishimoto
0 siblings, 1 reply; 18+ messages in thread
From: David Chinner @ 2007-07-27 6:59 UTC (permalink / raw)
To: Michael Nishimoto; +Cc: David Chinner, markgw, xfs-dev, xfs-oss
On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> The log checksum code has not been used since the
> development phase of xfs. It did work at one point because I
> remember using it and then decided to disable it and use just
> the current cycle stamping technique. The checksum code was
> just advisory, so I could see if it ever occurred during
> development.
>
> When a CRC error is found, your suggestion is correct. Recovery
> should back up and process only completely good log records. The code
> backs up in this same fashion when it encounters a region of
> missing sector updates because of the async nature of log
> writes and disk caches.
Yes, but that hole usually exists only in the last 8 log-buffers'
worth of the log (i.e. 256k by default). However, if the tail block
has a CRC error, we've got to throw away the entire log, and that,
like zeroing a dirty log with xfs_repair, generally results in a
corrupted filesystem image.
An example of where this could be a problem is reusing a just-freed
extent. Before reusing it we force the log to get the transaction on
disk and rely on log replay to ensure that the block is freed in the
event of a crash. We then go and write over the contents of the
block. If the log transaction that freed the extent is not replayed,
then we've overwritten the previous contents of that extent and so
the "current" contents of the extent after log replay are wrong.
IOWs, I think that if we come across a bad CRC in a log record we
can replay up to that point but we've still got to abort the mount
and ask the user to run repair....
> At this point, I'm not convinced that xfs needs to do CRCs on
> the xfs log because the size of an xfs log is relatively small.
Sure, but the same argument can be made about the superblock,
or an AGF and a directory block. That doesn't mean that they'll
never have an error.
Statistically speaking, the log contains the blocks in the
filesystem that we most frequently do I/O to, so it's the most
likely region to see an I/O-path-induced bit error. If we see one
on recovery......
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-07-26 23:31 ` David Chinner
2007-07-27 1:24 ` Michael Nishimoto
@ 2007-07-28 2:00 ` William J. Earl
2007-07-28 14:03 ` Andi Kleen
2007-07-31 5:30 ` David Chinner
1 sibling, 2 replies; 18+ messages in thread
From: William J. Earl @ 2007-07-28 2:00 UTC (permalink / raw)
To: xfs-oss; +Cc: David Chinner, Michael Nishimoto, markgw
David Chinner wrote:
> On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote:
>
> ...
>> Is CRC checking being added to xfs log data?
>>
>
> Yes. It's a little-used debug option right now, and I'm
> planning on making it the default behaviour.
>
>
>> If so, what data has been collected to show that this needs to be added?
>>
>
> The sizes of high-end filesystems are now of the same order of
> magnitude as the bit error rates of the storage hardware. e.g. 1PB =
> 10^16 bits. The bit error rate of high end FC drives? 1 in 10^16
> bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
> SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).
>
> We've got filesystems capable of moving > 2 x 10^16 bits of data
> *per day* and we see lots of instances of multi-TB arrays made out
> of desktop SATA disks. Given the recent studies of long-term disk
> reliability, these vendor figures are likely to be the best error
> rates we can hope for.....
>
> IOWs, we don't need evidence to justify this sort of error detection
> because simple maths says there are going to be errors. We
> have to deal with that, and hence we are going to be adding CRC
> checking to on-disk metadata structures so we can detect bit errors
> that would otherwise go undetected and result in filesystem
> corruption.
>
> This means that instead of getting shutdown reports for some strange
> and unreproducible btree corruption, we'll get a shutdown for a CRC
> failure on the btree block. It is very likely that this will occur
> much earlier than a subtle btree corruption would otherwise be
> detected and hence we'll be less likely to propagate errors around
> the filesystem.
> ...
Mike Nishimoto pointed out this thread to me, and suggested I
reply, since I have worked on analyzing disk failure modes.
First, note that the claimed bit error rates are rates of
reported bad blocks, not rates of silent data corruption. The latter,
while not quoted, are far lower.
With large modern disks, it is unrealistic to not use RAID 1,
RAID 5, or RAID 6 to mask reported disk bit errors. A CRC in a
filesystem data structure, however, is not required to detect disk
errors, even without RAID protection, since the disks report error
correction failures. The rates of 1 in 10**14 (desktop SATA), 1 in
10**15 (Enterprise SATA and some FC), and 1 in 10**16 (some FC) bits
read are detected and reported error rates. From conversations with
drive vendors, these are actually fairly conservative, and assume you
write the data once and read it after 10 years without rewriting it, as
in an archive, which is the worst case, since you have ten years of
accumulated deterioration.
With RAID protection, seeing reported errors which are not masked
by the RAID layer is extremely unlikely, except in the case of a drive
failure, where it is necessary to read all of the surviving drives to
reconstruct the contents of the lost drive. With 500 GB drives and a
7D+1P RAID 5 group, we need to read the seven surviving 500 GB
drives, 3.5 TB in all, to rebuild the RAID array using a replacement
drive, which is to say about 2.8 * 10**13 bits. This implies we
would see one block not reconstructed in every 3 or 4 rebuilds on
desktop SATA drives, if the data were archive data written once and
the rebuild happened after 10 years. We would expect about one
rebuild in 10 years.
With more RAID groups, the chance of data loss grows very
high. With 100 groups, we would see a rebuild every month or so, so
we would expect an unmasked read error every few months. With better
drives, the chance is of course reduced, but is still significant in
multi-PB storage systems. RAID 6 largely eliminates the chance of
seeing a read error, at some cost in performance.
Note that some people have reported both much higher annual
failure rates for drives (which increases the frequency of RAID
reconstructions and hence the chance of data loss) and higher read error
rates. Based on personal experience with a large number of drives, I
believe that both of these are a consequence of systems (including
software and disk host bus adapters) dealing poorly with common
transient disk problems, not actual errors on the surface of the disk.
For example, if the drive firmware gets confused and stops talking, the
controller will treat it as a read timeout, and the software may simply
report "read failure", which in turn may be interpreted as "drive
failed", even though an adequate drive reset would return the drive to
service with no harm done to the data.
In addition, many people have taken up using desktop drives for
primary storage, for which they are not designed. Desktop drives are
typically rated at 800,000 to 1,000,000 hours MTBF at 30% duty
cycle. Using them at 100% duty cycle drastically decreases their
MTBF, which in turn drastically increases the rate of unmasked read
errors, as a consequence of the extra RAID reconstructions.
Lastly, quite a few "white box" vendors have shipped chassis which
do not adequately cool and power the drives, and excessive heat can also
drastically reduce the MTBF.
In a well-designed chassis, with software which correctly
recovers from transient drive issues, I have observed higher than
claimed MTBF and much lower than claimed bit error rates. The
undetected error rate from a modern drive is not quoted publicly, but
the block ECC is quite strong (since it has to mask raw bit error rates
as high as 1 in 10**3) and hence can detect most error scenarios and
correct many of them.
None of the above, however, implies that we need CRCs on
filesystem data structures. That is, if you get EIO on a disk read,
then you don't need a CRC to know the block is bad. Other concerns,
however, can motivate having CRCs. In particular, if the path between
drive and memory can corrupt the data, then CRCs can help us recover to
some extent. This has been a recurring problem with various
technologies, but was particularly common on IDE (PATA) drives with
their vulnerable cables, where mishandling of the cable could lead to
silent data corruption. CRCs on just filesystem data structures,
however, only help with metadata integrity, leaving file data integrity
in the dark. Some vendors at various times, such as Tandem and
NetApp, have added a checksum to each block, usually when they were
having data integrity issues which turned out in the end to be bad
cables or bad software, but which could be masked by RAID recovery, if
detected by the block checksum. It is usually cost-effective, however,
to simply select a reliable disk subsystem.
With SATA, SAS, and FC, which have link-level integrity checks,
silent data corruption on the link is unlikely. This leaves mainly the
host bus adapter, the main memory, and the path between them. If those
are bad, however, it is hard to see how much the filesystem can help.
In conclusion, I doubt that CRCs are worth the added
complexity. If I wanted to mask flaky hardware, I would look at using
RAID 6 and validating parity on all reads and doing RAID recovery on any
errors, but doing offline testing of the hardware and repairing or
replacing any broken bits would be yet simpler.
* Re: RFC: log record CRC validation
2007-07-28 2:00 ` William J. Earl
@ 2007-07-28 14:03 ` Andi Kleen
2007-07-31 5:30 ` David Chinner
1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2007-07-28 14:03 UTC (permalink / raw)
To: William J. Earl; +Cc: xfs-oss, David Chinner, Michael Nishimoto, markgw
"William J. Earl" <earl@agami.com> writes:
> It is usually
> cost-effective, however, to simply select a reliable disk subsystem.
In practice people don't run fully reliable disk subsystems
unfortunately. And also a previously reliable disk subsystem
might degrade over time. It's better to handle this case.
Also, CPUs have gotten a lot faster recently, with cache misses
dominating cost rather than the actual operations; CRCs on relatively
small data items that the CPU is already manipulating (like file system
metadata) should be essentially free.
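For a concrete sense of how cheap such a check is, here is an illustrative table-driven CRC-32 over a zeroed 512-byte log sector. This is a sketch in Python for brevity (the kernel code would of course be C), and 0xEDB88320 is the standard reflected CRC-32 polynomial; whether XFS would use this exact polynomial is not specified in this thread.

```python
# Build the standard 256-entry lookup table for reflected CRC-32.
def make_crc32_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC_TABLE = make_crc32_table()

def crc32(data):
    # Standard CRC-32: init 0xFFFFFFFF, one table lookup per byte,
    # final inversion.
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ CRC_TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

sector = bytes(512)            # a zeroed 512-byte log sector
good = crc32(sector)
flipped = bytearray(sector)
flipped[100] ^= 0x01           # single-bit corruption
assert crc32(bytes(flipped)) != good
```

One table lookup and a couple of xors per byte, on data the CPU is already touching; the cache miss of reading the buffer dwarfs the checksum work, which is the point above.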
> With SATA, SAS, and FC, which have link-level integrity checks, silent
> data corruption on the link is unlikely.
For TCP/IP networking that is in theory true too -- the L2 networks
usually have CRCs. In practice the TCP/UDP checksum still finds a lot of
corruption, wherever it comes from, and some corruption is only detected
by stronger checksums (like in SSL) or goes undetected entirely.
While disk subsystems are probably doing a bit better than IP networks
in terms of error rates they are far from perfect either.
-Andi
* Re: RFC: log record CRC validation
2007-07-28 2:00 ` William J. Earl
2007-07-28 14:03 ` Andi Kleen
@ 2007-07-31 5:30 ` David Chinner
2007-08-01 1:32 ` William J. Earl
1 sibling, 1 reply; 18+ messages in thread
From: David Chinner @ 2007-07-31 5:30 UTC (permalink / raw)
To: William J. Earl; +Cc: xfs-oss, David Chinner, Michael Nishimoto, markgw
On Fri, Jul 27, 2007 at 07:00:32PM -0700, William J. Earl wrote:
> David Chinner wrote:
> >On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote:
> >
> >...
> >>Is CRC checking being added to xfs log data?
> >>
> >
> >Yes. it's a little used debug option right now, and I'm
> >planning on making it default behaviour.
> >
> >
> >>If so, what data has been collected to show that this needs to be added?
> >>
> >
> >The size of high-end filesystems are now at the same order of
> >magnitude as the bit error rate of the storage hardware. e.g. 1PB =
> >10^16 bits. The bit error rate of high end FC drives? 1 in 10^16
> >bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
> >SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).
>
> First, note that the claimed bit error rates are rates of
> reported bad blocks, not rates of silent data corruption. The latter,
> while not quoted, are far lower.
Ok, fair enough, but in the absence of hard numbers, and given that
real-world MTBF figures are worse than what manufacturers quote, I'm
always going to assume that this is the right ballpark.
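As a sketch, the arithmetic behind the figures quoted earlier in the thread (the bit error rates here are the quoted vendor numbers, not measurements):

```python
# Expected bit errors when reading a filesystem's worth of data at a
# given bit error rate.
def expected_bit_errors(bytes_read, bit_error_rate):
    return bytes_read * 8 * bit_error_rate

PB = 10**15  # 1 petabyte = 8e15 bits, i.e. roughly 10^16 bits

# Reading 1 PB once at each quoted rate:
fc_drive     = expected_bit_errors(PB, 1e-16)  # high-end FC: ~0.8
ent_sata     = expected_bit_errors(PB, 1e-15)  # "enterprise" SATA: ~8
desktop_sata = expected_bit_errors(PB, 1e-14)  # desktop SATA: ~80

assert abs(fc_drive - 0.8) < 1e-9
assert abs(desktop_sata - 80.0) < 1e-6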
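As a sketch, the arithmetic behind the figures quoted earlier in the thread (the bit error rates here are the quoted vendor numbers, not measurements):

```python
# Expected bit errors when reading a filesystem's worth of data at a
# given bit error rate.
def expected_bit_errors(bytes_read, bit_error_rate):
    return bytes_read * 8 * bit_error_rate

PB = 10**15  # 1 petabyte = 8e15 bits, i.e. roughly 10^16 bits

# Reading 1 PB once at each quoted rate:
fc_drive     = expected_bit_errors(PB, 1e-16)  # high-end FC: ~0.8
ent_sata     = expected_bit_errors(PB, 1e-15)  # "enterprise" SATA: ~8
desktop_sata = expected_bit_errors(PB, 1e-14)  # desktop SATA: ~80

assert abs(fc_drive - 0.8) < 1e-9
assert abs(desktop_sata - 80.0) < 1e-6
```

So a single full read of a petabyte-scale filesystem is already in the territory where at least one bit error is expected, which is the "same order of magnitude" argument.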
[snip stuff about raid6, drive data, I/O path corruptions, etc]
In summary you are effectively saying this: "if you spend enough
money on your storage, then the filesystem doesn't need to worry
about integrity."
I've heard exactly the same lecture you've just given from other
(ex-)XFS engineers that integrity is the total responsibility of the
block device. SGI used to ensure that XFS only ran on hardware that
followed this mantra and so could get away with that approach to
filesystem error detection.
But XFS doesn't live in that world any more. It stopped being true
when XFS got ported to linux. XFS lives in the world of commodity
hardware as well as the high end now and we are getting more and more
situations where we are having to make tradeoffs for preventing
corruption on commodity hardware. e.g. I/O barrier support for disks
with volatile write caches.
IMO, continuing down this same "the block device is perfect" path is
a "head in the sand" approach. By ignoring the fact that errors can
and do occur, we're screwing ourselves when something does actually
go wrong because we haven't put in place the mechanisms to detect
errors because we've assumed they will never happen.
We've spent 15 years so far trying to work out what has gone wrong
in XFS by adding more and more reactive debug into the code without
an eye to a robust solution. We add a chunk of code here to detect
that problem, a chunk of code there to detect this problem, and so
on. It's just not good enough anymore.
Like good security, filesystem integrity is not provided by a single
mechanism. "Defense in depth" is what we are aiming to provide here
and to do that you have to assume that errors can propagate through
every interface into the filesystem.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-07-27 6:59 ` David Chinner
@ 2007-08-01 0:49 ` Michael Nishimoto
2007-08-01 2:24 ` David Chinner
0 siblings, 1 reply; 18+ messages in thread
From: Michael Nishimoto @ 2007-08-01 0:49 UTC (permalink / raw)
To: David Chinner; +Cc: markgw, xfs-dev, xfs-oss
David Chinner wrote:
> On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> > The log checksum code has not been used since the
> > development phase of xfs. It did work at one point because I
> > remember using it and then decided to disable it and use just
> > the current cycle stamping technique. The checksum code was
> > just advisory, so I could see if it ever occurred during
> > development.
> >
> > When a CRC error is found, your suggestion is correct. Recovery
> > should backup and process only completely good log records. The code
> > backs up in this same fashion when it encounters a region of
> > missing sector updates because of the async nature of log
> > writes and disk caches.
>
> Yes, but that's usually only in the last 8 log-buffers worth of
> the log that the hole exists in (i.e. 256k by default). However, if
> the tail block has a CRC error, we've got to throw away the entire
> log and that, like zeroing a dirty log from xfs_repair, generally
> results in a corrupted filesystem image.
>
> An example of where this could be a problem is reusing a just-freed
> extent. Before reusing it we force the log to get the transaction on
> disk and rely on log replay to ensure that the block is freed in the
> event of a crash. We then go and write over the contents of the
> block. If that log transaction is not replayed (that freed the
> extent) then we've overwritten the previous contents of that extent
> and so the "current" contents of the extent after log replay are wrong.
>
> IOWs, I think that if we come across a bad CRC in a log record we
> can replay up to that point but we've still got to abort the mount
> and ask the user to run repair....
>
> > At this point, I'm not convinced that xfs needs to do CRCs on
> > the xfs log because the size of an xfs log is relatively small.
>
> Sure, but the same argument can be made about the superblock,
> or an AGF and a directory block. That doesn't mean that they'll
> never have an error.
>
> Statistically speaking, the log contains the blocks in the
> filesystem we most frequently do I/O to, so it's the most likely
> region to see an I/O path induced bit error. If we see one
> on recovery......
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
You are correct. I didn't need the example -- just a reminder that
CRC errors can occur anywhere, not just in the last 8 log-buffers
worth of data. ;-)
And yes, repair is necessary now because we are no longer replaying
a transaction which we thought was committed.
I have a request. Can this be made a mkfs.xfs option, so it can be
disabled?
What are your plans for adding CRCs to other metadata objects?
Michael
* Re: RFC: log record CRC validation
2007-07-31 5:30 ` David Chinner
@ 2007-08-01 1:32 ` William J. Earl
2007-08-01 10:02 ` David Chinner
0 siblings, 1 reply; 18+ messages in thread
From: William J. Earl @ 2007-08-01 1:32 UTC (permalink / raw)
To: David Chinner; +Cc: xfs-oss, Michael Nishimoto, markgw
David Chinner wrote:
> On Fri, Jul 27, 2007 at 07:00:32PM -0700, William J. Earl wrote:
>
>> David Chinner wrote:
>>
>>> ...
>>> The size of high-end filesystems are now at the same order of
>>> magnitude as the bit error rate of the storage hardware. e.g. 1PB =
>>> 10^16 bits. The bit error rate of high end FC drives? 1 in 10^16
>>> bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
>>> SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).
>>>
>> First, note that the claimed bit error rates are rates of
>> reported bad blocks, not rates of silent data corruption. The latter,
>> while not quoted, are far lower.
>>
>
> Ok, fair enough, but in the absense of numbers and the fact that
> real world MTBF numbers are lower than what mfg's quote I'm
> always going to assume that this is the ballpark.
>
>
The real world MTBF numbers are worse for people who use drives
outside their specified parameters (as at Google) and better, at least
in the first few years, for drives which are used inside their
parameters, as far as I have seen at Agami and my previous company.
Drive MTBF does not relate, however, to data corruption, except in the
case of a failed RAID reconstruction, and even then the error is
reported.
> ...
>
> IMO, continuing down this same "the block device is perfect" path is
> a "head in the sand" approach. By ignoring the fact that errors can
> and do occur, we're screwing ourselves when something does actually
> go wrong because we haven't put in place the mechanisms to detect
> errors because we've assumed they will never happen.
>
> We've spent 15 years so far trying to work out what has gone wrong
> in XFS by adding more and more reactive debug into the code without
> an eye to a robust solution. We add a chunk of code here to detect
> that problem, a chunk of code there to detect this problem, and so
> on. It's just not good enough anymore.
>
> Like good security, filesystem integrity is not provided by a single
> mechanism. "Defense in depth" is what we are aiming to provide here
> and to do that you have to assume that errors can propagate through
> every interface into the filesystem.
>
>
>
I understand your argument, but why not simply strengthen the
block layer, even if you do it with an optional XFS-based checksum scheme
on all blocks? That way, you would not wind up detecting metadata
corruption while silently ignoring file data corruption. For example,
suppose you stole one block in N (where you might choose N according to
the RAID data stripe size, when running over MD), and used it as a
checksum block (storing (N-1)*K subblock checksums)? This in effect
would require either RAID full-stripe read-modify-write or at least an
extra block read-modify-write for a real block write, but it would give
you complete metadata and data integrity checking. This could be an
XFS feature or a new MD feature ("checksum" layer).
This would clearly be somewhat expensive for random writes,
much like RAID 6, and also expensive for random reads, unless N were
equal to the RAID block size, but, as with the NetApp and Tandem
software checksums, it would assure a high level of data integrity.
If the RAID block size were moderately large, say 64 KB, then you
could take one 4 KB block in 16 and pay only about 6% of the total
space. With modern disks, the extra sequential transfers (even if you
have to read 64 KB to get 4 KB) would not be that significant, since
rotational latency is the largest cost, not the actual data transfer.
You would also use more main memory, since you would prefer to buffer at
least the 4 KB checksum block associated with any 4 KB data blocks from
the RAID block. The extra memory would be between 6% and 100%,
depending on how much locality there is in block accesses. This in
turn would cause more disk accesses, due to the cache holding fewer real
data blocks.
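To make the space arithmetic concrete, here is a quick sketch of the layout just described (Python for brevity; the 512-byte subblock size and 4-byte CRC width are illustrative assumptions, not part of the proposal):

```python
# One 4 KiB block per 64 KiB RAID block is stolen to hold the checksums
# of the other N-1 blocks, split into K subblocks each.
RAID_BLOCK = 64 * 1024
BLOCK = 4 * 1024
SUBBLOCK = 512
CRC_SIZE = 4                      # one CRC32 per subblock (assumed)

n = RAID_BLOCK // BLOCK           # 16 blocks per RAID block
data_blocks = n - 1               # 15 carry data, 1 holds checksums
k = BLOCK // SUBBLOCK             # 8 subblocks per block
subblocks = data_blocks * k       # the (N-1)*K checksums to store
overhead = 1 / n                  # space given up to checksumming

assert n == 16
assert subblocks == 120
assert subblocks * CRC_SIZE <= BLOCK   # checksums fit in the stolen block
assert round(overhead * 100, 2) == 6.25
```

The 1-in-16 layout gives the "about 6%" space cost mentioned above, and the checksums for a whole RAID block fit comfortably in the one stolen block.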
* Re: RFC: log record CRC validation
2007-08-01 0:49 ` Michael Nishimoto
@ 2007-08-01 2:24 ` David Chinner
2007-08-01 2:36 ` Barry Naujok
2007-08-01 12:11 ` Andi Kleen
0 siblings, 2 replies; 18+ messages in thread
From: David Chinner @ 2007-08-01 2:24 UTC (permalink / raw)
To: Michael Nishimoto; +Cc: David Chinner, markgw, xfs-dev, xfs-oss
On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
> David Chinner wrote:
> >On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> > > When a CRC error is found, your suggestion is correct. Recovery
> > > should backup and process only completely good log records. The code
> > > backs up in this same fashion when it encounters a region of
> > > missing sector updates because of the async nature of log
> > > writes and disk caches.
......
> >IOWs, I think that if we come across a bad CRC in a log record we
> >can replay up to that point but we've still got to abort the mount
> >and ask the user to run repair....
.....
> You are correct. I didn't need the example -- just a reminder that
> CRC errors can occur anywhere, not just in the last 8 log-buffers
> worth of data. ;-)
>
> And yes, repair is necessary now because we are no longer replaying
> a transaction which we thought was committed.
Ok, I'll make the mount replay up to previous good record and then
abort with -EUCLEAN.
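A minimal sketch of that policy (invented names and a toy checksum for illustration; this is not the actual XFS recovery code):

```python
EUCLEAN = 117  # Linux errno: "Structure needs cleaning"

def toy_crc(payload):
    return sum(payload) & 0xFFFF  # stand-in for a real CRC

def replay_log(records, apply_record):
    """records: list of (payload, stored_crc). Replay the good prefix,
    then stop at the first CRC mismatch and report -EUCLEAN."""
    for replayed, (payload, stored_crc) in enumerate(records):
        if toy_crc(payload) != stored_crc:
            # Trim the head back to the last good record: everything
            # before this record is applied, everything after is suspect.
            return -EUCLEAN, replayed
        apply_record(payload)
    return 0, len(records)
```

The caller (the mount path, here) would see -EUCLEAN, know the good prefix was replayed, and refuse to mount until repair has run.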
> I have a request. Can this be made a mkfs.xfs option, so it can be
> disabled?
It's a definite possibility, because....
> What are your plans for adding CRCs to other metadata objects?
Other objects will require some on-disk format change, and that will
require a feature bit to be set.
Basically, we can add CRCs to individual inodes without a version bump
as we have some empty space in the v2 inode core. That will make v2 inodes
the default, but we already have a superblock bit for that and all versions
of Linux support v2 inodes so there is no issue there. This will require
userspace tool changes, though, because tools like repair will revert v2
inodes back to v1 format if they are not using project IDs or the link count
fits into 16 bits.....
I haven't looked in great depth at other structures in terms of
implementation details. I know that if we use a 16 bit CRC on
directories we can get away without an on-disk format change as the
xfs_da_blkinfo structure has 16 bits of padding. However, given that
directory block size can reach 64k, a CRC16 check is really only
capable of single bit error detection. Hence I think we really need
CRC32 here, which means an on-disk format change.
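The rough numbers behind that choice, as a sketch (this uses the usual approximation that a random corruption slips past a CRC with probability about 2^-width; the guaranteed burst/bit detection properties additionally depend on the polynomial and block length):

```python
# Fraction of random corruptions a CRC of the given width fails to
# detect, under the standard 2^-width approximation.
def undetected_fraction(crc_bits):
    return 2.0 ** -crc_bits

assert undetected_fraction(16) == 1 / 65536
assert undetected_fraction(32) == 1 / 4294967296
# CRC32 shrinks the undetected-error window by a factor of 2^16:
assert undetected_fraction(16) / undetected_fraction(32) == 65536
```

On a 64k directory block, the 16-bit window is uncomfortably large, hence the preference for CRC32 even at the cost of a format change.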
For the other types of btree blocks we will also need on-disk
changes as there is no padding we can use to hold a CRC in the
headers. Adding CRCs to the SB, AGF, AGI and AFGL is trivial but
really needs a feature bit to say they are there.
So, yes, this sort of change will require a feature bit and that
means it would be a mkfs option. I'd prefer to have all the changes
ready to go before releasing them into the wild so that we only
need a single feature bit, though there is the possibility of
separating the directory changes out into "version 3" directories.
Conversion of existing filesystems is harder because of the
repacking of structures required. If we are going to convert
filesystems, I would expect it to be done offline by xfs_repair
(Hi Barry!). Given that it will already need to be modified to
validate CRCs, I think this is probably the best approach to this
problem right now.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-08-01 2:24 ` David Chinner
@ 2007-08-01 2:36 ` Barry Naujok
2007-08-01 2:43 ` David Chinner
2007-08-01 12:11 ` Andi Kleen
1 sibling, 1 reply; 18+ messages in thread
From: Barry Naujok @ 2007-08-01 2:36 UTC (permalink / raw)
To: David Chinner, Michael Nishimoto; +Cc: markgw, xfs-dev, xfs-oss
On Wed, 01 Aug 2007 12:24:18 +1000, David Chinner <dgc@sgi.com> wrote:
> On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
>
>> What are your plans for adding CRCs to other metadata objects?
>
> Other objects will require some on-disk format change, and that will
> require a feature bit to be set.
>
> Basically, we can add CRCs to individual inodes without a version bump
> as we have some empty space in the v2 inode core. That will make v2
> inodes
> the default, but we already have a superblock bit for that and all
> versions
> of linux support v2 inodes so there is no issue there. This will require
> userspace tool changes, though, because tools like repair will revert v2
> inodes back to v1 format if the are not using project id's or the link
> count
> fits into 16 bits.....
No, repair does not revert v2 inodes back to v1. Currently, inodes
are created as v1 in the kernel and moved to v2 as required.
Barry.
* Re: RFC: log record CRC validation
2007-08-01 2:36 ` Barry Naujok
@ 2007-08-01 2:43 ` David Chinner
0 siblings, 0 replies; 18+ messages in thread
From: David Chinner @ 2007-08-01 2:43 UTC (permalink / raw)
To: Barry Naujok; +Cc: David Chinner, Michael Nishimoto, markgw, xfs-dev, xfs-oss
On Wed, Aug 01, 2007 at 12:36:21PM +1000, Barry Naujok wrote:
> On Wed, 01 Aug 2007 12:24:18 +1000, David Chinner <dgc@sgi.com> wrote:
>
> >On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
> >
> >>What are your plans for adding CRCs to other metadata objects?
> >
> >Other objects will require some on-disk format change, and that will
> >require a feature bit to be set.
> >
> >Basically, we can add CRCs to individual inodes without a version bump
> >as we have some empty space in the v2 inode core. That will make v2
> >inodes
> >the default, but we already have a superblock bit for that and all
> >versions
> >of linux support v2 inodes so there is no issue there. This will require
> >userspace tool changes, though, because tools like repair will revert v2
> >inodes back to v1 format if the are not using project id's or the link
> >count
> >fits into 16 bits.....
>
> No, repair does not revert v2 inodes back to v1. Currently, inodes
> are created as v1 in the kernel and moved to v2 as required.
Ah, yes, you're right - it's too easy to make control-flow mistakes in
1200-line functions....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-08-01 1:32 ` William J. Earl
@ 2007-08-01 10:02 ` David Chinner
0 siblings, 0 replies; 18+ messages in thread
From: David Chinner @ 2007-08-01 10:02 UTC (permalink / raw)
To: William J. Earl; +Cc: David Chinner, xfs-oss, Michael Nishimoto, markgw
On Tue, Jul 31, 2007 at 06:32:59PM -0700, William J. Earl wrote:
> David Chinner wrote:
> >IMO, continuing down this same "the block device is perfect" path is
> >a "head in the sand" approach. By ignoring the fact that errors can
> >and do occur, we're screwing ourselves when something does actually
> >go wrong because we haven't put in place the mechanisms to detect
> >errors because we've assumed they will never happen.
> >
> >We've spent 15 years so far trying to work out what has gone wrong
> >in XFS by adding more and more reactive debug into the code without
> >an eye to a robust solution. We add a chunk of code here to detect
> >that problem, a chunk of code there to detect this problem, and so
> >on. It's just not good enough anymore.
> >
> >Like good security, filesystem integrity is not provided by a single
> >mechanism. "Defense in depth" is what we are aiming to provide here
> >and to do that you have to assume that errors can propagate through
> >every interface into the filesystem.
> >
> >
> >
> I understand your argument, but why not simply strengthen the
> block layer, even you do it with an optional XFS-based checksum scheme
> on all blocks?
You could probably do this using dm-crypt and an integrity hash.
Unfortunately, I don't trust dm-crypt as far as I can throw it
because we've been severely bitten by bugs in dm-crypt. Specifically
the bug that existed from 2.6.14 through to 2.6.19 where it reported
success to readahead bios that had been cancelled due to block
device congestion and hence returning uninitialised bios as
"complete" and error free. This resulted in XFS shutdowns because
the only code in the entire of Linux that triggered this problem was
the XFS btree readahead. Every time this occurred the finger was
pointed at XFS because it detected the corruption and we had *zero*
information telling us what the problem might have been. A btree
block CRC failure would have *immediately* told us the block layer
had returned us bad data.
This is where I come back to defense in depth. Block device level
integrity checking is not sufficient as bugs in this layer still
need to be caught by the filesystem. I stand by this even if I
implemented the block layer code myself - I don't write perfect code
and nobody else does, either. Hence we *must* assume that the layer
that returned the block is imperfect even if we wrote it....
> That way, you would not wind up detecting metadata
> corruption and silently ignoring file data corruption
Data integrity has different selection criteria compared to ensuring
structural integrity of the filesystem. That is, different users have
different requirements for data integrity and performance. e.g. bit
errors in video and audio data don't matter but performance does,
yet we still need to protect the filesystem metadata against the
same bit errors.
Hence, IMO, we really do need to treat them separately to retain
flexibility in the filesystem for different applications. I would
prefer to aim for per-inode selection of data CRCs and allow
inheritance of the attribute through the directory hierarchy. That
way the behaviour is user selectable for the subset of the files
that the system owner cares about enough to protect. If you
want to protect the entire filesystem, set the flag on the
root directory at mkfs time....
> For example,
> suppose you stole one block in N (where you might choose N according to
> the RAID data stripe size, when running over MD), and used it as a
> checksum block (storing (N-1)*K subblock checksums)? This in effect
> would require either RAID full-stripe read-modify-write or at least an
> extra block read-modify-write for a real block write, but it would give
> you complete metadata and data integrity checking. This could be an
> XFS feature or a new MD feature ("checksum" layer).
dm-crypt. But, see above.
> This would clearly be somewhat expensive for random writes,
> much like RAID 6, and also expensive for random reads, unless the N were
> equal to the RAID block size, but, as with the NetApp and Tandem
> software checksums, it would assure that a high level of data integrity.
Good integrity, but random writes are one of XFS's weaknesses
and this would just make it worse. NetApp leverages both WAFL's
linearisation of random writes and specialised hardware (NVRAM)
to avoid this performance problem altogether, but we can't. It's
a showstopper, IMO.
No matter which way I look at it, block layer integrity checking
provides insufficient correctness guarantees while neutralising
XFS's strengths and magnifying its weaknesses. To me it just doesn't
make sense for XFS to go down this path when there are other
options that don't have these drawbacks.
Basically, it makes much more sense to me to break the problem down
into its component pieces and provide protection for each of those
pieces one at a time without introducing new issues. This way we
don't compromise the strengths of XFS or reduce the flexibility of
the filesystem in any way whilst providing better protection against
errors that cause corruption.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
2007-08-01 2:24 ` David Chinner
2007-08-01 2:36 ` Barry Naujok
@ 2007-08-01 12:11 ` Andi Kleen
1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2007-08-01 12:11 UTC (permalink / raw)
To: David Chinner; +Cc: Michael Nishimoto, markgw, xfs-dev, xfs-oss
David Chinner <dgc@sgi.com> writes:
> I haven't looked at great depth into other structures in terms of
> implementation details. I know that if we use a 16 bit CRC on
> directories we can get away without a on-disk format change as the
> xfs_da_blkinfo structure has 16 bits of padding. However, given that
> directory block size can reach 64k, a CRC16 check is really only
> capable of single bit error detection. Hence I think we really need
> CRC32 here which means an on-disk format change.
When the directory format is changed it would be nice to also support
DT_* types at the same time. They can speed up some operations nicely
because file system walkers can avoid a stat() (and seek to the inode)
just to find out if a name is a directory or not. Right now there is
no space for this unfortunately.
-Andi