* Re: RFC: log record CRC validation
From: Mark Goodwin @ 2007-07-25 10:14 UTC (permalink / raw)
To: David Chinner; +Cc: xfs-dev, xfs-oss

David Chinner wrote:
>
> The next question is the hard one. What do we do when we detect
> a log record CRC error? Right now it just warns and sets a flag
> in the log. I think it should probably prevent log replay from
> replaying past this point (i.e. trim the head back to the last
> good log record) but I'm not sure what the best thing to do here.
>
> Comments?

1. perhaps use a new flag XLOG_CRC_MISMATCH instead of XLOG_CHKSUM_MISMATCH
2. is there (or could there be if we added it) correction for n-bit errors?

-- Mark

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation
From: David Chinner @ 2007-07-26  5:55 UTC (permalink / raw)
To: Mark Goodwin; +Cc: xfs-dev, xfs-oss

On Wed, Jul 25, 2007 at 08:14:05PM +1000, Mark Goodwin wrote:
> David Chinner wrote:
> >
> > The next question is the hard one. What do we do when we detect
> > a log record CRC error? Right now it just warns and sets a flag
> > in the log. I think it should probably prevent log replay from
> > replaying past this point (i.e. trim the head back to the last
> > good log record) but I'm not sure what the best thing to do here.
> >
> > Comments?
>
> 1. perhaps use a new flag XLOG_CRC_MISMATCH instead of XLOG_CHKSUM_MISMATCH

Yeah, that's easy to do. What to do with the error is more important,
though.

> 2. is there (or could there be if we added it), correction for n-bit errors?

Nope. To do that, we'd need to implement some type of Reed-Solomon
coding and would need to use more bits on disk to store the ECC
data. That would have a much bigger impact on log throughput than a
table based CRC on a chunk of data that is hot in the CPU cache.
And we'd have to write the code as well. ;)

However, I'm not convinced that this sort of error correction is the
best thing to do at a high level as all the low level storage
already does Reed-Solomon based bit error correction. I'd much
prefer to use a different method of redundancy in the filesystem so
the error detection and correction schemes at different levels don't
have the same weaknesses.

That means the filesystem needs strong enough CRCs to detect bit
errors and sufficient structure validity checking to detect gross
errors. XFS already does pretty good structure checking; we don't
have bit error detection though....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
From: Andi Kleen @ 2007-07-26 23:01 UTC (permalink / raw)
To: David Chinner; +Cc: Mark Goodwin, xfs-dev, xfs-oss

David Chinner <dgc@sgi.com> writes:
>
> Nope. To do that, we'd need to implement some type of Reed-Solomon
> coding and would need to use more bits on disk to store the ECC
> data. That would have a much bigger impact on log throughput than a
> table based CRC on a chunk of data that is hot in the CPU cache.

Processing or rewriting cache hot data shouldn't be significantly
different in cost (assuming the basic CPU usage of the algorithms
is not too different); just the cache lines need to be already
exclusive which is likely the case with logs.

> And we'd have to write the code as well. ;)

Modern kernels have R-S functions in lib/reed_solomon. They
are used in some of the flash file systems. I haven't checked
how their performance compares to standard CRC though.

> However, I'm not convinced that this sort of error correction is the
> best thing to do at a high level as all the low level storage
> already does Reed-Solomon based bit error correction. I'd much
> prefer to use a different method of redundancy in the filesystem so
> the error detection and correction schemes at different levels don't
> have the same weaknesses.

Agreed. On the file system level the best way to handle this is
likely data duplicated on different blocks.

> That means the filesystem needs strong enough CRCs to detect bit
> errors and sufficient structure validity checking to detect gross
> errors. XFS already does pretty good structure checking; we don't

The trouble is that it tends to go to too drastic measures (shutdown)
if it detects any inconsistency.

-Andi
* Re: RFC: log record CRC validation
From: David Chinner @ 2007-07-26 23:50 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Chinner, Mark Goodwin, xfs-dev, xfs-oss

On Fri, Jul 27, 2007 at 01:01:15AM +0200, Andi Kleen wrote:
> David Chinner <dgc@sgi.com> writes:
> >
> > Nope. To do that, we'd need to implement some type of Reed-Solomon
> > coding and would need to use more bits on disk to store the ECC
> > data. That would have a much bigger impact on log throughput than a
> > table based CRC on a chunk of data that is hot in the CPU cache.
>
> Processing or rewriting cache hot data shouldn't be significantly
> different in cost (assuming the basic CPU usage of the algorithms
> is not too different); just the cache lines need to be already
> exclusive which is likely the case with logs.

*nod*

> > And we'd have to write the code as well. ;)
>
> Modern kernels have R-S functions in lib/reed_solomon. They
> are used in some of the flash file systems. I haven't checked
> how their performance compares to standard CRC though.

Ah, I didn't know that. I'll have a look at it....

Admittedly I didn't look all that hard because:

> > However, I'm not convinced that this sort of error correction is the
> > best thing to do at a high level as all the low level storage
> > already does Reed-Solomon based bit error correction. I'd much
> > prefer to use a different method of redundancy in the filesystem so
> > the error detection and correction schemes at different levels don't
> > have the same weaknesses.
>
> Agreed. On the file system level the best way to handle this is
> likely data duplicated on different blocks.

Yes, something like that. I haven't looked into all the potential
ways of providing redundancy yet - I'm still focussing on making
error detection more effective.

> > That means the filesystem needs strong enough CRCs to detect bit
> > errors and sufficient structure validity checking to detect gross
> > errors. XFS already does pretty good structure checking; we don't
>
> The trouble is that it tends to go to too drastic measures (shutdown)
> if it detects any inconsistency.

IMO, that's not drastic - it's the only sane thing to do in the
absence of redundant metadata that you can use to recover from. To
continue operations on a known corrupted filesystem risks making it
far, far worse, esp. if the corruption is in something like a free
space btree.

However, solving this is a separable problem - reliable error
correction comes after robust error detection....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
From: Michael Nishimoto @ 2007-07-26 17:53 UTC (permalink / raw)
To: markgw; +Cc: David Chinner, xfs-dev, xfs-oss

Mark Goodwin wrote:
> David Chinner wrote:
> >
> > The next question is the hard one. What do we do when we detect
> > a log record CRC error? Right now it just warns and sets a flag
> > in the log. I think it should probably prevent log replay from
> > replaying past this point (i.e. trim the head back to the last
> > good log record) but I'm not sure what the best thing to do here.
> >
> > Comments?
>
> 1. perhaps use a new flag XLOG_CRC_MISMATCH instead of
> XLOG_CHKSUM_MISMATCH
> 2. is there (or could there be if we added it), correction for n-bit
> errors?
>
> -- Mark

I don't see the original message in the archives.

Is CRC checking being added to xfs log data? If so, what data has
been collected to show that this needs to be added?

thanks,

Michael
* Re: RFC: log record CRC validation
From: David Chinner @ 2007-07-26 23:31 UTC (permalink / raw)
To: Michael Nishimoto; +Cc: markgw, David Chinner, xfs-dev, xfs-oss

On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote:
> Mark Goodwin wrote:
> > David Chinner wrote:
> > >
> > > The next question is the hard one. What do we do when we detect
> > > a log record CRC error? Right now it just warns and sets a flag
> > > in the log. I think it should probably prevent log replay from
> > > replaying past this point (i.e. trim the head back to the last
> > > good log record) but I'm not sure what the best thing to do here.
> > >
> > > Comments?
> >
> > 1. perhaps use a new flag XLOG_CRC_MISMATCH instead of
> > XLOG_CHKSUM_MISMATCH
> > 2. is there (or could there be if we added it), correction for n-bit
> > errors?
> >
> > -- Mark
>
> I don't see the original message in the archives.

Strange. I received the original email to a different address, so it
definitely got through the list daemon. I've put it inline at the
bottom of the mail.

> Is CRC checking being added to xfs log data?

Yes. It's a little-used debug option right now, and I'm planning on
making it default behaviour.

> If so, what data has been collected to show that this needs to be added?

The size of high-end filesystems is now at the same order of
magnitude as the bit error rate of the storage hardware. e.g. 1PB =
10^16 bits. The bit error rate of high end FC drives? 1 in 10^16
bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).

We've got filesystems capable of moving > 2 x 10^16 bits of data
*per day* and we see lots of instances of multi-TB arrays made out
of desktop SATA disks.
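To put numbers on that, the quoted rates multiply out directly. A
quick illustrative sketch (the 2 x 10^16 bits/day and the per-drive
bit error rates are the vendor figures above; this is simple expected
value arithmetic, nothing more):

```python
# Expected number of unrecoverable bit errors per day when moving
# 2 x 10^16 bits/day, at the vendor-quoted bit error rates above.
bits_per_day = 2e16

rates = {
    "high-end FC":     1e-16,   # 1 error in 10^16 bits
    "enterprise SATA": 1e-15,   # 1 error in 10^15 bits
    "desktop SATA":    1e-14,   # 1 error in 10^14 bits
}

for drive, ber in rates.items():
    expected = bits_per_day * ber
    print(f"{drive}: ~{expected:.0f} expected bit errors/day")
```

That is, on the order of a couple of errors per day even on the best
drives, and hundreds per day on desktop SATA at that I/O rate.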
Given the recent studies of long-term disk reliability, these vendor
figures are likely to be the best error rates we can hope for.....

IOWs, we don't need evidence to justify this sort of error detection
because simple maths says there are going to be errors. We have to
deal with that, and hence we are going to be adding CRC checking to
on-disk metadata structures so we can detect bit errors that would
otherwise go undetected and result in filesystem corruption.

This means that instead of getting shutdown reports for some strange
and unreproducible btree corruption, we'll get a shutdown for a CRC
failure on the btree block. It is very likely that this will occur
much earlier than a subtle btree corruption would otherwise be
detected and hence we'll be less likely to propagate errors around
the filesystem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

Date: Wed, 25 Jul 2007 19:24:45 +1000
From: David Chinner <dgc@sgi.com>
To: xfs-dev <xfs-dev@sgi.com>
Cc: xfs-oss <xfs@oss.sgi.com>
Subject: RFC: log record CRC validation

Folks,

I've just fixed up the never-used debug log record checksumming
code with an eye to permanently enabling it for production
filesystems.

Firstly, I updated the simple 32 bit wide XOR checksum to use the
crc32c module. This places a new dependency on XFS - it will now
depend on CONFIG_LIBCRC32C. I'm also not sure what the best method
to use is - the little endian or big endian CRC algorithm - so I
just went for the default (crc32c()).

This then resulted in recovery failing to verify the checksums,
and it turns out that is because xlog_pack_data() gets passed a
padded buffer and size to checksum (padded to 512 bytes), whereas
the unpacking (recovery) only checksummed the unpadded record
length. Hence this code probably never worked reliably if anyone
ever enabled it.
This does bring up a question - probably for Tim - do we only get
rounded to BBs or do we get rounded to the log stripe unit when
packing the log records before writeout? It seems from a quick test
that it is only BBs, but confirmation would be good....

The next question is the hard one. What do we do when we detect
a log record CRC error? Right now it just warns and sets a flag
in the log. I think it should probably prevent log replay from
replaying past this point (i.e. trim the head back to the last
good log record) but I'm not sure what the best thing to do here.

Comments?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/Kconfig               |    2 -
 fs/xfs/linux-2.6/xfs_linux.h |    1
 fs/xfs/xfs_log_priv.h        |    2 -
 fs/xfs/xfs_log_recover.c     |   80 ++++++++++++++++++-------------------
 4 files changed, 37 insertions(+), 48 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_priv.h	2007-07-23 11:09:52.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h	2007-07-25 11:43:02.706927249 +1000
@@ -329,7 +329,7 @@ typedef struct xlog_rec_header {
 	int	  h_len;	/* len in bytes; should be 64-bit aligned: 4 */
 	xfs_lsn_t h_lsn;	/* lsn of this LR			: 8 */
 	xfs_lsn_t h_tail_lsn;	/* lsn of 1st LR w/ buffers not committed: 8 */
-	uint	  h_chksum;	/* may not be used; non-zero if used	: 4 */
+	__be32	  h_crc;	/* log record crc value			: 4 */
 	int	  h_prev_block; /* block number to previous LR		: 4 */
 	int	  h_num_logops;	/* number of log operations in this LR	: 4 */
 	uint	  h_cycle_data[XLOG_HEADER_CYCLE_SIZE / BBSIZE];
Index: 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_recover.c	2007-07-25 10:42:54.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c	2007-07-25 19:10:12.185803348 +1000
@@ -3302,28 +3302,18 @@ xlog_recover_process_iunlinks(
 }
 
-#ifdef DEBUG
+#define XLOG_CHKSUM_SEED	0x4c585343	/* 'XLCS' */
 STATIC void
-xlog_pack_data_checksum(
+xlog_pack_data_crc(
 	xlog_t		*log,
 	xlog_in_core_t	*iclog,
 	int		size)
 {
-	int		i;
-	uint		*up;
-	uint		chksum = 0;
-
-	up = (uint *)iclog->ic_datap;
-	/* divide length by 4 to get # words */
-	for (i = 0; i < (size >> 2); i++) {
-		chksum ^= INT_GET(*up, ARCH_CONVERT);
-		up++;
-	}
-	INT_SET(iclog->ic_header.h_chksum, ARCH_CONVERT, chksum);
+	u32		crc;
+
+	crc = crc32c(XLOG_CHKSUM_SEED, iclog->ic_datap, size);
+	iclog->ic_header.h_crc = cpu_to_be32(crc);
 }
-#else
-#define xlog_pack_data_checksum(log, iclog, size)
-#endif
 
 /*
  * Stamp cycle number in every block
@@ -3340,7 +3330,7 @@ xlog_pack_data(
 	xfs_caddr_t	dp;
 	xlog_in_core_2_t *xhdr;
 
-	xlog_pack_data_checksum(log, iclog, size);
+	xlog_pack_data_crc(log, iclog, size);
 
 	cycle_lsn = CYCLE_LSN_DISK(iclog->ic_header.h_lsn);
 
@@ -3368,41 +3358,38 @@
 	}
 }
 
-#if defined(DEBUG) && defined(XFS_LOUD_RECOVERY)
 STATIC void
-xlog_unpack_data_checksum(
+xlog_unpack_data_crc(
 	xlog_rec_header_t	*rhead,
 	xfs_caddr_t		dp,
 	xlog_t			*log)
 {
-	uint			*up = (uint *)dp;
-	uint			chksum = 0;
-	int			i;
-
-	/* divide length by 4 to get # words */
-	for (i=0; i < INT_GET(rhead->h_len, ARCH_CONVERT) >> 2; i++) {
-		chksum ^= INT_GET(*up, ARCH_CONVERT);
-		up++;
-	}
-	if (chksum != INT_GET(rhead->h_chksum, ARCH_CONVERT)) {
-		if (rhead->h_chksum ||
-		    ((log->l_flags & XLOG_CHKSUM_MISMATCH) == 0)) {
-			cmn_err(CE_DEBUG,
-			    "XFS: LogR chksum mismatch: was (0x%x) is (0x%x)\n",
-				INT_GET(rhead->h_chksum, ARCH_CONVERT), chksum);
-			cmn_err(CE_DEBUG,
-"XFS: Disregard message if filesystem was created with non-DEBUG kernel");
-			if (XFS_SB_VERSION_HASLOGV2(&log->l_mp->m_sb)) {
-				cmn_err(CE_DEBUG,
-					"XFS: LogR this is a LogV2 filesystem\n");
-			}
-			log->l_flags |= XLOG_CHKSUM_MISMATCH;
-		}
+	u32			hdr_crc;
+	u32			crc;
+	int			size;
+
+	/* XXX: XLOG_LSUNITTOB for v2 logs? Initial tests
+	 * seem to indicate it's not needed...
+	 */
+	size = BBTOB(BTOBB(INT_GET(rhead->h_len, ARCH_CONVERT)));
+	hdr_crc = be32_to_cpu(rhead->h_crc);
+	crc = crc32c(XLOG_CHKSUM_SEED, dp, size);
+	if (hdr_crc != crc) {
+		cmn_err(CE_ALERT, "XFS Recovery: Log Record CRC error: "
+			"was (0x%x), computed (0x%x), size 0x%x.\n",
+			hdr_crc, crc, size);
+		print_hex_dump(KERN_ALERT, "record: ", 0, 32, 1, dp, 32, 0);
+
+		/*
+		 * XXX: What should we do here? if we've detected a
+		 * log record corruption, then we can't recovery past
+		 * this point, right?
+		 *
+		 * Leave it to a higher layer to detect this flag and
+		 * act appropriately?
+		 */
+		log->l_flags |= XLOG_CHKSUM_MISMATCH;
 	}
 }
-#else
-#define xlog_unpack_data_checksum(rhead, dp, log)
-#endif
 
 STATIC void
 xlog_unpack_data(
@@ -3412,6 +3399,7 @@ xlog_unpack_data(
 {
 	int			i, j, k;
 	xlog_in_core_2_t	*xhdr;
+	xfs_caddr_t		odp = dp;
 
 	for (i = 0; i < BTOBB(INT_GET(rhead->h_len, ARCH_CONVERT)) &&
 		  i < (XLOG_HEADER_CYCLE_SIZE / BBSIZE); i++) {
@@ -3429,7 +3417,7 @@
 		}
 	}
 
-	xlog_unpack_data_checksum(rhead, dp, log);
+	xlog_unpack_data_crc(rhead, odp, log);
 }
 
 STATIC int
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_linux.h	2007-07-23 17:16:56.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h	2007-07-25 11:47:07.411080498 +1000
@@ -76,6 +76,7 @@
 #include <linux/notifier.h>
 #include <linux/delay.h>
 #include <linux/log2.h>
+#include <linux/crc32c.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
Index: 2.6.x-xfs-new/fs/xfs/Kconfig
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/Kconfig	2007-01-16 10:54:14.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/Kconfig	2007-07-25 12:16:50.804060182 +1000
@@ -1,6 +1,6 @@
 config XFS_FS
 	tristate "XFS filesystem support"
-	depends on BLOCK
+	depends on BLOCK && LIBCRC32C
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform.  It is completely multi-threaded, can
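For reference, the crc32c() the patch calls is the CRC-32C
(Castagnoli) algorithm provided by the kernel's LIBCRC32C. A bitwise
sketch of the same computation in Python, for illustration only - the
kernel implementation is table-driven, and this sketch assumes (as
lib/libcrc32c does) that the seed is used directly as the initial
remainder:

```python
CRC32C_POLY = 0x82F63B78       # reflected form of the Castagnoli polynomial
XLOG_CHKSUM_SEED = 0x4C585343  # 'XLCS', the seed used in the patch above

def crc32c(crc, data):
    """Bit-at-a-time CRC-32C over data, starting from the given seed."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # shift right; fold in the polynomial when a bit falls off
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc

# Sanity check against the standard CRC-32C check value
# (init ~0, final xor ~0 over the bytes "123456789").
assert crc32c(0xFFFFFFFF, b"123456789") ^ 0xFFFFFFFF == 0xE3069283
```

Seeding with a filesystem-specific constant like 'XLCS' rather than ~0
means a record checksummed by a different user of crc32c() with the
same data will still produce a different value.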
* Re: RFC: log record CRC validation
From: Michael Nishimoto @ 2007-07-27  1:24 UTC (permalink / raw)
To: David Chinner; +Cc: markgw, xfs-dev, xfs-oss

The log checksum code has not been used since the development phase
of xfs. It did work at one point because I remember using it and
then decided to disable it and use just the current cycle stamping
technique. The checksum code was just advisory, so I could see if it
ever occurred during development.

When a CRC error is found, your suggestion is correct. Recovery
should back up and process only completely good log records. The
code backs up in this same fashion when it encounters a region of
missing sector updates because of the async nature of log writes and
disk caches.

At this point, I'm not convinced that xfs needs to do CRCs on the
xfs log because the size of an xfs log is relatively small.

Michael

> Date: Wed, 25 Jul 2007 19:24:45 +1000
> From: David Chinner <dgc@sgi.com>
> To: xfs-dev <xfs-dev@sgi.com>
> Cc: xfs-oss <xfs@oss.sgi.com>
> Subject: RFC: log record CRC validation
>
> Folks,
>
> I've just fixed up the never-used debug log record checksumming
> code with an eye to permanently enabling it for production
> filesystems.
>
> Firstly, I updated the simple 32 bit wide XOR checksum to use the
> crc32c module. This places a new dependency on XFS - it will now
> depend on CONFIG_LIBCRC32C. I'm also not sure what the best method
> to use is - the little endian or big endian CRC algorithm - so I
> just went for the default (crc32c()).
>
> This then resulted in recovery failing to verify the checksums,
> and it turns out that is because xlog_pack_data() gets passed a
> padded buffer and size to checksum (padded to 512 bytes), whereas
> the unpacking (recovery) only checksummed the unpadded record
> length. Hence this code probably never worked reliably if anyone
> ever enabled it.
>
> This does bring up a question - probably for Tim - do we only get
> rounded to BBs or do we get rounded to the log stripe unit when
> packing the log records before writeout? It seems from a quick test
> that it is only BBs, but confirmation would be good....
>
> The next question is the hard one. What do we do when we detect
> a log record CRC error? Right now it just warns and sets a flag
> in the log. I think it should probably prevent log replay from
> replaying past this point (i.e. trim the head back to the last
> good log record) but I'm not sure what the best thing to do here.
>
> Comments?
>
> Cheers,
>
> Dave.
* Re: RFC: log record CRC validation
From: David Chinner @ 2007-07-27  6:59 UTC (permalink / raw)
To: Michael Nishimoto; +Cc: David Chinner, markgw, xfs-dev, xfs-oss

On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> The log checksum code has not been used since the
> development phase of xfs. It did work at one point because I
> remember using it and then decided to disable it and use just
> the current cycle stamping technique. The checksum code was
> just advisory, so I could see if it ever occurred during
> development.
>
> When a CRC error is found, your suggestion is correct. Recovery
> should back up and process only completely good log records. The code
> backs up in this same fashion when it encounters a region of
> missing sector updates because of the async nature of log
> writes and disk caches.

Yes, but that's usually only in the last 8 log-buffers worth of the
log that the hole exists in (i.e. 256k by default). However, if the
tail block has a CRC error, we've got to throw away the entire log
and that, like zeroing a dirty log from xfs_repair, generally
results in a corrupted filesystem image.

An example of where this could be a problem is reusing a just-freed
extent. Before reusing it we force the log to get the transaction on
disk and rely on log replay to ensure that the block is freed in the
event of a crash. We then go and write over the contents of the
block. If that log transaction is not replayed (that freed the
extent) then we've overwritten the previous contents of that extent
and so the "current" contents of the extent after log replay are
wrong.

IOWs, I think that if we come across a bad CRC in a log record we
can replay up to that point but we've still got to abort the mount
and ask the user to run repair....

> At this point, I'm not convinced that xfs needs to do CRCs on
> the xfs log because the size of an xfs log is relatively small.

Sure, but the same argument can be made about the superblock, or an
AGF and a directory block. That doesn't mean that they'll never have
an error.

Statistically speaking, the log contains the blocks in the
filesystem we most frequently do I/O to, so it's the most likely
region to see an I/O path induced bit error. If we see one on
recovery......

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: log record CRC validation
From: Michael Nishimoto @ 2007-08-01  0:49 UTC (permalink / raw)
To: David Chinner; +Cc: markgw, xfs-dev, xfs-oss

David Chinner wrote:
> On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> > The log checksum code has not been used since the
> > development phase of xfs. It did work at one point because I
> > remember using it and then decided to disable it and use just
> > the current cycle stamping technique. The checksum code was
> > just advisory, so I could see if it ever occurred during
> > development.
> >
> > When a CRC error is found, your suggestion is correct. Recovery
> > should back up and process only completely good log records. The code
> > backs up in this same fashion when it encounters a region of
> > missing sector updates because of the async nature of log
> > writes and disk caches.
>
> Yes, but that's usually only in the last 8 log-buffers worth of the
> log that the hole exists in (i.e. 256k by default). However, if the
> tail block has a CRC error, we've got to throw away the entire log
> and that, like zeroing a dirty log from xfs_repair, generally
> results in a corrupted filesystem image.
>
> An example of where this could be a problem is reusing a just-freed
> extent. Before reusing it we force the log to get the transaction on
> disk and rely on log replay to ensure that the block is freed in the
> event of a crash. We then go and write over the contents of the
> block. If that log transaction is not replayed (that freed the
> extent) then we've overwritten the previous contents of that extent
> and so the "current" contents of the extent after log replay are wrong.
>
> IOWs, I think that if we come across a bad CRC in a log record we
> can replay up to that point but we've still got to abort the mount
> and ask the user to run repair....
>
> > At this point, I'm not convinced that xfs needs to do CRCs on
> > the xfs log because the size of an xfs log is relatively small.
>
> Sure, but the same argument can be made about the superblock,
> or an AGF and a directory block. That doesn't mean that they'll
> never have an error.
>
> Statistically speaking, the log contains the blocks in the
> filesystem we most frequently do I/O to, so it's the most likely
> region to see an I/O path induced bit error. If we see one
> on recovery......
>
> Cheers,
>
> Dave.

You are correct. I didn't need the example -- just a reminder that
CRC errors can occur anywhere, not just in the last 8 log-buffers
worth of data. ;-)

And yes, repair is necessary now because we are no longer replaying
a transaction which we thought was committed.

I have a request. Can this be made a mkfs.xfs option, so it can be
disabled?

What are your plans for adding CRCs to other metadata objects?

Michael
* Re: RFC: log record CRC validation
From: David Chinner @ 2007-08-01  2:24 UTC (permalink / raw)
To: Michael Nishimoto; +Cc: David Chinner, markgw, xfs-dev, xfs-oss

On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
> David Chinner wrote:
> > On Thu, Jul 26, 2007 at 06:24:51PM -0700, Michael Nishimoto wrote:
> > > When a CRC error is found, your suggestion is correct. Recovery
> > > should back up and process only completely good log records. The code
> > > backs up in this same fashion when it encounters a region of
> > > missing sector updates because of the async nature of log
> > > writes and disk caches.
......
> > IOWs, I think that if we come across a bad CRC in a log record we
> > can replay up to that point but we've still got to abort the mount
> > and ask the user to run repair....
.....
> You are correct. I didn't need the example -- just a reminder that
> CRC errors can occur anywhere, not just in the last 8 log-buffers
> worth of data. ;-)
>
> And yes, repair is necessary now because we are no longer replaying
> a transaction which we thought was committed.

Ok, I'll make the mount replay up to the previous good record and
then abort with -EUCLEAN.

> I have a request. Can this be made a mkfs.xfs option, so it can be
> disabled?

It's a definite possibility, because....

> What are your plans for adding CRCs to other metadata objects?

Other objects will require some on-disk format change, and that will
require a feature bit to be set.

Basically, we can add CRCs to individual inodes without a version
bump as we have some empty space in the v2 inode core. That will
make v2 inodes the default, but we already have a superblock bit for
that and all versions of linux support v2 inodes so there is no
issue there. This will require userspace tool changes, though,
because tools like repair will revert v2 inodes back to v1 format if
they are not using project id's or the link count fits into 16
bits.....

I haven't looked in great depth into other structures in terms of
implementation details. I know that if we use a 16 bit CRC on
directories we can get away without an on-disk format change as the
xfs_da_blkinfo structure has 16 bits of padding. However, given that
directory block size can reach 64k, a CRC16 check is really only
capable of single bit error detection. Hence I think we really need
CRC32 here, which means an on-disk format change.

For the other types of btree blocks we will also need on-disk
changes as there is no padding we can use to hold a CRC in the
headers. Adding CRCs to the SB, AGF, AGI and AGFL is trivial but
really needs a feature bit to say they are there.

So, yes, this sort of change will require a feature bit and that
means it would be a mkfs option. I'd prefer to have all the changes
ready to go before releasing them into the wild so that we only need
a single feature bit, though there is the possibility of separating
the directory changes out into "version 3" directories.

Conversion of existing filesystems is harder because of the
repacking of structures required. If we are going to convert
filesystems, I would expect it to be done offline by xfs_repair (Hi
Barry!). Given that it will already need to be modified to validate
CRCs, I think this is probably the best approach to this problem
right now.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
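The CRC16-versus-CRC32 trade-off above can also be seen in the
undetected-error probability: gross random corruption of a block slips
past a CRC-n check with probability about 2^-n regardless of block
size (the burst and multi-bit guarantees depend on the particular
polynomial and the 64k block length, but this bound is generic). A
quick sketch of the numbers:

```python
# A random n-bit check value matches corrupted data with
# probability ~2^-n, independent of the size of the block.
p16 = 2.0 ** -16
p32 = 2.0 ** -32

print(f"CRC16 undetected-corruption rate: ~1 in {int(1 / p16)}")
print(f"CRC32 undetected-corruption rate: ~1 in {int(1 / p32)}")
```

That is roughly 1 in 65 thousand for CRC16 against 1 in 4.3 billion
for CRC32, which is why the 16 bits of padding in xfs_da_blkinfo are
not considered enough for 64k directory blocks.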
* Re: RFC: log record CRC validation
From: Barry Naujok @ 2007-08-01  2:36 UTC (permalink / raw)
To: David Chinner, Michael Nishimoto; +Cc: markgw, xfs-dev, xfs-oss

On Wed, 01 Aug 2007 12:24:18 +1000, David Chinner <dgc@sgi.com> wrote:
> On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote:
>
> > What are your plans for adding CRCs to other metadata objects?
>
> Other objects will require some on-disk format change, and that will
> require a feature bit to be set.
>
> Basically, we can add CRCs to individual inodes without a version
> bump as we have some empty space in the v2 inode core. That will
> make v2 inodes the default, but we already have a superblock bit
> for that and all versions of linux support v2 inodes so there is no
> issue there. This will require userspace tool changes, though,
> because tools like repair will revert v2 inodes back to v1 format
> if they are not using project id's or the link count fits into 16
> bits.....

No, repair does not revert v2 inodes back to v1. Currently, inodes
are created as v1 in the kernel and moved to v2 as required.

Barry.
* Re: RFC: log record CRC validation 2007-08-01 2:36 ` Barry Naujok @ 2007-08-01 2:43 ` David Chinner 0 siblings, 0 replies; 18+ messages in thread From: David Chinner @ 2007-08-01 2:43 UTC (permalink / raw) To: Barry Naujok; +Cc: David Chinner, Michael Nishimoto, markgw, xfs-dev, xfs-oss On Wed, Aug 01, 2007 at 12:36:21PM +1000, Barry Naujok wrote: > On Wed, 01 Aug 2007 12:24:18 +1000, David Chinner <dgc@sgi.com> wrote: > > >On Tue, Jul 31, 2007 at 05:49:18PM -0700, Michael Nishimoto wrote: > > > >>What are your plans for adding CRCs to other metadata objects? > > > >Other objects will require some on-disk format change, and that will > >require a feature bit to be set. > > > >Basically, we can add CRCs to individual inodes without a version bump > >as we have some empty space in the v2 inode core. That will make v2 > >inodes > >the default, but we already have a superblock bit for that and all > >versions > >of linux support v2 inodes so there is no issue there. This will require > >userspace tool changes, though, because tools like repair will revert v2 > >inodes back to v1 format if the are not using project id's or the link > >count > >fits into 16 bits..... > > No, repair does not revert v2 inodes back to v1. Currently, inodes > are created as v1 in the kernel and moved to v2 as required. Ah, yes, you're right - it's too easy to make flow mistakes in 1200 line functions.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation 2007-08-01 2:24 ` David Chinner 2007-08-01 2:36 ` Barry Naujok @ 2007-08-01 12:11 ` Andi Kleen 1 sibling, 0 replies; 18+ messages in thread From: Andi Kleen @ 2007-08-01 12:11 UTC (permalink / raw) To: David Chinner; +Cc: Michael Nishimoto, markgw, xfs-dev, xfs-oss David Chinner <dgc@sgi.com> writes: > I haven't looked at great depth into other structures in terms of > implementation details. I know that if we use a 16 bit CRC on > directories we can get away without a on-disk format change as the > xfs_da_blkinfo structure has 16 bits of padding. However, given that > directory block size can reach 64k, a CRC16 check is really only > capable of single bit error detection. Hence I think we really need > CRC32 here which means an on-disk format change. When the directory format is changed it would be nice to also support DT_* types at the same time. They can speed up some operations nicely because file system walkers can avoid a stat() (and seek to the inode) just to find out if a name is a directory or not. Right now there is no space for this unfortunately. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation 2007-07-26 23:31 ` David Chinner 2007-07-27 1:24 ` Michael Nishimoto @ 2007-07-28 2:00 ` William J. Earl 2007-07-28 14:03 ` Andi Kleen 2007-07-31 5:30 ` David Chinner 1 sibling, 2 replies; 18+ messages in thread From: William J. Earl @ 2007-07-28 2:00 UTC (permalink / raw) To: xfs-oss; +Cc: David Chinner, Michael Nishimoto, markgw David Chinner wrote: > On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote: > > ... >> Is CRC checking being added to xfs log data? >> > > Yes. It's a little-used debug option right now, and I'm > planning on making it default behaviour. > > >> If so, what data has been collected to show that this needs to be added? >> > > The size of high-end filesystems is now at the same order of > magnitude as the bit error rate of the storage hardware. e.g. 1PB = > 10^16 bits. The bit error rate of high end FC drives? 1 in 10^16 > bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop > SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB). > > We've got filesystems capable of moving > 2 x 10^16 bits of data > *per day* and we see lots of instances of multi-TB arrays made out > of desktop SATA disks. Given the recent studies of long-term disk > reliability, these vendor figures are likely to be the best error > rates we can hope for..... > > IOWs, we don't need evidence to justify this sort of error > detection because simple maths says there are going to be errors. We > have to deal with that, and hence we are going to be adding CRC > checking to on-disk metadata structures so we can detect bit errors > that would otherwise go undetected and result in filesystem > corruption. > > This means that instead of getting shutdown reports for some strange > and unreproducible btree corruption, we'll get a shutdown for a CRC > failure on the btree block. 
It is very likely that this will occur > much earlier than a subtle btree corruption would otherwise be > detected and hence we'll be less likely to propagate errors around > the filesystem. > ... Mike Nishimoto pointed out this thread to me, and suggested I reply, since I have worked on analyzing disk failure modes. First, note that the claimed bit error rates are rates of reported bad blocks, not rates of silent data corruption. The latter, while not quoted, are far lower. With large modern disks, it is unrealistic not to use RAID 1, RAID 5, or RAID 6 to mask reported disk bit errors. A CRC in a filesystem data structure, however, is not required to detect disk errors, even without RAID protection, since the disks report error correction failures. The rates of 1 in 10**14 (desktop SATA), 1 in 10**15 (Enterprise SATA and some FC), and 1 in 10**16 (some FC) bits read are detected and reported error rates. From conversations with drive vendors, these are actually fairly conservative, and assume you write the data once and read it after 10 years without rewriting it, as in an archive, which is the worst case, since you have ten years of accumulated deterioration. With RAID protection, seeing reported errors which are not masked by the RAID layer is extremely unlikely, except in the case of a drive failure, where it is necessary to read all of the surviving drives to reconstruct the contents of the lost drive. With 500 GB drives and a 7D+1P RAID 5 group, we need to read 3.5 TB to rebuild the RAID array using a replacement drive, which is to say about 2.8 * 10**13 bits. This implies we would see one block not reconstructed in every 3.5 rebuilds or so on desktop SATA drives, if the data were archive data written once and the rebuild happened after 10 years. We would expect about one rebuild in 10 years. With more RAID groups, the chance of data loss grows very high. 
With 100 groups, we would see a rebuild every month or so, so we would expect an unmasked read error every few months. With better drives, the chance is of course reduced, but is still significant in multi-PB storage systems. RAID 6 largely eliminates the chance of seeing a read error, at some cost in performance. Note that some people have reported both much higher annual failure rates for drives (which increases the frequency of RAID reconstructions and hence the chance of data loss) and higher read error rates. Based on personal experience with a large number of drives, I believe that both of these are a consequence of systems (including software and disk host bus adapters) dealing poorly with common transient disk problems, not actual errors on the surface of the disk. For example, if the drive firmware gets confused and stops talking, the controller will treat it as a read timeout, and the software may simply report "read failure", which in turn may be interpreted as "drive failed", even though an adequate drive reset would return the drive to service with no harm done to the data. In addition, many people have taken up using desktop drives for primary storage, for which they are not designed. Desktop drives are typically rated at 800,000 to 1,000,000 hours MTBF at 30% duty cycle. Using them at 100% duty cycle drastically decreases their MTBF, which in turn drastically increases the rate of unmasked read errors, as a consequence of the extra RAID reconstructions. Lastly, quite a few "white box" vendors have shipped chassis which do not adequately cool and power the drives, and excessive heat can also drastically reduce the MTBF. In a well-designed chassis, with software which correctly recovers from transient drive issues, I have observed higher than claimed MTBF and much lower than claimed bit error rates. 
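The rebuild arithmetic can be reproduced from first principles; a minimal sketch, assuming all seven surviving 500 GB data drives are read in full at the quoted desktop-SATA error rate:

```python
# Back-of-envelope check of the RAID 5 rebuild arithmetic. Assumed figures:
# a 7D+1P group of 500 GB drives (seven survivors read in full) and a
# desktop SATA unrecoverable-read-error rate of 1 bit in 10**14.
DRIVE_BYTES = 500 * 10**9
DATA_DRIVES = 7
BITS_PER_ERROR = 10**14

bits_read = DATA_DRIVES * DRIVE_BYTES * 8          # bits read per rebuild
expected_errors = bits_read / BITS_PER_ERROR       # expected UREs per rebuild
rebuilds_per_failure = BITS_PER_ERROR / bits_read

print(f"bits read per rebuild: {bits_read:.1e}")
print(f"expected unrecovered errors per rebuild: {expected_errors:.2f}")
print(f"roughly one failed rebuild in every {rebuilds_per_failure:.1f}")
```

Reading 2.8 x 10**13 bits against a 1-in-10**14 error rate gives about 0.28 expected unrecovered errors per rebuild, i.e. one failed rebuild in every three or four, under the worst-case ten-year archive assumption stated above.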
The undetected error rate from a modern drive is not quoted publicly, but the block ECC is quite strong (since it has to mask raw bit error rates as high as 1 in 10**3) and hence can detect most error scenarios and correct many of them. None of the above, however, implies that we need CRCs on filesystem data structures. That is, if you get EIO on a disk read, then you don't need a CRC to know the block is bad. Other concerns, however, can motivate having CRCs. In particular, if the path between drive and memory can corrupt the data, then CRCs can help us recover to some extent. This has been a recurring problem with various technologies, but was particularly common on IDE (PATA) drives with their vulnerable cables, where mishandling of the cable could lead to silent data corruption. CRCs on just filesystem data structures, however, only help with metadata integrity, leaving file data integrity in the dark. Some vendors at various times, such as Tandem and NetApp, have added a checksum to each block, usually when they were having data integrity issues which turned out in the end to be bad cables or bad software, but which could be masked by RAID recovery, if detected by the block checksum. It is usually cost-effective, however, to simply select a reliable disk subsystem. With SATA, SAS, and FC, which have link-level integrity checks, silent data corruption on the link is unlikely. This leaves mainly the host bus adapter, the main memory, and the path between them. If those are bad, however, it is hard to see how much the filesystem can help. In conclusion, I doubt that CRCs are worth the added complexity. If I wanted to mask flaky hardware, I would look at using RAID 6 and validating parity on all reads and doing RAID recovery on any errors, but doing offline testing of the hardware and repairing or replacing any broken bits would be yet simpler. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation 2007-07-28 2:00 ` William J. Earl @ 2007-07-28 14:03 ` Andi Kleen 2007-07-31 5:30 ` David Chinner 1 sibling, 0 replies; 18+ messages in thread From: Andi Kleen @ 2007-07-28 14:03 UTC (permalink / raw) To: William J. Earl; +Cc: xfs-oss, David Chinner, Michael Nishimoto, markgw "William J. Earl" <earl@agami.com> writes: > It is usually > cost-effective, however, to simply select a reliable disk subsystem. In practice people don't run fully reliable disk subsystems, unfortunately. And also a previously reliable disk subsystem might degrade over time. It's better to handle this case. Also, CPUs have gotten a lot faster recently, with cache misses, rather than actual operations, dominating run time; CRCs on relatively small data items that are already manipulated by the CPU (like file system metadata) should be essentially free. > With SATA, SAS, and FC, which have link-level integrity checks, silent > data corruption on the link is unlikely. For TCP/IP networking that is in theory true too -- with L2 networks usually having CRCs. In practice the TCP/UDP checksum finds a lot of corruptions wherever they come from, and there are even some that are only detected by stronger checksums (like in SSL) or go undetected. While disk subsystems are probably doing a bit better than IP networks in terms of error rates, they are far from perfect either. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
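Andi's claim that table-driven CRCs over cache-hot data are essentially free is easy to sanity-check; a rough sketch using Python's zlib (interpreter call overhead dominates here, so treat the figure as a loose lower bound on what tuned C code in the kernel achieves):

```python
import time
import zlib

# Rough illustration: table-driven CRC32 over a cache-hot 4 KiB buffer.
# Python function-call overhead is included in the timing, so the real
# per-byte cost of the CRC itself is considerably lower than reported.
buf = bytes(4096)                 # one cache-hot metadata-block-sized buffer
iterations = 50_000

start = time.perf_counter()
for _ in range(iterations):
    zlib.crc32(buf)
elapsed = time.perf_counter() - start

mbytes = iterations * len(buf) / 1e6
print(f"checksummed {mbytes:.0f} MB in {elapsed:.4f}s "
      f"(~{mbytes / elapsed:.0f} MB/s, interpreter overhead included)")
```

Even with that overhead, throughput comfortably exceeds what a single disk can deliver, which is the substance of Andi's point: the checksum cost disappears next to the I/O it protects.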
* Re: RFC: log record CRC validation 2007-07-28 2:00 ` William J. Earl 2007-07-28 14:03 ` Andi Kleen @ 2007-07-31 5:30 ` David Chinner 2007-08-01 1:32 ` William J. Earl 1 sibling, 1 reply; 18+ messages in thread From: David Chinner @ 2007-07-31 5:30 UTC (permalink / raw) To: William J. Earl; +Cc: xfs-oss, David Chinner, Michael Nishimoto, markgw On Fri, Jul 27, 2007 at 07:00:32PM -0700, William J. Earl wrote: > David Chinner wrote: > >On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote: > > > >... > >>Is CRC checking being added to xfs log data? > >> > > > >Yes. it's a little used debug option right now, and I'm > >planning on making it default behaviour. > > > > > >>If so, what data has been collected to show that this needs to be added? > >> > > > >The size of high-end filesystems are now at the same order of > >magnitude as the bit error rate of the storage hardware. e.g. 1PB = > >10^16 bits. The bit error rate of high end FC drives? 1 in 10^16 > >bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop > >SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB). > > First, note that the claimed bit error rates are rates of > reported bad blocks, not rates of silent data corruption. The latter, > while not quoted, are far lower. Ok, fair enough, but in the absence of numbers and the fact that real world MTBF numbers are lower than what manufacturers quote I'm always going to assume that this is the ballpark. [snip stuff about raid6, drive data, I/O path corruptions, etc] In summary you are effectively saying this: "if you spend enough money on your storage, then the filesystem doesn't need to worry about integrity." I've heard exactly the same lecture you've just given from other (ex-)XFS engineers that integrity is the total responsibility of the block device. SGI used to ensure that XFS only ran on hardware that followed this mantra and so could get away with that approach to filesystem error detection. But XFS doesn't live in that world any more. 
It stopped being true when XFS got ported to linux. XFS lives in the world of commodity hardware as well as the high end now and we are getting more and more situations where we are having to make tradeoffs for preventing corruption on commodity hardware. e.g. I/O barrier support for disks with volatile write caches. IMO, continuing down this same "the block device is perfect" path is a "head in the sand" approach. By ignoring the fact that errors can and do occur, we're screwing ourselves when something does actually go wrong because we haven't put in place the mechanisms to detect errors because we've assumed they will never happen. We've spent 15 years so far trying to work out what has gone wrong in XFS by adding more and more reactive debug into the code without an eye to a robust solution. We add a chunk of code here to detect that problem, a chunk of code there to detect this problem, and so on. It's just not good enough anymore. Like good security, filesystem integrity is not provided by a single mechanism. "Defense in depth" is what we are aiming to provide here and to do that you have to assume that errors can propagate through every interface into the filesystem. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation 2007-07-31 5:30 ` David Chinner @ 2007-08-01 1:32 ` William J. Earl 2007-08-01 10:02 ` David Chinner 0 siblings, 1 reply; 18+ messages in thread From: William J. Earl @ 2007-08-01 1:32 UTC (permalink / raw) To: David Chinner; +Cc: xfs-oss, Michael Nishimoto, markgw David Chinner wrote: > On Fri, Jul 27, 2007 at 07:00:32PM -0700, William J. Earl wrote: > >> David Chinner wrote: >> >>> ... >>> The size of high-end filesystems are now at the same order of >>> magnitude as the bit error rate of the storage hardware. e.g. 1PB = >>> 10^16 bits. The bit error rate of high end FC drives? 1 in 10^16 >>> bits. For "enterprise" SATA drives? 1 in 10^15 bits. For desktop >>> SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB). >>> >> First, note that the claimed bit error rates are rates of >> reported bad blocks, not rates of silent data corruption. The latter, >> while not quoted, are far lower. >> > > Ok, fair enough, but in the absense of numbers and the fact that > real world MTBF numbers are lower than what mfg's quote I'm > always going to assume that this is the ballpark. > > The real world MTBF numbers are worse for people who use drives outside their specified parameters (as at Google) and better, at least in the first few years, for drives which are used inside their parameters, as far as I have seen at Agami and my previous company. Drive MTBF does not relate, however to data corruption, except in the case of a failed RAID reconstruction, and even then the error is reported. > ... > > IMO, continuing down this same "the block device is perfect" path is > a "head in the sand" approach. By ignoring the fact that errors can > and do occur, we're screwing ourselves when something does actually > go wrong because we haven't put in place the mechanisms to detect > errors because we've assumed they will never happen. 
> > We've spent 15 years so far trying to work out what has gone wrong > in XFS by adding more and more reactive debug into the code without > an eye to a robust solution. We add a chunk of code here to detect > that problem, a chunk of code there to detect this problem, and so > on. It's just not good enough anymore. > > Like good security, filesystem integrity is not provided by a single > mechanism. "Defense in depth" is what we are aiming to provide here > and to do that you have to assume that errors can propagate through > every interface into the filesystem. > > > I understand your argument, but why not simply strengthen the block layer, even if you do it with an optional XFS-based checksum scheme on all blocks? That way, you would not wind up detecting metadata corruption and silently ignoring file data corruption. For example, suppose you stole one block in N (where you might choose N according to the RAID data stripe size, when running over MD), and used it as a checksum block (storing (N-1)*K subblock checksums)? This in effect would require either RAID full-stripe read-modify-write or at least an extra block read-modify-write for a real block write, but it would give you complete metadata and data integrity checking. This could be an XFS feature or a new MD feature ("checksum" layer). This would clearly be somewhat expensive for random writes, much like RAID 6, and also expensive for random reads, unless the N were equal to the RAID block size, but, as with the NetApp and Tandem software checksums, it would assure a high level of data integrity. If the RAID block size were moderately large, say 64 KB, then you could take one 4 KB block in 16 and pay only about 6% of the total space. With modern disks, the extra sequential transfers (even if you have to read 64 KB to get 4 KB) would not be that significant, since rotational latency is the largest cost, not the actual data transfer. 
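The one-checksum-block-in-N layout can be made concrete with the figures quoted above (16 x 4 KiB blocks per 64 KiB group, one reserved for checksums); the mapping below is an assumption for illustration, not any real XFS or MD on-disk format:

```python
# Sketch of a "one checksum block per N" layout: 16 x 4 KiB blocks per
# 64 KiB group, the first block of each group holding the checksums of the
# other 15. The mapping is illustrative only, not a real on-disk format.
BLOCK = 4096
N = 16                            # blocks per group, one reserved for checksums

def locate(data_block: int):
    """Map a logical data block to (physical block, checksum block, slot)."""
    group, slot = divmod(data_block, N - 1)
    checksum_block = group * N
    physical = checksum_block + 1 + slot   # skip the group's checksum block
    return physical, checksum_block, slot

assert locate(0) == (1, 0, 0)
assert locate(14) == (15, 0, 14)   # last data block of the first group
assert locate(15) == (17, 16, 0)   # first data block of the second group

print(f"space overhead: {1 / N:.1%}")   # about 6%, as estimated above
```

Every write of a data block implies a read-modify-write of its group's checksum block as well, which is the random-write cost weighed in the surrounding discussion.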
You would also use more main memory, since you would prefer to buffer at least the 4 KB checksum block associated with any 4 KB data blocks from the RAID block. The extra memory would be between 6% and 100%, depending on how much locality there is in block accesses. This in turn would cause more disk accesses, due to the cache holding fewer real data blocks. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: RFC: log record CRC validation 2007-08-01 1:32 ` William J. Earl @ 2007-08-01 10:02 ` David Chinner 0 siblings, 0 replies; 18+ messages in thread From: David Chinner @ 2007-08-01 10:02 UTC (permalink / raw) To: William J. Earl; +Cc: David Chinner, xfs-oss, Michael Nishimoto, markgw On Tue, Jul 31, 2007 at 06:32:59PM -0700, William J. Earl wrote: > David Chinner wrote: > >IMO, continuing down this same "the block device is perfect" path is > >a "head in the sand" approach. By ignoring the fact that errors can > >and do occur, we're screwing ourselves when something does actually > >go wrong because we haven't put in place the mechanisms to detect > >errors because we've assumed they will never happen. > > > >We've spent 15 years so far trying to work out what has gone wrong > >in XFS by adding more and more reactive debug into the code without > >an eye to a robust solution. We add a chunk of code here to detect > >that problem, a chunk of code there to detect this problem, and so > >on. It's just not good enough anymore. > > > >Like good security, filesystem integrity is not provided by a single > >mechanism. "Defense in depth" is what we are aiming to provide here > >and to do that you have to assume that errors can propagate through > >every interface into the filesystem. > > > > > > > I understand your argument, but why not simply strengthen the > block layer, even you do it with an optional XFS-based checksum scheme > on all blocks? You could probably do this using dm-crypt and an integrity hash. Unfortunately, I don't trust dm-crypt as far as I can throw it because we've been severely bitten by bugs in dm-crypt. Specifically, the bug that existed from 2.6.14 through to 2.6.19 where it reported success to readahead bios that had been cancelled due to block device congestion and hence returned uninitialised bios as "complete" and error free. 
This resulted in XFS shutdowns because the only code in all of Linux that triggered this problem was the XFS btree readahead. Every time this occurred the finger was pointed at XFS because it detected the corruption and we had *zero* information telling us what the problem might have been. A btree block CRC failure would have *immediately* told us the block layer had returned us bad data. This is where I come back to defense in depth. Block device level integrity checking is not sufficient as bugs in this layer still need to be caught by the filesystem. I stand by this even if I implemented the block layer code myself - I don't write perfect code and nobody else does, either. Hence we *must* assume that the layer that returned the block is imperfect even if we wrote it.... > That way, you would not wind up detecting metadata > corruption and silently ignoring file data corruption Data integrity has different selection criteria compared to ensuring structural integrity of the filesystem. That is, different users have different requirements for data integrity and performance. e.g. bit errors in video and audio data don't matter but performance does, yet we still need to protect the filesystem metadata against the same bit errors. Hence, IMO, we really do need to treat them separately to retain flexibility in the filesystem for different applications. I would prefer to aim for per-inode selection of data CRCs and allow inheritance of the attribute through the directory hierarchy. That way the behaviour is user selectable for the subset of the files that the system owner cares about enough to protect. If you want to protect the entire filesystem, set the flag on the root directory at mkfs time.... > For example, > suppose you stole one block in N (where you might choose N according to > the RAID data stripe size, when running over MD), and used it as a > checksum block (storing (N-1)*K subblock checksums)? 
This in effect > would require either RAID full-stripe read-modify-write or at least an > extra block read-modify-write for a real block write, but it would give > you complete metadata and data integrity checking. This could be an > XFS feature or a new MD feature ("checksum" layer). dm-crypt. But, see above. > This would clearly be somewhat expensive for random writes, > much like RAID 6, and also expensive for random reads, unless the N were > equal to the RAID block size, but, as with the NetApp and Tandem > software checksums, it would assure that a high level of data integrity. Good integrity, but random writes are one of XFS's weaknesses and this would just make it worse. NetApp leverages both WAFL's linearisation of random writes and specialised hardware (NVRAM) to avoid this performance problem altogether but we can't. It's a showstopper, IMO. No matter which way I look at it, block layer integrity checking provides insufficient correctness guarantees while neutralising XFS's strengths and magnifying its weaknesses. To me it just doesn't make sense for XFS to go down this path when there are other options that don't have these drawbacks. Basically, it makes much more sense to me to break the problem down into its component pieces and provide protection for each of those pieces one at a time without introducing new issues. This way we don't compromise the strengths of XFS or reduce the flexibility of the filesystem in any way whilst providing better protection against errors that cause corruption. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2007-08-01 11:17 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20070725092445.GT12413810@sgi.com>
2007-07-25 10:14 ` RFC: log record CRC validation Mark Goodwin
2007-07-26 5:55 ` David Chinner
2007-07-26 23:01 ` Andi Kleen
2007-07-26 23:50 ` David Chinner
2007-07-26 17:53 ` Michael Nishimoto
2007-07-26 23:31 ` David Chinner
2007-07-27 1:24 ` Michael Nishimoto
2007-07-27 6:59 ` David Chinner
2007-08-01 0:49 ` Michael Nishimoto
2007-08-01 2:24 ` David Chinner
2007-08-01 2:36 ` Barry Naujok
2007-08-01 2:43 ` David Chinner
2007-08-01 12:11 ` Andi Kleen
2007-07-28 2:00 ` William J. Earl
2007-07-28 14:03 ` Andi Kleen
2007-07-31 5:30 ` David Chinner
2007-08-01 1:32 ` William J. Earl
2007-08-01 10:02 ` David Chinner