public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/3] xfs: fixes for 3.10-rc6
@ 2013-06-12  2:19 Dave Chinner
  2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-12  2:19 UTC (permalink / raw)
  To: xfs; +Cc: bpm

Hi Ben,

Here are the fixes I have for 3.10-rc6. The first fixes the log
recovery regression that Dave Jones reported, the second fixes a bug
in the CRC on-disk format on 32 bit systems, and the third fixes a
BMBT block corruption when CRCs are enabled. All are simple, short
fixes, so please consider them for 3.10-rc6.

Thanks,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
@ 2013-06-12  2:19 ` Dave Chinner
  2013-06-13  1:04   ` Ben Myers
  2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
  2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
  2 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-12  2:19 UTC (permalink / raw)
  To: xfs; +Cc: bpm

From: Dave Chinner <dchinner@redhat.com>

Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, they go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.

Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems. If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.

For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have an LSN
stamped in all metadata at writeback time that we can use to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 45a85ff..7cf5e4e 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1845,7 +1845,13 @@ xlog_recover_do_inode_buffer(
 	xfs_agino_t		*buffer_nextp;
 
 	trace_xfs_log_recover_buf_inode_buf(mp->m_log, buf_f);
-	bp->b_ops = &xfs_inode_buf_ops;
+
+	/*
+	 * Post recovery validation only works properly on CRC enabled
+	 * filesystems.
+	 */
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		bp->b_ops = &xfs_inode_buf_ops;
 
 	inodes_per_buf = BBTOB(bp->b_io_length) >> mp->m_sb.sb_inodelog;
 	for (i = 0; i < inodes_per_buf; i++) {
@@ -2205,7 +2211,16 @@ xlog_recover_do_reg_buffer(
 	/* Shouldn't be any more regions */
 	ASSERT(i == item->ri_total);
 
-	xlog_recovery_validate_buf_type(mp, bp, buf_f);
+	/*
+	 * We can only do post recovery validation on items on CRC enabled
+	 * filesystems as we need to know when the buffer was written to be able
+	 * to determine if we should have replayed the item. If we replay old
+	 * metadata over a newer buffer, then it will enter a temporarily
+	 * inconsistent state resulting in verification failures. Hence for now
+	 * just avoid the verification stage for non-CRC filesystems.
+	 */
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		xlog_recovery_validate_buf_type(mp, bp, buf_f);
 }
 
 /*
-- 
1.7.10.4


* [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats
  2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
  2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
@ 2013-06-12  2:19 ` Dave Chinner
  2013-06-13  0:58   ` Ben Myers
  2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
  2 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-12  2:19 UTC (permalink / raw)
  To: xfs; +Cc: bpm

From: Dave Chinner <dchinner@redhat.com>

Michael L. Semon has been testing CRC patches ona 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit an unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:

.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.

It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
here's the freespace that is set up:

[  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[  172.316346] #9 calling xfs_dir2_data_log_unused()
[  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096

Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew what was causing
the bug.

Essentially, all directory entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode in place as variables are naturally aligned. With the
directory data supposedly starting on an 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit an
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.

And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
bytes on 32 bit Intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.

Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that filesystems created before
this fix on such systems will not be readable with this fix applied.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_attr_leaf.h   |    1 +
 fs/xfs/xfs_dir2_format.h |    5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_attr_leaf.h b/fs/xfs/xfs_attr_leaf.h
index f9d7846..444a770 100644
--- a/fs/xfs/xfs_attr_leaf.h
+++ b/fs/xfs/xfs_attr_leaf.h
@@ -128,6 +128,7 @@ struct xfs_attr3_leaf_hdr {
 	__u8			holes;
 	__u8			pad1;
 	struct xfs_attr_leaf_map freemap[XFS_ATTR_LEAF_MAPSIZE];
+	__be32			pad2;		/* 64 bit alignment */
 };
 
 #define XFS_ATTR3_LEAF_CRC_OFF	(offsetof(struct xfs_attr3_leaf_hdr, info.crc))
diff --git a/fs/xfs/xfs_dir2_format.h b/fs/xfs/xfs_dir2_format.h
index 995f1f5..7826782 100644
--- a/fs/xfs/xfs_dir2_format.h
+++ b/fs/xfs/xfs_dir2_format.h
@@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr {
 struct xfs_dir3_data_hdr {
 	struct xfs_dir3_blk_hdr	hdr;
 	xfs_dir2_data_free_t	best_free[XFS_DIR2_DATA_FD_COUNT];
+	__be32			pad;	/* 64 bit alignment */
 };
 
 #define XFS_DIR3_DATA_CRC_OFF  offsetof(struct xfs_dir3_data_hdr, hdr.crc)
@@ -477,7 +478,7 @@ struct xfs_dir3_leaf_hdr {
 	struct xfs_da3_blkinfo	info;		/* header for da routines */
 	__be16			count;		/* count of entries */
 	__be16			stale;		/* count of stale entries */
-	__be32			pad;
+	__be32			pad;		/* 64 bit alignment */
 };
 
 struct xfs_dir3_icleaf_hdr {
@@ -715,7 +716,7 @@ struct xfs_dir3_free_hdr {
 	__be32			firstdb;	/* db of first entry */
 	__be32			nvalid;		/* count of valid entries */
 	__be32			nused;		/* count of used entries */
-	__be32			pad;		/* 64 bit alignment. */
+	__be32			pad;		/* 64 bit alignment */
 };
 
 struct xfs_dir3_free {
-- 
1.7.10.4


* [PATCH 3/3] xfs: ensure btree root split sets blkno correctly
  2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
  2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
  2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
@ 2013-06-12  2:19 ` Dave Chinner
  2013-06-13 19:16   ` Ben Myers
  2 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-12  2:19 UTC (permalink / raw)
  To: xfs; +Cc: bpm

From: Dave Chinner <dchinner@redhat.com>

For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a difference code path on root splits to the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317

which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:

$ pminfo -f xfs.btree.block_map

xfs.btree.block_map.lookup
    value 39135

xfs.btree.block_map.compare
    value 268432

xfs.btree.block_map.insrec
    value 15786

xfs.btree.block_map.delrec
    value 13884

xfs.btree.block_map.newroot
    value 2

xfs.btree.block_map.killroot
    value 0
.....

Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e the code is not exercised at all, and it's the only generic
btree infrastruct operation that is not exercised by a default run
of xfstests.

Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.

Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_btree.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 8804b8a..0903960 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -2544,7 +2544,17 @@ xfs_btree_new_iroot(
 	if (error)
 		goto error0;
 
+	/*
+	 * We can't just memcpy() the root in for CRC enabled btree blocks.
+	 * In that case we also have to ensure the blkno remains correct.
+	 */
 	memcpy(cblock, block, xfs_btree_block_len(cur));
+	if (cur->bc_flags & XFS_BTREE_CRC_BLOCKS) {
+		if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+			cblock->bb_u.l.bb_blkno = cpu_to_be64(cbp->b_bn);
+		else
+			cblock->bb_u.s.bb_blkno = cpu_to_be64(cbp->b_bn);
+	}
 
 	be16_add_cpu(&block->bb_level, 1);
 	xfs_btree_set_numrecs(block, 1);
-- 
1.7.10.4


* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats
  2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
@ 2013-06-13  0:58   ` Ben Myers
  2013-06-13  1:40     ` Michael L. Semon
  2013-06-13  2:27     ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-13  0:58 UTC (permalink / raw)
  To: Dave Chinner, Michael L. Semon; +Cc: xfs

On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Michael L. Semon has been testing CRC patches ona 32 bit system and
						on a

> been seeing assert failures in the directory code from xfs/080.
> Thanks to Michael's heroic efforts with printk debugging, we found
> that the problem was that the last free space being left in the
> directory structure was too small to fit an unused tag structure and
> it was being corrupted and attempting to log a region out of bounds.
> Hence the assert failure looked something like:
> 
> .....
> #5 calling xfs_dir2_data_log_unused() 36 32
> #1 4092 4095 4096
> #2 8182 8183 4096
     first? 
          last?
               bp->b_length?

> XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
> 
> Where #1 showed the first region of the dup being logged (i.e. the
> last 4 bytes of a directory buffer) and #2 shows the corrupt values
> being calculated from the length of the dup entry which overflowed
> the size of the buffer.
> 
> It turns out that the problem was not in the logging code, nor in
> the freespace handling code. It is an initial condition bug that
> only shows up on 32 bit systems. When a new buffer is initialised,
> here's the freespace that is set up:
> 
> [  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
> [  172.316346] #9 calling xfs_dir2_data_log_unused()
> [  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
> [  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096
> 
> Note the offset of the first region being logged? It's 60 bytes into
> the buffer. Once I saw that, I pretty much knew what was causing
> the bug.
> 
> Essentially, all directory entries are rounded to 8 bytes in length,
> and all entries start with an 8 byte alignment. This means that we
> can decode in place as variables are naturally aligned. With the
> directory data supposedly starting on an 8 byte boundary, and all
> entries padded to 8 bytes, the minimum freespace in a directory
> block is supposed to be 8 bytes, which is large enough to fit an
> unused data entry structure (6 bytes in size). The fact we only have
> 4 bytes of free space indicates a directory data block alignment
> problem.
> 
> And what do you know - there's an implicit hole in the directory
> data block header for the CRC format, which means the header is 60
> bytes on 32 bit Intel systems and 64 bytes on 64 bit systems. Needs
> padding. And while looking at the structures, I found the same
> problem in the attr leaf header. Fix them both.
> 
> Note that this only affects 32 bit systems with CRCs enabled.
> Everything else is just fine. Note that filesystems created before
					 CRC enabled filesystems

I suggest this be added to head off any confusion.

> this fix on such systems will not be readable with this fix applied.
> 
> Reported-by: Michael L. Semon <mlsemon35@gmail.com>
> Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_attr_leaf.h   |    1 +
>  fs/xfs/xfs_dir2_format.h |    5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_attr_leaf.h b/fs/xfs/xfs_attr_leaf.h
> index f9d7846..444a770 100644
> --- a/fs/xfs/xfs_attr_leaf.h
> +++ b/fs/xfs/xfs_attr_leaf.h
> @@ -128,6 +128,7 @@ struct xfs_attr3_leaf_hdr {
>  	__u8			holes;
>  	__u8			pad1;
>  	struct xfs_attr_leaf_map freemap[XFS_ATTR_LEAF_MAPSIZE];
> +	__be32			pad2;		/* 64 bit alignment */
>  };
>  
>  #define XFS_ATTR3_LEAF_CRC_OFF	(offsetof(struct xfs_attr3_leaf_hdr, info.crc))
> diff --git a/fs/xfs/xfs_dir2_format.h b/fs/xfs/xfs_dir2_format.h
> index 995f1f5..7826782 100644
> --- a/fs/xfs/xfs_dir2_format.h
> +++ b/fs/xfs/xfs_dir2_format.h
> @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr {
>  struct xfs_dir3_data_hdr {
>  	struct xfs_dir3_blk_hdr	hdr;
>  	xfs_dir2_data_free_t	best_free[XFS_DIR2_DATA_FD_COUNT];
> +	__be32			pad;	/* 64 bit alignment */

I counted these up and it looks fine.  Nice work gents.

Reviewed-by: Ben Myers <bpm@sgi.com>


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
@ 2013-06-13  1:04   ` Ben Myers
  2013-06-13  2:08     ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-13  1:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hey Dave,

On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unfortunately, we cannot guarantee that items logged multiple times
> and replayed by log recovery do not take objects back in time. When
> they are taken back in time, they go into an intermediate state which
> is corrupt, and hence verification that occurs on this intermediate
> state causes log recovery to abort with a corruption shutdown.
> 
> Instead of causing a shutdown and unmountable filesystem, don't
> verify post-recovery items before they are written to disk. This is
> less than optimal, but there is no way to detect this issue for
> non-CRC filesystems. If log recovery successfully completes, this
> will be undone and the object will be consistent by subsequent
> transactions that are replayed, so in most cases we don't need to
> take drastic action.
> 
> For CRC enabled filesystems, leave the verifiers in place - we need
> to call them to recalculate the CRCs on the objects anyway. This
> recovery problem can be solved for such filesystems - we have an LSN
> stamped in all metadata at writeback time that we can use to determine
> whether the item should be replayed or not. This is a separate piece
> of work, so is not addressed by this patch.

Is there a test case for this one?  How are you reproducing this?

Thanks,
	Ben


* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats
  2013-06-13  0:58   ` Ben Myers
@ 2013-06-13  1:40     ` Michael L. Semon
  2013-06-13  2:27     ` Dave Chinner
  1 sibling, 0 replies; 31+ messages in thread
From: Michael L. Semon @ 2013-06-13  1:40 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On 06/12/2013 08:58 PM, Ben Myers wrote:
> On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Michael L. Semon has been testing CRC patches ona 32 bit system and
> 						on a
> 
>> been seeing assert failures in the directory code from xfs/080.
>> Thanks to Michael's heroic efforts with printk debugging, we found
>> that the problem was that the last free space being left in the
>> directory structure was too small to fit an unused tag structure and
>> it was being corrupted and attempting to log a region out of bounds.
>> Hence the assert failure looked something like:
>>
>> .....
>> #5 calling xfs_dir2_data_log_unused() 36 32
>> #1 4092 4095 4096
>> #2 8182 8183 4096
>      first? 
>           last?
>                bp->b_length?

           BBTOB(bp->b_length)

This is all terrible numbering on my part...

>> #1 4092 4095 4096
>> #2 8182 8183 4096

xfs_dir2_data_log_unused() calls xfs_trans_log_buf() twice in the same 
function.  #1 is the first call, #2 is the second call, and there's no 
running count.  The printk() is a copy-and-paste of those two function 
calls plus a BBTOB(bp->b_length) below it.

>> #5 calling xfs_dir2_data_log_unused() 36 32

The #5 was caused by numbering all the calls to 
xfs_dir2_data_log_unused() to see if one code path was being called 
every time.  #5 is in the xfs_dir2_data_use_free() function, starting 
with this else-if...

        else if (matchfront) {
                newdup = (xfs_dir2_data_unused_t *)((char *)hdr + offset + len);
                newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
                newdup->length = cpu_to_be16(oldlen - len);
                *xfs_dir2_data_unused_tag_p(newdup) =
                        cpu_to_be16((char *)newdup - (char *)hdr);
                printk( KERN_INFO "#5 calling xfs_dir2_data_log_unused() %d %d\n", oldlen, len );
                xfs_dir2_data_log_unused(tp, bp, newdup);

Sorry about that!

Michael


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-13  1:04   ` Ben Myers
@ 2013-06-13  2:08     ` Dave Chinner
  2013-06-13 22:09       ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-13  2:08 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> Hey Dave,
> 
> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Unfortunately, we cannot guarantee that items logged multiple times
> > and replayed by log recovery do not take objects back in time. When
> > > they are taken back in time, they go into an intermediate state which
> > is corrupt, and hence verification that occurs on this intermediate
> > state causes log recovery to abort with a corruption shutdown.
> > 
> > Instead of causing a shutdown and unmountable filesystem, don't
> > verify post-recovery items before they are written to disk. This is
> > less than optimal, but there is no way to detect this issue for
> > > non-CRC filesystems. If log recovery successfully completes, this
> > will be undone and the object will be consistent by subsequent
> > transactions that are replayed, so in most cases we don't need to
> > take drastic action.
> > 
> > For CRC enabled filesystems, leave the verifiers in place - we need
> > to call them to recalculate the CRCs on the objects anyway. This
> > > recovery problem can be solved for such filesystems - we have an LSN
> > > stamped in all metadata at writeback time that we can use to determine
> > whether the item should be replayed or not. This is a separate piece
> > of work, so is not addressed by this patch.
> 
> Is there a test case for this one?  How are you reproducing this?

The test case was Dave Jones running sysrq-b on a hung test machine.
The machine would occasionally end up with a corrupt home directory.

http://oss.sgi.com/pipermail/xfs/2013-May/026759.html

Analysis from a metadump provided by Dave:

http://oss.sgi.com/pipermail/xfs/2013-June/026965.html

And Cai also appeared to be hitting this after a crash on 3.10-rc4,
as it's giving exactly the same "verifier failed during log recovery"
stack trace:

http://oss.sgi.com/pipermail/xfs/2013-June/026889.html

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats
  2013-06-13  0:58   ` Ben Myers
  2013-06-13  1:40     ` Michael L. Semon
@ 2013-06-13  2:27     ` Dave Chinner
  2013-06-13 21:31       ` Ben Myers
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-13  2:27 UTC (permalink / raw)
  To: Ben Myers; +Cc: Michael L. Semon, xfs

On Wed, Jun 12, 2013 at 07:58:19PM -0500, Ben Myers wrote:
> On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Michael L. Semon has been testing CRC patches ona 32 bit system and
> 						on a
> 
> > been seeing assert failures in the directory code from xfs/080.
> > Thanks to Michael's heroic efforts with printk debugging, we found
> > that the problem was that the last free space being left in the
> > directory structure was too small to fit an unused tag structure and
> > it was being corrupted and attempting to log a region out of bounds.
> > Hence the assert failure looked something like:
> > 
> > .....
> > #5 calling xfs_dir2_data_log_unused() 36 32
> > #1 4092 4095 4096
> > #2 8182 8183 4096
>      first? 
>           last?
>                bp->b_length?

Yup.

> > 
> > Note that this only affects 32 bit systems with CRCs enabled.
> > Everything else is just fine. Note that filesystems created before
> 					 CRC enabled filesystems
> 
> I suggest this be added to head off any confusion.

Sure.  Do I need to resubmit this, or are you going to just modify
the commit message yourself before applying it?

> > index 995f1f5..7826782 100644
> > --- a/fs/xfs/xfs_dir2_format.h
> > +++ b/fs/xfs/xfs_dir2_format.h
> > @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr {
> >  struct xfs_dir3_data_hdr {
> >  	struct xfs_dir3_blk_hdr	hdr;
> >  	xfs_dir2_data_free_t	best_free[XFS_DIR2_DATA_FD_COUNT];
> > +	__be32			pad;	/* 64 bit alignment */
> 
> I counted these up and it looks fine.  Nice work gents.

pahole is a much better way of determining structure size - it tells
you exactly what the compiler did, rather than having to assume what
the compiler is going to do...

$ pahole fs/xfs/xfs_dir2_data.o |less
.....
struct xfs_dir3_blk_hdr {
        __be32                     magic;                /*     0     4 */
        __be32                     crc;                  /*     4     4 */
        __be64                     blkno;                /*     8     8 */
        __be64                     lsn;                  /*    16     8 */
        uuid_t                     uuid;                 /*    24    16 */
        __be64                     owner;                /*    40     8 */

        /* size: 48, cachelines: 1, members: 6 */
        /* last cacheline: 48 bytes */
};
struct xfs_dir3_data_hdr {
        struct xfs_dir3_blk_hdr    hdr;                  /*     0    48 */
        xfs_dir2_data_free_t       best_free[3];         /*    48    12 */
        __be32                     pad;                  /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 3 */
};
....


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/3] xfs: ensure btree root split sets blkno correctly
  2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
@ 2013-06-13 19:16   ` Ben Myers
  2013-06-14  0:21     ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-13 19:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jun 12, 2013 at 12:19:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For CRC enabled filesystems, the BMBT is rooted in an inode, so it
> passes through a difference code path on root splits to the
		   different			       than

> freespace and inode btrees. This is much less traversed by xfstests
> than the other trees. When testing on a 1k block size filesystem,
> I've been seeing ASSERT failures in generic/234 like:
> 
> XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317
> 
> which are generally preceded by a lblock check failure. I noticed
> this in the bmbt stats:
> 
> $ pminfo -f xfs.btree.block_map
> 
> xfs.btree.block_map.lookup
>     value 39135
> 
> xfs.btree.block_map.compare
>     value 268432
> 
> xfs.btree.block_map.insrec
>     value 15786
> 
> xfs.btree.block_map.delrec
>     value 13884
> 
> xfs.btree.block_map.newroot
>     value 2
> 
> xfs.btree.block_map.killroot
>     value 0
> .....
> 
> Very little coverage of root splits and merges. Indeed, on a 4k
> filesystem, block_map.newroot and block_map.killroot are both zero.
> i.e the code is not exercised at all, and it's the only generic
     .
> btree infrastruct operation that is not exercised by a default run
	infrastructure 

Cleaned those up.

> of xfstests.
> 
> Turns out that on a 1k filesystem, generic/234 accounts for one of
> those two root splits, and that is somewhat of a smoking gun. In
> fact, it's the same problem we saw in the directory/attr code where
> headers are memcpy()d from one block to another without updating the
> self describing metadata.

It is very interesting that this area of code is exercised so infrequently.  I
remember seeing a paper that described a tool that would list codepaths that
are exercised during a test run.  Does that ring a bell?  It seems like this
might be worth looking into more generally.

> Simple fix - when copying the header out of the root block, make
> sure the block number is updated correctly.

Yep, looks fine.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Ben Myers <bpm@sgi.com>


* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats
  2013-06-13  2:27     ` Dave Chinner
@ 2013-06-13 21:31       ` Ben Myers
  0 siblings, 0 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-13 21:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael L. Semon, xfs

Hey Dave,

On Thu, Jun 13, 2013 at 12:27:29PM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2013 at 07:58:19PM -0500, Ben Myers wrote:
> > On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Michael L. Semon has been testing CRC patches ona 32 bit system and
> > 						on a
> > 
> > > been seeing assert failures in the directory code from xfs/080.
> > > Thanks to Michael's heroic efforts with printk debugging, we found
> > > that the problem was that the last free space being left in the
> > > directory structure was too small to fit an unused tag structure and
> > > it was being corrupted and attempting to log a region out of bounds.
> > > Hence the assert failure looked something like:
> > > 
> > > .....
> > > #5 calling xfs_dir2_data_log_unused() 36 32
> > > #1 4092 4095 4096
> > > #2 8182 8183 4096
> >      first? 
> >           last?
> >                bp->b_length?
> 
> Yup.
> 
> > > 
> > > Note that this only affects 32 bit systems with CRCs enabled.
> > > Everything else is just fine. Note that filesystems created before
> > 					 CRC enabled filesystems
> > 
> > I suggest this be added to head off any confusion.
> 
> Sure.  Do I need to resubmit this, or are you going to just modify
> the commit message yourself before applying it?

Nah, I'll do it.

> > > index 995f1f5..7826782 100644
> > > --- a/fs/xfs/xfs_dir2_format.h
> > > +++ b/fs/xfs/xfs_dir2_format.h
> > > @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr {
> > >  struct xfs_dir3_data_hdr {
> > >  	struct xfs_dir3_blk_hdr	hdr;
> > >  	xfs_dir2_data_free_t	best_free[XFS_DIR2_DATA_FD_COUNT];
> > > +	__be32			pad;	/* 64 bit alignment */
> > 
> > I counted these up and it looks fine.  Nice work gents.
> 
> pahole is a much better way of determining structure size - it tells
> you exactly what the compiler did, rather than having to assume what
> the compiler is going to do...
> 
> $ pahole fs/xfs/xfs_dir2_data.o |less

Thanks, I'll check it out.  Applied this.

-Ben

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-13  2:08     ` Dave Chinner
@ 2013-06-13 22:09       ` Ben Myers
  2013-06-14  0:13         ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-13 22:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> > On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Unfortunately, we cannot guarantee that items logged multiple times
> > > and replayed by log recovery do not take objects back in time. When
> > > they are taken back in time, they go into an intermediate state which
> > > is corrupt, and hence verification that occurs on this intermediate
> > > state causes log recovery to abort with a corruption shutdown.
> > > 
> > > Instead of causing a shutdown and unmountable filesystem, don't
> > > verify post-recovery items before they are written to disk. This is
> > > less than optimal, but there is no way to detect this issue for
> > > non-CRC filesystems. If log recovery successfully completes, this
> > > will be undone and the object will be consistent by subsequent
> > > transactions that are replayed, so in most cases we don't need to
> > > take drastic action.
> > > 
> > > For CRC enabled filesystems, leave the verifiers in place - we need
> > > to call them to recalculate the CRCs on the objects anyway. This
> > > recovery problem can be solved for such filesystems - we have an LSN
> > > stamped in all metadata at writeback time that we can use to determine
> > > whether the item should be replayed or not. This is a separate piece
> > > of work, so is not addressed by this patch.
> > 
> > Is there a test case for this one?  How are you reproducing this?
> 
> The test case was Dave Jones running sysrq-b on a hung test machine.
> The machine would occasionally end up with a corrupt home directory.
> 
> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
> 
> Analysis from a metadump provided by Dave:
> 
> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
>
> And Cai also appeared to be hitting this after a crash on 3.10-rc4,
> as it's giving exactly the same "verifier failed during log recovery"
> stack trace:
> 
> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html

Thanks.  It appears that the verifiers have found corruption due to a
flaw in log recovery, and the fix you are proposing is to stop using
them.  If we do that, we'll have no way of detecting the corruption and
will end up hanging users of older kernels out to dry.

I think your suggestion that non-debug systems could warn instead of
fail is a good one, but removing the verifier altogether is
inappropriate.

Can you make the metadump available?  I need to understand this better
before I can sign off.  Also:  Any idea how far back this one goes?

Regards,
	Ben

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-13 22:09       ` Ben Myers
@ 2013-06-14  0:13         ` Dave Chinner
  2013-06-14 12:55           ` Mark Tinguely
  2013-06-14 16:09           ` Ben Myers
  0 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-14  0:13 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
> Hi Dave,
> 
> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> > > On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Unfortunately, we cannot guarantee that items logged multiple times
> > > > and replayed by log recovery do not take objects back in time. When
> > > > they are taken back in time, they go into an intermediate state which
> > > > is corrupt, and hence verification that occurs on this intermediate
> > > > state causes log recovery to abort with a corruption shutdown.
> > > > 
> > > > Instead of causing a shutdown and unmountable filesystem, don't
> > > > verify post-recovery items before they are written to disk. This is
> > > > less than optimal, but there is no way to detect this issue for
> > > > non-CRC filesystems. If log recovery successfully completes, this
> > > > will be undone and the object will be consistent by subsequent
> > > > transactions that are replayed, so in most cases we don't need to
> > > > take drastic action.
> > > > 
> > > > For CRC enabled filesystems, leave the verifiers in place - we need
> > > > to call them to recalculate the CRCs on the objects anyway. This
> > > > recovery problem can be solved for such filesystems - we have an LSN
> > > > stamped in all metadata at writeback time that we can use to determine
> > > > whether the item should be replayed or not. This is a separate piece
> > > > of work, so is not addressed by this patch.
> > > 
> > > Is there a test case for this one?  How are you reproducing this?
> > 
> > The test case was Dave Jones running sysrq-b on a hung test machine.
> > The machine would occasionally end up with a corrupt home directory.
> > 
> > http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
> > 
> > Analysis from a metadump provided by Dave:
> > 
> > http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
> >
> > And Cai also appeared to be hitting this after a crash on 3.10-rc4,
> > as it's giving exactly the same "verifier failed during log recovery"
> > stack trace:
> > 
> > http://oss.sgi.com/pipermail/xfs/2013-June/026889.html
> 
> Thanks.  It appears that the verifiers have found corruption due to a
> flaw in log recovery, and the fix you are proposing is to stop using
> them.  If we do that, we'll have no way of detecting the corruption and
> will end up hanging users of older kernels out to dry.

We've never detected it before, and it's causing regressions for
multiple people. We *can't fix it* because we can't detect the
situation sanely, and we are not leaving people with old kernels
hanging out to dry. The opposite is true: we are fucking over
current users by preventing log recovery on filesystems that will
recover perfectly OK and have almost always recovered just fine in
the past.

> I think your suggestion that non-debug systems could warn instead of
> fail is a good one, but removing the verifier altogether is
> inappropriate.

Changing every single verifier in a non-trivial way is not something
I'm about to do for a -rc6 kernel. Removing the verifiers from log
recovery just reverts to the pre-3.8 situation, so it is a perfectly
acceptable short-term solution while we do the more invasive verifier
changes.

> Can you make the metadump available?  I need to understand this better
> before I can sign off.  Also:  Any idea how far back this one goes?

No, I can't make the metadump available to you - it was provided
privately and not obfuscated and so you'd have to ask Dave for it.

As to how long this problem has existed? It's a zero-day bug. Like I
said, I've suspected for years that this can happen, and only now do
we have proof of it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 3/3] xfs: ensure btree root split sets blkno correctly
  2013-06-13 19:16   ` Ben Myers
@ 2013-06-14  0:21     ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-14  0:21 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Thu, Jun 13, 2013 at 02:16:17PM -0500, Ben Myers wrote:
> On Wed, Jun 12, 2013 at 12:19:08PM +1000, Dave Chinner wrote:
> > Turns out that on a 1k filesystem, generic/234 accounts for one of
> > those two root splits, and that is somewhat of a smoking gun. In
> > fact, it's the same problem we saw in the directory/attr code where
> > headers are memcpy()d from one block to another without updating the
> > self describing metadata.
> 
> It is very interesting that this area of code is exercised so infrequently.

Well-known problem - I remember that we did xfstests code coverage
analysis way back in 2005 at SGI, and this was something that popped
out. The stats simply confirmed what the code coverage profiling was
telling us...

That's the reason I run 512 byte/1k block size testing all the time
with xfstests - it covers all of these code paths that 4k block size
doesn't cover.
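
For anyone wanting to reproduce that coverage, a hypothetical xfstests
local.config fragment along these lines selects a small block size
(device paths are placeholders):

```shell
# Hypothetical xfstests local.config fragment: exercise the small
# block size code paths that a default 4k run never reaches.
export TEST_DEV=/dev/sdb1           # placeholder device
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdc1        # placeholder device
export SCRATCH_MNT=/mnt/scratch
export MKFS_OPTIONS="-b size=1024"  # or "-b size=512"
```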

> I
> remember seeing a paper that described a tool that would list codepaths that
> are exercised during a test run.  Does that ring a bell?  It seems like this
> might be worth looking into more generally.

You mean code coverage profiling?

FWIW, RH QA has run some xfstests code coverage analysis recently,
too, and it has mostly confirmed the same results that we got at SGI
years ago - we get roughly 70% code coverage of the kernel code
from a default 4k filesystem xfstests run.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14  0:13         ` Dave Chinner
@ 2013-06-14 12:55           ` Mark Tinguely
  2013-06-14 16:09           ` Ben Myers
  1 sibling, 0 replies; 31+ messages in thread
From: Mark Tinguely @ 2013-06-14 12:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ben Myers, xfs

On 06/13/13 19:13, Dave Chinner wrote:
> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
>> Hi Dave,
>>
>> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
>>> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
>>>> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
>>>>> From: Dave Chinner<dchinner@redhat.com>
>>>>>
>>>>> Unfortunately, we cannot guarantee that items logged multiple times
>>>>> and replayed by log recovery do not take objects back in time. When
>>>>> they are taken back in time, they go into an intermediate state which
>>>>> is corrupt, and hence verification that occurs on this intermediate
>>>>> state causes log recovery to abort with a corruption shutdown.
>>>>>
>>>>> Instead of causing a shutdown and unmountable filesystem, don't
>>>>> verify post-recovery items before they are written to disk. This is
>>>>> less than optimal, but there is no way to detect this issue for
>>>>> non-CRC filesystems. If log recovery successfully completes, this
>>>>> will be undone and the object will be consistent by subsequent
>>>>> transactions that are replayed, so in most cases we don't need to
>>>>> take drastic action.
>>>>>
>>>>> For CRC enabled filesystems, leave the verifiers in place - we need
>>>>> to call them to recalculate the CRCs on the objects anyway. This
>>>>> recovery problem can be solved for such filesystems - we have an LSN
>>>>> stamped in all metadata at writeback time that we can use to determine
>>>>> whether the item should be replayed or not. This is a separate piece
>>>>> of work, so is not addressed by this patch.
>>>>
>>>> Is there a test case for this one?  How are you reproducing this?
>>>
>>> The test case was Dave Jones running sysrq-b on a hung test machine.
>>> The machine would occasionally end up with a corrupt home directory.
>>>
>>> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
>>>
>>> Analysis from a metadump provided by Dave:
>>>
>>> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
>>>
>>> And Cai also appeared to be hitting this after a crash on 3.10-rc4,
>>> as it's giving exactly the same "verifier failed during log recovery"
>>> stack trace:
>>>
>>> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html
>>
>> Thanks.  It appears that the verifiers have found corruption due to a
>> flaw in log recovery, and the fix you are proposing is to stop using
>> them.  If we do that, we'll have no way of detecting the corruption and
>> will end up hanging users of older kernels out to dry.
>
> We've never detected it before, and it's causing regressions for
> multiple people. We *can't fix it* because we can't detect the
> situation sanely, and we are not leaving people with old kernels
> hanging out to dry. The opposite is true: we are fucking over
> current users by preventing log recovery on filesystems that will
> recover perfectly OK and have almost always recovered just fine in
> the past.
>
>> I think your suggestion that non-debug systems could warn instead of
>> fail is a good one, but removing the verifier altogether is
>> inappropriate.
>
> Changing every single verifier in a non-trivial way is not something
> I'm about to do for a -rc6 kernel. Removing the verifiers from log
> recovery just reverts to the pre-3.8 situation, so it is a perfectly
> acceptable short-term solution while we do the more invasive verifier
> changes.
>
>> Can you make the metadump available?  I need to understand this better
>> before I can sign off.  Also:  Any idea how far back this one goes?
>
> No, I can't make the metadump available to you - it was provided
> privately and not obfuscated and so you'd have to ask Dave for it.
>
> As to how long this problem has existed? It's a zero-day bug. Like I
> said, I've suspected for years that this can happen, and only now do
> we have proof of it...
>
> Cheers,
>
> Dave.

My gut feeling for the patch was the same as Ben's, but thinking this 
over, I have to take back all the eloquent curse words. IMO, we have to 
bring the patch in because the goal for Linux 3.10 is to have a stable 
environment for the non-CRC case and the verifier is breaking XFS log 
recovery for the common non-CRC case.
                   ---
In the common case, the verifier is tripping over the fundamental 
difference between how the AIL works (it consolidates buffer writes into 
one and can write anything that made it to the log up to l_last_sync_lsn) 
and log recovery, which works on each modification.

If there is another unknown kind of future write, then it would be nice 
to know and having a warning message would help. Unfortunately, the 
warning may make recovery noisy and falsely concern the users, but we 
are in uncharted waters.

I will blame and put a reviewed-by on the patch.

--Mark.

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14  0:13         ` Dave Chinner
  2013-06-14 12:55           ` Mark Tinguely
@ 2013-06-14 16:09           ` Ben Myers
  2013-06-14 16:15             ` Eric Sandeen
  2013-06-14 16:17             ` Dave Jones
  1 sibling, 2 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-14 16:09 UTC (permalink / raw)
  To: Dave Jones; +Cc: xfs

Hey,

On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote:
> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
> > Hi Dave,
> > 
> > On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
> > > On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> > > > On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Unfortunately, we cannot guarantee that items logged multiple times
> > > > > and replayed by log recovery do not take objects back in time. When
> > > > > they are taken back in time, they go into an intermediate state which
> > > > > is corrupt, and hence verification that occurs on this intermediate
> > > > > state causes log recovery to abort with a corruption shutdown.
> > > > > 
> > > > > Instead of causing a shutdown and unmountable filesystem, don't
> > > > > verify post-recovery items before they are written to disk. This is
> > > > > less than optimal, but there is no way to detect this issue for
> > > > > non-CRC filesystems. If log recovery successfully completes, this
> > > > > will be undone and the object will be consistent by subsequent
> > > > > transactions that are replayed, so in most cases we don't need to
> > > > > take drastic action.
> > > > > 
> > > > > For CRC enabled filesystems, leave the verifiers in place - we need
> > > > > to call them to recalculate the CRCs on the objects anyway. This
> > > > > recovery problem can be solved for such filesystems - we have an LSN
> > > > > stamped in all metadata at writeback time that we can use to determine
> > > > > whether the item should be replayed or not. This is a separate piece
> > > > > of work, so is not addressed by this patch.
> > > > 
> > > > Is there a test case for this one?  How are you reproducing this?
> > > 
> > > The test case was Dave Jones running sysrq-b on a hung test machine.
> > > The machine would occasionally end up with a corrupt home directory.
> > > 
> > > http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
> > > 
> > > Analysis from a metadump provided by Dave:
> > > 
> > > http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
> > >
> > > And Cai also appeared to be hitting this after a crash on 3.10-rc4,
> > > as it's giving exactly the same "verifier failed during log recovery"
> > > stack trace:
> > > 
> > > http://oss.sgi.com/pipermail/xfs/2013-June/026889.html
> > 
> > Thanks.  It appears that the verifiers have found corruption due to a
> > flaw in log recovery, and the fix you are proposing is to stop using
> > them.  If we do that, we'll have no way of detecting the corruption and
> > will end up hanging users of older kernels out to dry.
> 
> We've never detected it before, and it's causing regressions for
> multiple people. We *can't fix it* because we can't detect the
> situation sanely, and we are not leaving people with old kernels
> hanging out to dry. The opposite is true: we are fucking over
> current users by preventing log recovery on filesystems that will
> recover perfectly OK and have almost always recovered just fine in
> the past.
> 
> > I think your suggestion that non-debug systems could warn instead of
> > fail is a good one, but removing the verifier altogether is
> > inappropriate.
> 
> Changing every single verifier in a non-trivial way is not something
> I'm about to do for a -rc6 kernel. Removing the verifiers from log
> recovery just reverts to the pre-3.8 situation, so it is a perfectly
> acceptable short-term solution while we do the more invasive verifier
> changes.
> 
> > Can you make the metadump available?  I need to understand this better
> > before I can sign off.  Also:  Any idea how far back this one goes?
> 
> No, I can't make the metadump available to you - it was provided
> privately and not obfuscated and so you'd have to ask Dave for it.

Dave (Jones), could you make the metadump available to me?  I'd like to
understand this a little bit better.  I'm a bit uncomfortable with the
proposition that we should corrupt silently in this case...

Thanks,
	Ben

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 16:09           ` Ben Myers
@ 2013-06-14 16:15             ` Eric Sandeen
  2013-06-14 19:08               ` Ben Myers
  2013-06-14 16:17             ` Dave Jones
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2013-06-14 16:15 UTC (permalink / raw)
  To: Ben Myers; +Cc: Dave Jones, xfs

On 6/14/13 11:09 AM, Ben Myers wrote:
> Hey,
> 
> On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote:
>> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
>>> Hi Dave,
>>>
>>> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
>>>> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
>>>>> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
>>>>>> From: Dave Chinner <dchinner@redhat.com>
>>>>>>
>>>>>> Unfortunately, we cannot guarantee that items logged multiple times
>>>>>> and replayed by log recovery do not take objects back in time. When
>>>>>> they are taken back in time, they go into an intermediate state which
>>>>>> is corrupt, and hence verification that occurs on this intermediate
>>>>>> state causes log recovery to abort with a corruption shutdown.
>>>>>>
>>>>>> Instead of causing a shutdown and unmountable filesystem, don't
>>>>>> verify post-recovery items before they are written to disk. This is
>>>>>> less than optimal, but there is no way to detect this issue for
>>>>>> non-CRC filesystems. If log recovery successfully completes, this
>>>>>> will be undone and the object will be consistent by subsequent
>>>>>> transactions that are replayed, so in most cases we don't need to
>>>>>> take drastic action.
>>>>>>
>>>>>> For CRC enabled filesystems, leave the verifiers in place - we need
>>>>>> to call them to recalculate the CRCs on the objects anyway. This
>>>>>> recovery problem can be solved for such filesystems - we have an LSN
>>>>>> stamped in all metadata at writeback time that we can use to determine
>>>>>> whether the item should be replayed or not. This is a separate piece
>>>>>> of work, so is not addressed by this patch.
>>>>>
>>>>> Is there a test case for this one?  How are you reproducing this?
>>>>
>>>> The test case was Dave Jones running sysrq-b on a hung test machine.
>>>> The machine would occasionally end up with a corrupt home directory.
>>>>
>>>> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
>>>>
>>>> Analysis from a metadump provided by Dave:
>>>>
>>>> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
>>>>
>>>> And Cai also appeared to be hitting this after a crash on 3.10-rc4,
>>>> as it's giving exactly the same "verifier failed during log recovery"
>>>> stack trace:
>>>>
>>>> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html
>>>
>>> Thanks.  It appears that the verifiers have found corruption due to a
>>> flaw in log recovery, and the fix you are proposing is to stop using
>>> them.  If we do that, we'll have no way of detecting the corruption and
>>> will end up hanging users of older kernels out to dry.
>>
>> We've never detected it before, and it's causing regressions for
>> multiple people. We *can't fix it* because we can't detect the
>> situation sanely, and we are not leaving people with old kernels
>> hanging out to dry. The opposite is true: we are fucking over
>> current users by preventing log recovery on filesystems that will
>> recover perfectly OK and have almost always recovered just fine in
>> the past.
>>
>>> I think your suggestion that non-debug systems could warn instead of
>>> fail is a good one, but removing the verifier altogether is
>>> inappropriate.
>>
>> Changing every single verifier in a non-trivial way is not something
>> I'm about to do for a -rc6 kernel. Removing the verifiers from log
>> recovery just reverts to the pre-3.8 situation, so it is a perfectly
>> acceptable short-term solution while we do the more invasive verifier
>> changes.
>>
>>> Can you make the metadump available?  I need to understand this better
>>> before I can sign off.  Also:  Any idea how far back this one goes?
>>
>> No, I can't make the metadump available to you - it was provided
>> privately and not obfuscated and so you'd have to ask Dave for it.
> 
> Dave (Jones), could you make the metadump available to me?  I'd like to
> understand this a little bit better.  I'm a bit uncomfortable with the
> proposition that we should corrupt silently in this case...

Ben, isn't it the case that the corruption would only happen if
log replay failed for some reason (as has always been the case,
verifier or not), but with the verifier in place, it kills replay
even w/o other problems due to a logical problem with the
(recently added) verifiers?

IOW - this seems like an actual functional regression due to the
addition of the verifier, and dchinner's patch gets us back
to the almost-always-fine state we were in prior to the change.

As we're at -rc6, it seems quite reasonable to me as a quick
fix to just short-circuit it for now.

If you have time to analyze dave's metadump that's cool, but
this seems like something that really needs to be addressed
before 3.10 gets out the door.

Whenever you're ready, you can also add:

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

(And dchinner, should this cc: stable for 3.9?)

Thanks,
-Eric

> Thanks,
> 	Ben
> 

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 16:09           ` Ben Myers
  2013-06-14 16:15             ` Eric Sandeen
@ 2013-06-14 16:17             ` Dave Jones
  2013-06-14 16:31               ` Ben Myers
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Jones @ 2013-06-14 16:17 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Fri, Jun 14, 2013 at 11:09:40AM -0500, Ben Myers wrote:

 > > > I think your suggestion that non-debug systems could warn instead of
 > > > fail is a good one, but removing the verifier altogether is
 > > > inappropriate.
 > > 
 > > Changing every single verifier in a non-trivial way is not something
 > > I'm about to do for a -rc6 kernel. Removing the verifiers from log
 > > recovery just reverts to the pre-3.8 situation, so it is a perfectly
 > > acceptable short-term solution while we do the more invasive verifier
 > > changes.
 > > 
 > > > Can you make the metadump available?  I need to understand this better
 > > > before I can sign off.  Also:  Any idea how far back this one goes?
 > > 
 > > No, I can't make the metadump available to you - it was provided
 > > privately and not obfuscated and so you'd have to ask Dave for it.
 > 
 > Dave (Jones), could you make the metadump available to me?  I'd like to
 > understand this a little bit better.  I'm a bit uncomfortable with the
 > proposition that we should corrupt silently in this case...

Sorry, I don't have it any more.   I'll see if I can recreate the problem
next week and prepare another dump.

	Dave

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 16:17             ` Dave Jones
@ 2013-06-14 16:31               ` Ben Myers
  0 siblings, 0 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-14 16:31 UTC (permalink / raw)
  To: Dave Jones; +Cc: xfs

Hi Dave,

On Fri, Jun 14, 2013 at 12:17:24PM -0400, Dave Jones wrote:
> On Fri, Jun 14, 2013 at 11:09:40AM -0500, Ben Myers wrote:
> 
>  > > > I think your suggestion that non-debug systems could warn instead of
>  > > > fail is a good one, but removing the verifier altogether is
>  > > > inappropriate.
>  > > 
>  > > Changing every single verifier in a non-trivial way is not something
>  > > I'm about to do for a -rc6 kernel. Removing the verifiers from log
>  > > recovery just reverts to the pre-3.8 situation, so it is a perfectly
>  > > acceptable short-term solution while we do the more invasive verifier
>  > > changes.
>  > > 
>  > > > Can you make the metadump available?  I need to understand this better
>  > > > before I can sign off.  Also:  Any idea how far back this one goes?
>  > > 
>  > > No, I can't make the metadump available to you - it was provided
>  > > privately and not obfuscated and so you'd have to ask Dave for it.
>  > 
>  > Dave (Jones), could you make the metadump available to me?  I'd like to
>  > understand this a little bit better.  I'm a bit uncomfortable with the
>  > proposition that we should corrupt silently in this case...
> 
> Sorry, I don't have it any more.   I'll see if I can recreate the problem
> next week and prepare another dump.

Much appreciated.  Thanks!

-Ben

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 16:15             ` Eric Sandeen
@ 2013-06-14 19:08               ` Ben Myers
  2013-06-14 19:18                 ` Eric Sandeen
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-14 19:08 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Jones, xfs

Hey Eric,

On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote:
> On 6/14/13 11:09 AM, Ben Myers wrote:
> > On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote:
> >> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
> >>> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
> >>>> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> >>>>> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> >>>>>> From: Dave Chinner <dchinner@redhat.com>
> >>>>>>
> >>>>>> Unfortunately, we cannot guarantee that items logged multiple times
> >>>>>> and replayed by log recovery do not take objects back in time. When
> >>>>>> they are taken back in time, they go into an intermediate state which
> >>>>>> is corrupt, and hence verification that occurs on this intermediate
> >>>>>> state causes log recovery to abort with a corruption shutdown.
> >>>>>>
> >>>>>> Instead of causing a shutdown and unmountable filesystem, don't
> >>>>>> verify post-recovery items before they are written to disk. This is
> >>>>>> less than optimal, but there is no way to detect this issue for
> >>>>>> non-CRC filesystems. If log recovery successfully completes, this
> >>>>>> will be undone and the object will be consistent by subsequent
> >>>>>> transactions that are replayed, so in most cases we don't need to
> >>>>>> take drastic action.
> >>>>>>
> >>>>>> For CRC enabled filesystems, leave the verifiers in place - we need
> >>>>>> to call them to recalculate the CRCs on the objects anyway. This
> >>>>>> recovery problem can be solved for such filesystems - we have an LSN
> >>>>>> stamped in all metadata at writeback time that we can use to determine
> >>>>>> whether the item should be replayed or not. This is a separate piece
> >>>>>> of work, so is not addressed by this patch.
> >>>>>
> >>>>> Is there a test case for this one?  How are you reproducing this?
> >>>>
> >>>> The test case was Dave Jones running sysrq-b on a hung test machine.
> >>>> The machine would occasionally end up with a corrupt home directory.
> >>>>
> >>>> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
> >>>>
> >>>> Analysis from a metadump provided by Dave:
> >>>>
> >>>> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
> >>>>
> >>>> And Cai also appeared to be hitting this after a crash on 3.10-rc4,
> >>>> as it's giving exactly the same "verifier failed during log recovery"
> >>>> stack trace:
> >>>>
> >>>> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html
> >>>
> >>> Thanks.  It appears that the verifiers have found corruption due to a
> >>> flaw in log recovery, and the fix you are proposing is to stop using
> >>> them.  If we do that, we'll have no way of detecting the corruption and
> >>> will end up hanging users of older kernels out to dry.
> >>
> >> We've never detected it before, and it's causing regressions for
> >> multiple people. We *can't fix it* because we can't detect the
> >> situation sanely, and we are not leaving people with old kernels
> >> hanging out to dry. The opposite is true: we are fucking over
> >> current users by preventing log recovery on filesystems that will
> >> recover perfectly OK and have almost always recovered just fine in
> >> the past.
> >>
> >>> I think your suggestion that non-debug systems could warn instead of
> >>> fail is a good one, but removing the verifier altogether is
> >>> inappropriate.
> >>
> >> Changing every single verifier in a non-trivial way is not something
> >> I'm about to do for a -rc6 kernel. Removing the verifiers from log
> >> recovery just reverts to the pre-3.8 situation, so it is a perfectly
> >> acceptable short-term solution while we do the more invasive verifier
> >> changes.
> >>
> >>> Can you make the metadump available?  I need to understand this better
> >>> before I can sign off.  Also:  Any idea how far back this one goes?
> >>
> >> No, I can't make the metadump available to you - it was provided
> >> privately and not obfuscated and so you'd have to ask Dave for it.
> > 
> > Dave (Jones), could you make the metadump available to me?  I'd like to
> > understand this a little bit better.  I'm a bit uncomfortable with the
> > proposition that we should corrupt silently in this case...
> 
> Ben, isn't it the case that the corruption would only happen if
> log replay failed for some reason (as has always been the case,
> verifier or not), but with the verifier in place, it kills replay
> even w/o other problems due to a logical problem with the
> (recently added) verifiers?

It seems like the verifier prevented corruption from hitting disk during
log replay.  It is enforcing a partial replay up to the point where the
corruption occurred.  Now you should be able to zero the log, and the
filesystem will not be corrupted.

> IOW - this seems like an actual functional regression due to the
> addition of the verifier, and dchinner's patch gets us back
> to the almost-always-fine state we were in prior to the change.

Oh, the spin doctor is *in*!

This isn't a logical problem with the verifier, it's a logical problem
with log replay.  We need to find a way for recovery to know whether a
given transaction should be replayed.  Fixing that is nontrivial.

> As we're at -rc6, it seems quite reasonable to me as a quick
> fix to just short-circuit it for now.

If we're talking about a short term fix, that's fine.  This should be
conditional on CONFIG_XFS_DEBUG and marked as such.

Long term, removing the verifiers is the wrong thing to do here.  We
need to fix the recovery bug and then remove this temporary workaround.  

> If you have time to analyze dave's metadump that's cool, but
> this seems like something that really needs to be addressed
> before 3.10 gets out the door.

If this really is a day one bug then it's been out the door almost
twenty years.  And you want to hurry now?  ;)

> Whenever you're ready, you can also add:
> 
> Reviewed-by: Eric Sandeen <sandeen@redhat.com>

Sure.

> (And dchinner, should this cc: stable for 3.9?)

Looks like verifiers were added in 3.7.  We should Cc stable.

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 19:08               ` Ben Myers
@ 2013-06-14 19:18                 ` Eric Sandeen
  2013-06-14 19:44                   ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2013-06-14 19:18 UTC (permalink / raw)
  To: Ben Myers; +Cc: Dave Jones, xfs

On 6/14/13 2:08 PM, Ben Myers wrote:
> Hey Eric,
> 
> On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote:
>> On 6/14/13 11:09 AM, Ben Myers wrote:
>>> On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote:
>>>> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
>>>>> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
>>>>>> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
>>>>>>> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
>>>>>>>> From: Dave Chinner <dchinner@redhat.com>
>>>>>>>>
> >>>>>>>> Unfortunately, we cannot guarantee that items logged multiple times
> >>>>>>>> and replayed by log recovery do not take objects back in time. When
> >>>>>>>> they are taken back in time, they go into an intermediate state which
> >>>>>>>> is corrupt, and hence verification that occurs on this intermediate
> >>>>>>>> state causes log recovery to abort with a corruption shutdown.
> >>>>>>>>
> >>>>>>>> Instead of causing a shutdown and an unmountable filesystem, don't
> >>>>>>>> verify post-recovery items before they are written to disk. This is
> >>>>>>>> less than optimal, but there is no way to detect this issue for
> >>>>>>>> non-CRC filesystems. If log recovery successfully completes, this
> >>>>>>>> will be undone and the object will be made consistent by subsequent
> >>>>>>>> transactions that are replayed, so in most cases we don't need to
> >>>>>>>> take drastic action.
> >>>>>>>>
> >>>>>>>> For CRC enabled filesystems, leave the verifiers in place - we need
> >>>>>>>> to call them to recalculate the CRCs on the objects anyway. This
> >>>>>>>> recovery problem can be solved for such filesystems - we have an LSN
> >>>>>>>> stamped in all metadata at writeback time that we can use to determine
> >>>>>>>> whether the item should be replayed or not. This is a separate piece
> >>>>>>>> of work, so is not addressed by this patch.
>>>>>>>
>>>>>>> Is there a test case for this one?  How are you reproducing this?
>>>>>>
>>>>>> The test case was Dave Jones running sysrq-b on a hung test machine.
>>>>>> The machine would occasionally end up with a corrupt home directory.
>>>>>>
>>>>>> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
>>>>>>
> >>>>>> Analysis from a metadump provided by Dave:
>>>>>>
>>>>>> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
>>>>>>
>>>>>> And Cai also appeared to be hitting this after a crash on 3.10-rc4,
>>>>>> as it's giving exactly the same "verifier failed during log recovery"
>>>>>> stack trace:
>>>>>>
>>>>>> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html
>>>>>
>>>>> Thanks.  It appears that the verifiers have found corruption due to a
>>>>> flaw in log recovery, and the fix you are proposing is to stop using
>>>>> them.  If we do that, we'll have no way of detecting the corruption and
>>>>> will end up hanging users of older kernels out to dry.
>>>>
>>>> We've never detected it before, and it's causing regressions for
>>>> multiple people. We *can't fix it* because we can't detect the
>>>> situation sanely, and we are not leaving people with old kernels
>>>> hanging out to dry. The opposite is true: we are fucking over
>>>> current users by preventing log recovery on filesystems that will
> >>>> recover perfectly OK and have almost always recovered just fine in
>>>> the past.
>>>>
>>>>> I think your suggestion that non-debug systems could warn instead of
>>>>> fail is a good one, but removing the verifier altogether is
>>>>> inappropriate.
>>>>
>>>> Changing every single verifier in a non-trivial way is not something
>>>> I'm about to do for a -rc6 kernel. Removing the verifiers from log
> >>>> recovery just reverts to the pre-3.8 situation, so it is a perfectly
> >>>> acceptable short-term solution while we do the more invasive verifier
> >>>> changes.
>>>>
>>>>> Can you make the metadump available?  I need to understand this better
>>>>> before I can sign off.  Also:  Any idea how far back this one goes?
>>>>
>>>> No, I can't make the metadump available to you - it was provided
>>>> privately and not obfuscated and so you'd have to ask Dave for it.
>>>
>>> Dave (Jones), could you make the metadump available to me?  I'd like to
>>> understand this a little bit better.  I'm a bit uncomfortable with the
>>> proposition that we should corrupt silently in this case...
>>
>> Ben, isn't it the case that the corruption would only happen if
>> log replay failed for some reason (as has always been the case,
>> verifier or not), but with the verifier in place, it kills replay
>> even w/o other problems due to a logical problem with the
>> (recently added) verifiers?
> 
> It seems like the verifier prevented corruption from hitting disk during
> log replay.  

It detected an inconsistent *interim* state during replay, which is
always made correct by log replay completion.  But it *stopped* that log
replay completion.  And caused log replay to fail.  And mount to fail.
This is *new* behavior, and bad.

As I understand it.

> It is enforcing a partial replay up to the point where the
> corruption occurred.  Now you should be able to zero the log and the
> filesystem is not corrupted.
> 
>> IOW - this seems like an actual functional regression due to the
>> addition of the verifier, and dchinner's patch gets us back
>> to the almost-always-fine state we were in prior to the change.
> 
> Oh, the spin doctor is *in*!

This is not spin.

> This isn't a logical problem with the verifier, it's a logical problem
> with log replay.  We need to find a way for recovery to know whether a
> given transaction should be replayed.  Fixing that is nontrivial.

Right.

And it's been around for years.  The verifier now detects that
interim state, and makes things *worse* than they would be had log
replay been allowed to continue.

Fixing the interim state may be nontrivial; allowing log replay
to continue to a consistent state as it always has *is* trivial,
it's what's done in Dave's small patch.

>> As we're at -rc6, it seems quite reasonable to me as a quick
>> fix to just short-circuit it for now.
> 
> If we're talking about a short term fix, that's fine.  This should be
> conditional on CONFIG_XFS_DEBUG and marked as such.
> 
> Long term, removing the verifiers is the wrong thing to do here.  We
> need to fix the recovery bug and then remove this temporary workaround.  
> 
>> If you have time to analyze dave's metadump that's cool, but
>> this seems like something that really needs to be addressed
>> before 3.10 gets out the door.
> 
> If this really is a day one bug then it's been out the door almost
> twenty years.  And you want to hurry now?  ;)

We seem to be talking past each other.

The corrupted interim state has been around for years.  Up until
now, log replay completion left things in perfect state.

The verifier now *breaks replay* at that interim point.
Were it allowed to continue, everything would be fine.

As things stand, it is not fine, and this is a recent change
which Dave is trying to correct.

Leaving it in place will cause filesystems which were replaying
logs just fine until recently to now fail with no good way out.

-Eric

>> Whenever you're ready, you can also add:
>>
>> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> 
> Sure.
> 
>> (And dchinner, should this cc: stable for 3.9?)
> 
> Looks like verifiers were added in 3.7.  We should Cc stable.
> 
> Thanks,
> 	Ben
> 


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 19:18                 ` Eric Sandeen
@ 2013-06-14 19:44                   ` Ben Myers
  2013-06-14 19:54                     ` Eric Sandeen
  2013-06-15  0:56                     ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-14 19:44 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Jones, xfs

Hey Eric,

On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote:
> On 6/14/13 2:08 PM, Ben Myers wrote:
> > On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote:
> >> Ben, isn't it the case that the corruption would only happen if
> >> log replay failed for some reason (as has always been the case,
> >> verifier or not), but with the verifier in place, it kills replay
> >> even w/o other problems due to a logical problem with the
> >> (recently added) verifiers?
> > 
> > It seems like the verifier prevented corruption from hitting disk during
> > log replay.  
> 
> It detected an inconsistent *interim* state during replay, which is
> always made correct by log replay completion.  But it *stopped* that log
> replay completion.  And caused log replay to fail.  And mount to fail.
> This is *new* behavior, and bad.
> 
> As I understand it.
>
> > It is enforcing a partial replay up to the point where the
> > corruption occurred.  Now you should be able to zero the log and the
> > filesystem is not corrupted.
> > 
> >> IOW - this seems like an actual functional regression due to the
> >> addition of the verifier, and dchinner's patch gets us back
> >> to the almost-always-fine state we were in prior to the change.
> > 
> > Oh, the spin doctor is *in*!
> 
> This is not spin.
> 
> > This isn't a logical problem with the verifier, it's a logical problem
> > with log replay.  We need to find a way for recovery to know whether a
> > given transaction should be replayed.  Fixing that is nontrivial.
> 
> Right.
> 
> And it's been around for years.  The verifier now detects that
> interim state, and makes things *worse* than they would be had log
> replay been allowed to continue.
> 
> Fixing the interim state may be nontrivial; allowing log replay
> to continue to a consistent state as it always has *is* trivial,
> it's what's done in Dave's small patch.
>
> >> As we're at -rc6, it seems quite reasonable to me as a quick
> >> fix to just short-circuit it for now.
> > 
> > If we're talking about a short term fix, that's fine.  This should be
> > conditional on CONFIG_XFS_DEBUG and marked as such.
> > 
> > Long term, removing the verifiers is the wrong thing to do here.  We
> > need to fix the recovery bug and then remove this temporary workaround.  
> > 
> >> If you have time to analyze dave's metadump that's cool, but
> >> this seems like something that really needs to be addressed
> >> before 3.10 gets out the door.
> > 
> > If this really is a day one bug then it's been out the door almost
> > twenty years.  And you want to hurry now?  ;)
> 
> We seem to be talking past each other.
> 
> The corrupted interim state has been around for years.  Up until
> now, log replay completion left things in perfect state.
> 
> The verifier now *breaks replay* at that interim point.
> Were it allowed to continue, everything would be fine.
> 
> As things stand, it is not fine, and this is a recent change
> which Dave is trying to correct.
> 
> Leaving it in place will cause filesystems which were replaying
> logs just fine until recently to now fail with no good way out.

That is consistent with my understanding of the problem...

Unfortunately log replay is broken.  The verifier has detected this and stopped
replay.  Ideally the solution would be to fix log replay, but that is going to
take some time.  So, in the near term we're just going to disable the verifier
to allow replay to complete.

I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
that developers still have a chance at hitting the log replay problem, and a
comment should be added explaining that we've disabled the verifier due to a
specific bug as a temporary workaround and we'll re-enable the verifier once
it's fixed.  I'll update the patch and repost.

Are you guys arguing that the log replay bug should not be fixed?

-Ben


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 19:44                   ` Ben Myers
@ 2013-06-14 19:54                     ` Eric Sandeen
  2013-06-14 20:22                       ` Ben Myers
  2013-06-15  0:56                     ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2013-06-14 19:54 UTC (permalink / raw)
  To: Ben Myers; +Cc: Dave Jones, xfs

On 6/14/13 2:44 PM, Ben Myers wrote:
> Hey Eric,
> 
> On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote:
>> On 6/14/13 2:08 PM, Ben Myers wrote:
>>> On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote:
>>>> Ben, isn't it the case that the corruption would only happen if
>>>> log replay failed for some reason (as has always been the case,
>>>> verifier or not), but with the verifier in place, it kills replay
>>>> even w/o other problems due to a logical problem with the
>>>> (recently added) verifiers?
>>>
>>> It seems like the verifier prevented corruption from hitting disk during
>>> log replay.  
>>
>> It detected an inconsistent *interim* state during replay, which is
>> always made correct by log replay completion.  But it *stopped* that log
>> replay completion.  And caused log replay to fail.  And mount to fail.
>> This is *new* behavior, and bad.
>>
>> As I understand it.
>>
>>> It is enforcing a partial replay up to the point where the
>>> corruption occurred.  Now you should be able to zero the log and the
>>> filesystem is not corrupted.
>>>
>>>> IOW - this seems like an actual functional regression due to the
>>>> addition of the verifier, and dchinner's patch gets us back
>>>> to the almost-always-fine state we were in prior to the change.
>>>
>>> Oh, the spin doctor is *in*!
>>
>> This is not spin.
>>
>>> This isn't a logical problem with the verifier, it's a logical problem
>>> with log replay.  We need to find a way for recovery to know whether a
>>> given transaction should be replayed.  Fixing that is nontrivial.
>>
>> Right.
>>
>> And it's been around for years.  The verifier now detects that
>> interim state, and makes things *worse* than they would be had log
>> replay been allowed to continue.
>>
>> Fixing the interim state may be nontrivial; allowing log replay
>> to continue to a consistent state as it always has *is* trivial,
>> it's what's done in Dave's small patch.
>>
>>>> As we're at -rc6, it seems quite reasonable to me as a quick
>>>> fix to just short-circuit it for now.
>>>
>>> If we're talking about a short term fix, that's fine.  This should be
>>> conditional on CONFIG_XFS_DEBUG and marked as such.
>>>
>>> Long term, removing the verifiers is the wrong thing to do here.  We
>>> need to fix the recovery bug and then remove this temporary workaround.  
>>>
>>>> If you have time to analyze dave's metadump that's cool, but
>>>> this seems like something that really needs to be addressed
>>>> before 3.10 gets out the door.
>>>
>>> If this really is a day one bug then it's been out the door almost
>>> twenty years.  And you want to hurry now?  ;)
>>
>> We seem to be talking past each other.
>>
>> The corrupted interim state has been around for years.  Up until
>> now, log replay completion left things in perfect state.
>>
>> The verifier now *breaks replay* at that interim point.
>> Were it allowed to continue, everything would be fine.
>>
>> As things stand, it is not fine, and this is a recent change
>> which Dave is trying to correct.
>>
>> Leaving it in place will cause filesystems which were replaying
>> logs just fine until recently to now fail with no good way out.
> 
> That is consistent with my understanding of the problem...
> 
> Unfortunately log replay is broken.  The verifier has detected this and stopped
> replay.  Ideally the solution would be to fix log replay, but that is going to
> take some time.  So, in the near term we're just going to disable the verifier
> to allow replay to complete.

Right, that's what we're hoping - for 3.10 right?

Maybe the talking-past-each-other was only that part.  I thought you didn't
want to disable it for now.

> I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
> that developers still have a chance at hitting the log replay problem, 

so that real-world users will still feel the pain ...?

Or did you say that backwards (really only *disabling* it under debug?)

Ok, confirmed on IRC you mean to disable it if *NOT* debug, enable it
under debug.

> and a
> comment should be added explaining that we've disabled the verifier due to a
> specific bug as a temporary workaround and we'll re-enable the verifier once
> it's fixed.  I'll update the patch and repost.

Maybe if the verifiers were *on* under debug that'd make sense.

I think putting it under the config is overkill, since anyone who wants
to fix it is surely capable of re-enabling it in the code.  But if that
avoids an impasse, I don't much care.

> Are you guys arguing that the log replay bug should not be fixed?

Speaking for myself, I'm not arguing that, not at all.
(not that I know how to fix it, either)

-Eric

> -Ben
> 


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 19:54                     ` Eric Sandeen
@ 2013-06-14 20:22                       ` Ben Myers
  2013-06-28 18:54                         ` Dave Jones
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-14 20:22 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Jones, xfs

Hey Eric,

On Fri, Jun 14, 2013 at 02:54:24PM -0500, Eric Sandeen wrote:
> On 6/14/13 2:44 PM, Ben Myers wrote:
> > Hey Eric,
> > 
> > On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote:
> >> On 6/14/13 2:08 PM, Ben Myers wrote:
> >>> On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote:
> >>>> Ben, isn't it the case that the corruption would only happen if
> >>>> log replay failed for some reason (as has always been the case,
> >>>> verifier or not), but with the verifier in place, it kills replay
> >>>> even w/o other problems due to a logical problem with the
> >>>> (recently added) verifiers?
> >>>
> >>> It seems like the verifier prevented corruption from hitting disk during
> >>> log replay.  
> >>
> >> It detected an inconsistent *interim* state during replay, which is
> >> always made correct by log replay completion.  But it *stopped* that log
> >> replay completion.  And caused log replay to fail.  And mount to fail.
> >> This is *new* behavior, and bad.
> >>
> >> As I understand it.
> >>
> >>> It is enforcing a partial replay up to the point where the
> >>> corruption occurred.  Now you should be able to zero the log and the
> >>> filesystem is not corrupted.
> >>>
> >>>> IOW - this seems like an actual functional regression due to the
> >>>> addition of the verifier, and dchinner's patch gets us back
> >>>> to the almost-always-fine state we were in prior to the change.
> >>>
> >>> Oh, the spin doctor is *in*!
> >>
> >> This is not spin.
> >>
> >>> This isn't a logical problem with the verifier, it's a logical problem
> >>> with log replay.  We need to find a way for recovery to know whether a
> >>> given transaction should be replayed.  Fixing that is nontrivial.
> >>
> >> Right.
> >>
> >> And it's been around for years.  The verifier now detects that
> >> interim state, and makes things *worse* than they would be had log
> >> replay been allowed to continue.
> >>
> >> Fixing the interim state may be nontrivial; allowing log replay
> >> to continue to a consistent state as it always has *is* trivial,
> >> it's what's done in Dave's small patch.
> >>
> >>>> As we're at -rc6, it seems quite reasonable to me as a quick
> >>>> fix to just short-circuit it for now.
> >>>
> >>> If we're talking about a short term fix, that's fine.  This should be
> >>> conditional on CONFIG_XFS_DEBUG and marked as such.
> >>>
> >>> Long term, removing the verifiers is the wrong thing to do here.  We
> >>> need to fix the recovery bug and then remove this temporary workaround.  
> >>>
> >>>> If you have time to analyze dave's metadump that's cool, but
> >>>> this seems like something that really needs to be addressed
> >>>> before 3.10 gets out the door.
> >>>
> >>> If this really is a day one bug then it's been out the door almost
> >>> twenty years.  And you want to hurry now?  ;)
> >>
> >> We seem to be talking past each other.
> >>
> >> The corrupted interim state has been around for years.  Up until
> >> now, log replay completion left things in perfect state.
> >>
> >> The verifier now *breaks replay* at that interim point.
> >> Were it allowed to continue, everything would be fine.
> >>
> >> As things stand, it is not fine, and this is a recent change
> >> which Dave is trying to correct.
> >>
> >> Leaving it in place will cause filesystems which were replaying
> >> logs just fine until recently to now fail with no good way out.
> > 
> > That is consistent with my understanding of the problem...
> > 
> > Unfortunately log replay is broken.  The verifier has detected this and stopped
> > replay.  Ideally the solution would be to fix log replay, but that is going to
> > take some time.  So, in the near term we're just going to disable the verifier
> > to allow replay to complete.
> 
> Right, that's what we're hoping - for 3.10 right?

Yep.  But it should also go back through -stable.

> Maybe the talking-past-each-other was only that part.  I thought you didn't
> want to disable it for now.

I didn't want to disable it.  That was yesterday.  Mark set me straight on how
easy this should be to hit.

> > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
> > that developers still have a chance at hitting the log replay problem, 
> 
> so that real-world users will still feel the pain ...?
> 
> Or did you say that backwards (really only *disabling* it under debug?)
> 
> Ok, confirmed on IRC you mean to disable it if *NOT* debug, enable it
> under debug.

Right.

> > and a
> > comment should be added explaining that we've disabled the verifier due to a
> > specific bug as a temporary workaround and we'll re-enable the verifier once
> > it's fixed.  I'll update the patch and repost.
> 
> Maybe if the verifiers were *on* under debug that'd make sense.

Sorry for the confusion.

> I think putting it under the config is overkill, since anyone who wants
> to fix it is surely capable of re-enabling it in the code.  But if that
> avoids an impasse, I don't much care.

Great.  Apparently I'm being pedantic today.  I'd like it to be clear that we
intend for this to be a temporary workaround....

> > Are you guys arguing that the log replay bug should not be fixed?
> 
> Speaking for myself, I'm not arguing that, not at all.
> (not that I know how to fix it, either)

...until log replay is fixed.  Hell.  I'll just pull it in as-is.

Reviewed-by: Ben Myers <bpm@sgi.com>


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 19:44                   ` Ben Myers
  2013-06-14 19:54                     ` Eric Sandeen
@ 2013-06-15  0:56                     ` Dave Chinner
  2013-06-17 14:53                       ` Ben Myers
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-15  0:56 UTC (permalink / raw)
  To: Ben Myers; +Cc: Dave Jones, Eric Sandeen, xfs

On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote:
> Unfortunately log replay is broken.  The verifier has detected this and stopped
> replay.  Ideally the solution would be to fix log replay, but that is going to
> take some time.  So, in the near term we're just going to disable the verifier
> to allow replay to complete.
> 
> I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
> that developers still have a chance at hitting the log replay problem, and a
> comment should be added explaining that we've disabled the verifier due to a
> specific bug as a temporary workaround and we'll re-enable the verifier once
> it's fixed.  I'll update the patch and repost.
> 
> Are you guys arguing that the log replay bug should not be fixed?

I'm not arguing that it should not be fixed, I'm *stating* that it
*can't be fixed* for non-CRC filesystems. That is, the best we can
do is work around this deficiency in log recovery with some kind of
self-defusing warning....

To fix it properly, you need to know the age of the object being
overwritten relative to the age of overwrite data. For non-CRC
filesystems we don't have that information in the metadata object
being overwritten. We can't even correctly identify the object being
overwritten.

So, it's simply not fixable in log recovery for non-CRC filesystems,
and the LSN stamped in every piece of metadata at writeback time for
CRC enabled filesystems is designed precisely to avoid this problem.

Indeed, the LSN stamp is a far more effective method than the one the
inode core has long used to avoid this problem for unlogged inode
size updates - the di_flushiter field, which tries to prevent log
recovery from replaying inode core updates over more recently
written inodes.  Note the comment in xfs_iflush_int():

        /*
         * bump the flush iteration count, used to detect flushes which
         * postdate a log record during recovery. This is redundant as we now
         * log every change and hence this can't happen. Still, it doesn't hurt.
         */
        ip->i_d.di_flushiter++;

And this in xlog_recover_inode_pass2():

        /* Skip replay when the on disk inode is newer than the log one */
        if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) {

This recovery problem has been around forever, and we cannot fix
log recovery without age information in the metadata on disk that
recovery is going to overwrite. CRC enabled filesystems have that
information on disk, existing filesystems don't. Therefore, we can
only solve the recovery problem for CRC enabled filesystems...

We could probably also avoid the problem by modifying the way we do
writeback from the AIL to limit it only to objects at the tail LSN,
but that has a horrific performance penalty associated with it for
many common workloads because of the way relogging works. And for a
problem that I've suspected has occurred maybe 5 times in 10 years,
modifying metadata writeback to avoid this problem is a bad tradeoff
to make for just about everyone...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-15  0:56                     ` Dave Chinner
@ 2013-06-17 14:53                       ` Ben Myers
  2013-06-18  1:22                         ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-17 14:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Eric Sandeen, xfs

Hey Dave,

On Sat, Jun 15, 2013 at 10:56:39AM +1000, Dave Chinner wrote:
> On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote:
> > Unfortunately log replay is broken.  The verifier has detected this and stopped
> > replay.  Ideally the solution would be to fix log replay, but that is going to
> > take some time.  So, in the near term we're just going to disable the verifier
> > to allow replay to complete.
> > 
> > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
> > that developers still have a chance at hitting the log replay problem, and a
> > comment should be added explaining that we've disabled the verifier due to a
> > specific bug as a temporary workaround and we'll re-enable the verifier once
> > it's fixed.  I'll update the patch and repost.
> > 
> > Are you guys arguing that the log replay bug should not be fixed?
> 
> I'm not arguing that it should not be fixed,
      
Good.

> I'm *stating* that it
> *can't be fixed* for non-CRC filesystems. That is, the best we can
> do is work around this deficiency in log recovery with some kind of
> self-defusing warning....

I think you're mistaken here.  In order to prevent this from happening on
non-CRC filesystems we need only ensure that all of the buffer modifications in
the log for a given buffer are played back onto it before we write it out to
disk.  My impression is that the final state of the buffer is always the
correct one, so it's a matter of preventing any intermediate states from being
written.

> To fix it properly, you need to know the age of the object being
> overwritten relative to the age of overwrite data. For non-CRC
> filesystems we don't have that information in the metadata object
> being overwritten.

Only if you take the tack that you are not going to write intermediate state
into the buffer.  I agree this is one way to fix the problem, but I'm not sure
that it's the only option.

> We can't even correctly identify the object being
> overwritten.

Can you be more specific about what you mean there?

> So, it's simply not fixable in log recovery for non-CRC filesystems,
> and the LSN stamped in every piece of metadata at writeback time for
> CRC enabled filesystems is designed precisely to avoid this problem.

Again, I'm not so sure that there aren't other ways to fix this, which I'm
going to continue to look into.  I think you're giving up a bit prematurely.

> Indeed, the LSN stamp is a far more effective method than what used
> to be in the inode core to try to avoid the problem for unlogged
> inode size updates to try to prevent log recovery from replaying
> inode core updates over more recently written inodes - the
> di_flushiter field.  Note the comment in xfs_flush_int():
> 
>         /*
>          * bump the flush iteration count, used to detect flushes which
>          * postdate a log record during recovery. This is redundant as we now
>          * log every change and hence this can't happen. Still, it doesn't hurt.
>          */
>         ip->i_d.di_flushiter++;
> 
> And this in xlog_recover_inode_pass2():
> 
>         /* Skip replay when the on disk inode is newer than the log one */
>         if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) {

Why is using the LSN more effective than the counter?  Seems like the counter
would be adequate as is.

> This recovery problem has been around forever, and we cannot fix
> log recovery without age information in the metadata on disk that
> recovery is going to overwrite. CRC enabled filesystems have that
> information on disk; existing filesystems don't. Therefore, we can
> only solve the recovery problem for CRC enabled filesystems...

Nah, I think we can find another way.  I am also curious to look back at some
older recovery code to see if this truly is a day one bug.

> We could probably also avoid the problem by modifying the way we do
> writeback from the AIL to limit it only to objects at the tail LSN,
> but that has a horrific performance penalty associated with it for
> many common workloads because of the way relogging works. And for a
> problem that I've suspected has occurred maybe 5 times in 10 years,
> modifying metadata writeback to avoid this problem is a bad tradeoff
> to make for just about everyone...

How would limiting writeback from the AIL to only structures at the tail
resolve the issue of playback of intermediate states onto newer buffers during
recovery?  I don't understand what you're getting at here.  Anyway, I agree
that it would be better to try and avoid any massive performance impacts during
regular operation.  A performance impact during recovery is less of a concern
but should be avoided if possible.

I think there may be some creative ways to resolve this problem for older
filesystems which are worth looking into.

Regards,
Ben


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-17 14:53                       ` Ben Myers
@ 2013-06-18  1:22                         ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-18  1:22 UTC (permalink / raw)
  To: Ben Myers; +Cc: Dave Jones, Eric Sandeen, xfs

On Mon, Jun 17, 2013 at 09:53:57AM -0500, Ben Myers wrote:
> Hey Dave,
> 
> On Sat, Jun 15, 2013 at 10:56:39AM +1000, Dave Chinner wrote:
> > On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote:
> > > Unfortunately log replay is broken.  The verifier has detected this and stopped
> > > replay.  Ideally the solution would be to fix log replay, but that is going to
> > > take some time.  So, in the near term we're just going to disable the verifier
> > > to allow replay to complete.
> > > 
> > > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
> > > that developers still have a chance at hitting the log replay problem, and a
> > > comment should be added explaining that we've disabled the verifier due to a
> > > specific bug as a temporary workaround and we'll re-enable the verifier once
> > > it's fixed.  I'll update the patch and repost.
> > > 
> > > Are you guys arguing that the log replay bug should not be fixed?
> > 
> > I'm not arguing that it should not be fixed,
>       
> Good.
> 
> > I'm *stating* that it
> > *can't be fixed* for non-CRC filesystems. That is, the best we can
> > do is work around this deficiency in log recovery with some kind of
> > self-defusing warning....
> 
> I think you're mistaken here.  In order to prevent this from happening on
> non-CRC filesystems we need only ensure that all of the buffer modifications in
> the log for a given buffer are played back onto it before we write it out to
> disk.  My impression is that the final state of the buffer is always the
> correct one, so it's a matter of preventing any intermediate states from being
> written.

Sure. That's theoretically possible to do, but in practice it's
extremely problematic.

First of all, the log can be larger than system RAM, so you can't
hold all the changes in memory to replay them completely before
allowing writeback. So, we have to have writeback in some form. And
if we have to have writeback, that means we need to ensure we are
writing checkpoints as a whole unit of change....

If we don't write back checkpoints as a whole, then as a result of
omitting certain metadata we end up with an inconsistent state on
disk - exactly the problem we are trying to avoid. i.e. not writing
back metadata that changed in future checkpoints *guarantees* the
filesystem goes through inconsistent intermediate states.

And then we have to track every item in the log to determine if it
is used in multiple checkpoints. This is basically the same as the
cancelled buffer tracking, which now means every object requires
another ~32 bytes of RAM to track, and a new index mechanism
that can track several hundred thousand objects in a scalable
fashion and has low overhead random insert, lookup and delete
behaviour.

> > To fix it properly, you need to know the age of the object being
> > overwritten relative to the age of overwrite data. For non-CRC
> > filesystems we don't have that information in the metadata object
> > being overwritten.
> 
> Only if you take the tack that you are not going to write intermediate state
> into the buffer.  I agree this is one way to fix the problem, but I'm not sure
> that it's the only option.
> 
> > We can't even correctly identify the object being
> > overwritten.
> 
> Can you be more specific about what you mean there?

Not all metadata that is logged contains self-describing information
in non-CRC filesystems. If we can't identify metadata reliably, we
can't make any decisions based on the contents....

> > Indeed, the LSN stamp is a far more effective method than what used
> > to be in the inode core to try to avoid the problem for unlogged
> > inode size updates to try to prevent log recovery from replaying
> > inode core updates over more recently written inodes - the
> > di_flushiter field.  Note the comment in xfs_flush_int():
> > 
> >         /*
> >          * bump the flush iteration count, used to detect flushes which
> >          * postdate a log record during recovery. This is redundant as we now
> >          * log every change and hence this can't happen. Still, it doesn't hurt.
> >          */
> >         ip->i_d.di_flushiter++;
> > 
> > And this in xlog_recover_inode_pass2():
> > 
> >         /* Skip replay when the on disk inode is newer than the log one */
> >         if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> 
> Why is using the LSN more effective than the counter?  Seems like the counter
> would be adequate as is.

The counter is only valid within the life of a single inode. It
doesn't track the inode across allocation and freeing....

> > We could probably also avoid the problem by modifying the way we do
> > writeback from the AIL to limit it only to objects at the tail LSN,
> > but that has a horrific performance penalty associated with it for
> > many common workloads because of the way relogging works. And for a
> > problem that I've suspected has occurred maybe 5 times in 10 years,
> > modifying metadata writeback to avoid this problem is a bad tradeoff
> > to make for just about everyone...
> 
> How would limiting writeback from the AIL to only structures at the tail
> resolve the issue of playback of intermediate states onto newer buffers during
> recovery? I don't understand what you're getting at here.

The source of the problem is that we write back metadata
with too-recent information in it. What determines "too recent"? The
LSN of the log item which determines its position in the AIL. If we
want to avoid writing back something that is "too recent", then
we've got to stop newer objects on the AIL from being written to
disk before the older modifications are cleared from the log.

i.e. we can avoid the recovery problem by serialising metadata
flushing to a single checkpoint at a time: we flush the objects
at the tail LSN only, wait for them to complete, and move the
tail of the log forwards before we write the next set of objects
that are now at the tail LSN....

> I think there may be some creative ways to resolve this problem for older
> filesystems which are worth looking into.

Good luck, but I think you are embarking on a fool's errand.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 20:22                       ` Ben Myers
@ 2013-06-28 18:54                         ` Dave Jones
  2013-06-28 19:24                           ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Jones @ 2013-06-28 18:54 UTC (permalink / raw)
  To: Ben Myers; +Cc: Linus Torvalds, Eric Sandeen, xfs

On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote:

 > > > Are you guys arguing that the log replay bug should not be fixed?
 > > 
 > > Speaking for myself, I'm not arguing that, not at all.
 > > (not that I know how to fix it, either)
 > 
 > ...until log replay is fixed.  Hell.  I'll just pull it in as-is.
 > 
 > Reviewed-by: Ben Myers <bpm@sgi.com>

Is this on its way to Linus?  Given it's a regression over 3.9, it probably should be.

	Dave


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-28 18:54                         ` Dave Jones
@ 2013-06-28 19:24                           ` Ben Myers
  2013-06-28 19:28                             ` Dave Jones
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-28 19:24 UTC (permalink / raw)
  To: Dave Jones; +Cc: Eric Sandeen, Linus Torvalds, xfs

Hi Dave,

On Fri, Jun 28, 2013 at 02:54:06PM -0400, Dave Jones wrote:
> On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote:
> 
>  > > > Are you guys arguing that the log replay bug should not be fixed?
>  > > 
>  > > Speaking for myself, I'm not arguing that, not at all.
>  > > (not that I know how to fix it, either)
>  > 
>  > ...until log replay is fixed.  Hell.  I'll just pull it in as-is.
>  > 
>  > Reviewed-by: Ben Myers <bpm@sgi.com>
> 
> Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be.

This one was in my pull request for -rc6.  Looks like Linus merged it here:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d58c6ff0b779c5adae2a8596fde69cb45f2a5d68

We still need to propose this one for -stable, IIRC.

Thanks,
	Ben


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-28 19:24                           ` Ben Myers
@ 2013-06-28 19:28                             ` Dave Jones
  2013-06-28 19:31                               ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Jones @ 2013-06-28 19:28 UTC (permalink / raw)
  To: Ben Myers; +Cc: Eric Sandeen, Linus Torvalds, xfs

On Fri, Jun 28, 2013 at 02:24:20PM -0500, Ben Myers wrote:
 > Hi Dave,
 > 
 > On Fri, Jun 28, 2013 at 02:54:06PM -0400, Dave Jones wrote:
 > > On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote:
 > > 
 > >  > > > Are you guys arguing that the log replay bug should not be fixed?
 > >  > > 
 > >  > > Speaking for myself, I'm not arguing that, not at all.
 > >  > > (not that I know how to fix it, either)
 > >  > 
 > >  > ...until log replay is fixed.  Hell.  I'll just pull it in as-is.
 > >  > 
 > >  > Reviewed-by: Ben Myers <bpm@sgi.com>
 > > 
 > > Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be.
 > 
 > This one was in my pull request for -rc6.  Looks like Linus merged it here:
 > 
 > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d58c6ff0b779c5adae2a8596fde69cb45f2a5d68
 > 
 > We still need to propose this one for -stable, IIRC.

Bah, no Reported-by:

	Dave


* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-28 19:28                             ` Dave Jones
@ 2013-06-28 19:31                               ` Ben Myers
  0 siblings, 0 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-28 19:31 UTC (permalink / raw)
  To: Dave Jones; +Cc: Eric Sandeen, Linus Torvalds, xfs

On Fri, Jun 28, 2013 at 03:28:28PM -0400, Dave Jones wrote:
> On Fri, Jun 28, 2013 at 02:24:20PM -0500, Ben Myers wrote:
>  > Hi Dave,
>  > 
>  > On Fri, Jun 28, 2013 at 02:54:06PM -0400, Dave Jones wrote:
>  > > On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote:
>  > > 
>  > >  > > > Are you guys arguing that the log replay bug should not be fixed?
>  > >  > > 
>  > >  > > Speaking for myself, I'm not arguing that, not at all.
>  > >  > > (not that I know how to fix it, either)
>  > >  > 
>  > >  > ...until log replay is fixed.  Hell.  I'll just pull it in as-is.
>  > >  > 
>  > >  > Reviewed-by: Ben Myers <bpm@sgi.com>
>  > > 
>  > > Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be.
>  > 
>  > This one was in my pull request for -rc6.  Looks like Linus merged it here:
>  > 
>  > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d58c6ff0b779c5adae2a8596fde69cb45f2a5d68
>  > 
>  > We still need to propose this one for -stable, IIRC.
> 
> Bah, no Reported-by:

Sorry Dave.  I'll do better next time.

Regards,
	Ben


end of thread, other threads:[~2013-06-28 19:31 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
2013-06-13  1:04   ` Ben Myers
2013-06-13  2:08     ` Dave Chinner
2013-06-13 22:09       ` Ben Myers
2013-06-14  0:13         ` Dave Chinner
2013-06-14 12:55           ` Mark Tinguely
2013-06-14 16:09           ` Ben Myers
2013-06-14 16:15             ` Eric Sandeen
2013-06-14 19:08               ` Ben Myers
2013-06-14 19:18                 ` Eric Sandeen
2013-06-14 19:44                   ` Ben Myers
2013-06-14 19:54                     ` Eric Sandeen
2013-06-14 20:22                       ` Ben Myers
2013-06-28 18:54                         ` Dave Jones
2013-06-28 19:24                           ` Ben Myers
2013-06-28 19:28                             ` Dave Jones
2013-06-28 19:31                               ` Ben Myers
2013-06-15  0:56                     ` Dave Chinner
2013-06-17 14:53                       ` Ben Myers
2013-06-18  1:22                         ` Dave Chinner
2013-06-14 16:17             ` Dave Jones
2013-06-14 16:31               ` Ben Myers
2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
2013-06-13  0:58   ` Ben Myers
2013-06-13  1:40     ` Michael L. Semon
2013-06-13  2:27     ` Dave Chinner
2013-06-13 21:31       ` Ben Myers
2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
2013-06-13 19:16   ` Ben Myers
2013-06-14  0:21     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox