All of lore.kernel.org
 help / color / mirror / Atom feed
From: Liu Bo <bo.li.liu@oracle.com>
To: Filipe Manana <fdmanana@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix corruption after write/fsync failure + fsync + log recovery
Date: Tue, 26 Aug 2014 11:32:48 +0800	[thread overview]
Message-ID: <20140826033247.GA28985@localhost.localdomain> (raw)
In-Reply-To: <1408959780-4217-1-git-send-email-fdmanana@suse.com>

On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote:
> While writing to a file, in inode.c:cow_file_range() (and same applies to
> submit_compressed_extents()), after reserving an extent for the file data,
> we create a new extent map for the written range and insert it into the
> extent map cache. After that, we create an ordered operation, but if it
> fails (due to a transient/temporary-ENOMEM), we return without dropping
> that extent map, which points to a reserved extent that is freed when we
> return. A subsequent incremental fsync (when the btrfs inode doesn't have
> the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
> logs a file extent item based on that extent map, which points to a disk
> extent that doesn't contain valid data - it was freed by us earlier, at this
> point it might contain any random/garbage data.
> 
> Therefore, if we reach an error condition when cowing a file range after
> we added the new extent map to the cache, drop it from the cache before
> returning.
> 
> Some sequence of steps that lead to this:
> 
>     $ mkfs.btrfs -f /dev/sdd
>     $ mount -o commit=9999 /dev/sdd /mnt
>     $ cd /mnt
> 
>     $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
>     $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
>     $ sync
> 
>     $ od -t x1 foo
>     0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
>     *
>     0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>     *
>     0020000
> 
>     $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
> 
>     # Now this write + fsync fail with -ENOMEM, which was returned by
>     # btrfs_add_ordered_extent() in inode.c:cow_file_range().
>     $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
>     $ xfs_io -c "fsync" foo
>     fsync: Cannot allocate memory
> 
>     # Now do a new write + fsync, which will succeed. Our previous
>     # -ENOMEM was a transient/temporary error.
>     $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
>     $ xfs_io -c "fsync" foo
> 
>     # Our file content (in page cache) is now:
>     $ od -t x1 foo
>     0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
>     *
>     0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>     *
>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     *
>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>     *
>     0050000
> 
>     # Now reboot the machine, and mount the fs, so that fsync log replay
>     # takes place.
> 
>     # The file content is now weird, in particular the first 8Kb, which
>     # do not match our data before nor after the sync command above.
>     $ od -t x1 foo
>     0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>     *
>     0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
>     *
>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     *
>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>     *
>     0050000
> 
>     # In fact these first 4Kb are a duplicate of the last 4kb block.
>     # The last write got an extent map/file extent item that points to
>     # the same disk extent that we got in the write+fsync that failed
>     # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
>     # verify that:
> 
>     $ btrfs-debug-tree /dev/sdd
>     (...)
> 	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
> 		extent data disk byte 12582912 nr 8192
> 		extent data offset 0 nr 8192 ram 8192
> 	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 8192 ram 8192
> 	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
> 		extent data disk byte 12582912 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 
>     $ umount /dev/sdd
>     $ btrfsck /dev/sdd
>     Checking filesystem on /dev/sdd
>     UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
>     checking extents
>     extent item 12582912 has multiple extent items
>     ref mismatch on [12582912 4096] extent item 1, found 2
>     Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
>     backpointer mismatch on [12582912 4096]
>     Errors found in extent allocation tree or chunk allocation
>     checking free space cache
>     checking fs roots
>     root 5 inode 257 errors 1000, some csum missing
>     found 131074 bytes used err is 1
>     total csum bytes: 4
>     total tree bytes: 131072
>     total fs tree bytes: 32768
>     total extent tree bytes: 16384
>     btree space waste bytes: 123404
>     file data blocks allocated: 274432
>      referenced 274432
>     Btrfs v3.14.1-96-gcc7fd5a-dirty
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c678dea..16e8146 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -792,8 +792,12 @@ retry:
>  						ins.offset,
>  						BTRFS_ORDERED_COMPRESSED,
>  						async_extent->compress_type);
> -		if (ret)
> +		if (ret) {
> +			btrfs_drop_extent_cache(inode, async_extent->start,
> +						async_extent->start +
> +						async_extent->ram_size - 1, 0);
>  			goto out_free_reserve;
> +		}
>  
>  		/*
>  		 * clear dirty, set writeback and unlock the pages.

What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()?
It looks similar to this case.

thanks,
-liubo

> @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode,
>  		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
>  					       ram_size, cur_alloc_size, 0);
>  		if (ret)
> -			goto out_reserve;
> +			goto out_drop_extent_cache;
>  
>  		if (root->root_key.objectid ==
>  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  			ret = btrfs_reloc_clone_csums(inode, start,
>  						      cur_alloc_size);
>  			if (ret)
> -				goto out_reserve;
> +				goto out_drop_extent_cache;
>  		}
>  
>  		if (disk_num_bytes < cur_alloc_size)
> @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode,
>  out:
>  	return ret;
>  
> +out_drop_extent_cache:
> +	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
>  out_reserve:
>  	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
>  out_unlock:
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2014-08-26  3:34 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-25  9:43 [PATCH] Btrfs: fix corruption after write/fsync failure + fsync + log recovery Filipe Manana
2014-08-26  3:32 ` Liu Bo [this message]
2014-08-26  7:56   ` Filipe David Manana
2014-08-26  8:28     ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140826033247.GA28985@localhost.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.