From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.kernel.org ([198.145.29.136]:38536 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751742AbdCCPgt (ORCPT ); Fri, 3 Mar 2017 10:36:49 -0500 Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 17590201F2 for ; Fri, 3 Mar 2017 15:36:41 +0000 (UTC) Received: from mail-it0-f43.google.com (mail-it0-f43.google.com [209.85.214.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id C36F520212 for ; Fri, 3 Mar 2017 15:36:37 +0000 (UTC) Received: by mail-it0-f43.google.com with SMTP id m27so15134597iti.1 for ; Fri, 03 Mar 2017 07:36:37 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20170303003642.GB26698@lim.localdomain> References: <1436888088-8631-1-git-send-email-fdmanana@kernel.org> <20170302221821.GA26698@lim.localdomain> <20170303003642.GB26698@lim.localdomain> From: Filipe Manana Date: Fri, 3 Mar 2017 15:36:36 +0000 Message-ID: Subject: Re: [PATCH] Btrfs: fix file corruption after cloning inline extents To: Liu Bo Cc: "linux-btrfs@vger.kernel.org" , Filipe Manana Content-Type: text/plain; charset=UTF-8 Sender: linux-btrfs-owner@vger.kernel.org List-ID: On Fri, Mar 3, 2017 at 12:36 AM, Liu Bo wrote: > On Thu, Mar 02, 2017 at 02:18:21PM -0800, Liu Bo wrote: >> On Tue, Jul 14, 2015 at 04:34:48PM +0100, fdmanana@kernel.org wrote: >> > From: Filipe Manana >> > >> > Using the clone ioctl (or extent_same ioctl, which calls the same extent >> > cloning function as well) we end up allowing copy an inline extent from >> > the source file into a non-zero offset of the destination file. This is >> > something not expected and that the btrfs code is not prepared to deal >> > with - all inline extents must be at a file offset equals to 0. >> > >> >> Somehow I failed to reproduce the BUG_ON with this case. >> >> > For example, the following excerpt of a test case for fstests triggers >> > a crash/BUG_ON() on a write operation after an inline extent is cloned >> > into a non-zero offset: >> > >> > _scratch_mkfs >>$seqres.full 2>&1 >> > _scratch_mount >> > >> > # Create our test files. File foo has the same 2K of data at offset 4K >> > # as file bar has at its offset 0. >> > $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \ >> > -c "pwrite -S 0xbb 4k 2K" \ >> > -c "pwrite -S 0xcc 8K 4K" \ >> > $SCRATCH_MNT/foo | _filter_xfs_io >> > >> > # File bar consists of a single inline extent (2K size). >> > $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \ >> > $SCRATCH_MNT/bar | _filter_xfs_io >> > >> > # Now call the clone ioctl to clone the extent of file bar into file >> > # foo at its offset 4K. This made file foo have an inline extent at >> > # offset 4K, something which the btrfs code can not deal with in future >> > # IO operations because all inline extents are supposed to start at an >> > # offset of 0, resulting in all sorts of chaos. >> > # So here we validate that clone ioctl returns an EOPNOTSUPP, which is >> > # what it returns for other cases dealing with inlined extents. >> > $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \ >> > $SCRATCH_MNT/bar $SCRATCH_MNT/foo >> > >> > # Because of the inline extent at offset 4K, the following write made >> > # the kernel crash with a BUG_ON(). >> > $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io >> > >> >> On 4.10, after allowing to clone an inline extent to dst file's offset greater >> than zero, I followed the test case manually and got these >> >> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f /mnt/btrfs/foo >> (257 0): ram 4096 disk 12648448 disk_size 4096 >> (257 4096): ram 2048 disk 0 disk_size 2048 -- inline >> (257 8192): ram 4096 disk 12656640 disk_size 4096 >> file: /mnt/btrfs/foo extents 3 disk size 10240 logical size 12288 ratio 1.20 >> >> [root@localhost trinity]# xfs_io -f -c "pwrite 6k 2k" /mnt/btrfs/foo >> wrote 2048/2048 bytes at offset 6144 >> 2 KiB, 1 ops; 0.0000 sec (12.520 MiB/sec and 6410.2564 ops/sec) >> >> [root@localhost trinity]# sync >> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f /mnt/btrfs/foo >> (257 0): ram 4096 disk 12648448 disk_size 4096 >> (257 4096): ram 4096 disk 12582912 disk_size 4096 >> (257 8192): ram 4096 disk 12656640 disk_size 4096 >> file: /mnt/btrfs/foo extents 3 disk size 12288 logical size 12288 ratio 1.00 >> >> >> Looks like we now are able to cope with these inline extents? > > I went back to test against v4.1 and v4.5, it turns out that we got the below > BUG_ON() in MM and -EIO when writing to the inline extent, because of the fact > that, when writing to the page that covers the inline extent, firstly it reads > page to get an uptodate page for writing, in readpage(), for inline extent, > btrfs_get_extent() always goes to search fs tree to read inline data out to page > and then tries to insert a em, -EEXIST would be returned if there is an existing > one. > > However, after commit 8dff9c853410 ("Btrfs: deal with duplciates during > extent_map insertion in btrfs_get_extent"), we have that fixed, so now we can > read/write inline extent even they've been mixed with other regular extents. > > But...I'm not 100% sure whether such files (with mixing inline with regular) > would have any other problems rather than read/write. Let me know if you could > think of a corruption due to that. Without thinking too much and after doing some quick tests that passed successfully, I'm not seeing where things can go wrong. However it's odd to have a mix of inline and non-inline extents, since the only cases where we create inline extents is for zero offsets and their size is smaller than page_size. I am not entirely sure if, even after the side effects of that commit, it would be safe to allow clone operation to leave inline extents at the destination like this. A lot more testing and verification should be done. thanks > > Thanks, > > -liubo >> >> >> Thanks, >> >> -liubo >> >> >> > status=0 >> > exit >> > >> > The stack trace of the BUG_ON() triggered by the last write is: >> > >> > [152154.035903] ------------[ cut here ]------------ >> > [152154.036424] kernel BUG at mm/page-writeback.c:2286! >> > [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> > [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$ >> > [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2 >> > [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 >> > [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000 >> > [152154.036424] RIP: 0010:[] [] clear_page_dirty_for_io+0x1e/0x90 >> > [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246 >> > [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001 >> > [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0 >> > [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001 >> > [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68 >> > [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80 >> > [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000 >> > [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> > [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0 >> > [152154.036424] Stack: >> > [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1 >> > [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8 >> > [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000 >> > [152154.036424] Call Trace: >> > [152154.036424] [] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs] >> > [152154.036424] [] __btrfs_buffered_write+0x245/0x4c8 [btrfs] >> > [152154.036424] [] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs] >> > [152154.036424] [] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] >> > [152154.036424] [] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs] >> > [152154.036424] [] __vfs_write+0x7c/0xa5 >> > [152154.036424] [] vfs_write+0xa0/0xe4 >> > [152154.036424] [] SyS_pwrite64+0x64/0x82 >> > [152154.036424] [] system_call_fastpath+0x12/0x6f >> > [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$ >> > [152154.036424] RIP [] clear_page_dirty_for_io+0x1e/0x90 >> > [152154.036424] RSP >> > [152154.242621] ---[ end trace e3d3376b23a57041 ]--- >> > >> > Fix this by returning the error EOPNOTSUPP if an attempt to copy an >> > inline extent into a non-zero offset happens, just like what is done for >> > other scenarios that would require copying/splitting inline extents, >> > which were introduced by the following commits: >> > >> > 00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") >> > 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents") >> > >> > Cc: stable@vger.kernel.org >> > Signed-off-by: Filipe Manana >> > --- >> > fs/btrfs/ioctl.c | 14 ++++++++++++++ >> > 1 file changed, 14 insertions(+) >> > >> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> > index d389815..0770c91 100644 >> > --- a/fs/btrfs/ioctl.c >> > +++ b/fs/btrfs/ioctl.c >> > @@ -3588,6 +3588,20 @@ process_slot: >> > u64 trim = 0; >> > u64 aligned_end = 0; >> > >> > + /* >> > + * Don't copy an inline extent into an offset >> > + * greater than zero. Having an inline extent >> > + * at such an offset results in chaos as btrfs >> > + * isn't prepared for such cases. Just skip >> > + * this case for the same reasons as commented >> > + * at btrfs_ioctl_clone(). >> > + */ >> > + if (last_dest_end > 0) { >> > + ret = -EOPNOTSUPP; >> > + btrfs_end_transaction(trans, root); >> > + goto out; >> > + } >> > + >> > if (off > key.offset) { >> > skip = off - key.offset; >> > new_key.offset += skip; >> > -- >> > 2.1.3 >> > >> > -- >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> > the body of a message to majordomo@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html