From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from extserverfr1.prnet.org ([188.165.43.41]:44939 "EHLO extserverfr1.prnet.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753208AbaJNRAY (ORCPT ); Tue, 14 Oct 2014 13:00:24 -0400
Message-ID: <543D569A.8000103@prnet.org>
Date: Tue, 14 Oct 2014 19:00:10 +0200
From: David Arendt
MIME-Version: 1.0
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs random filesystem corruption in kernel 3.17
References: <543450DC.90504@prnet.org> <1412714780.2374.0@mail.thefacebook.com> <543A61EE.7070200@prnet.org> <543C35C3.9070002@prnet.org>
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

The corruption seems to be worse than expected. In kernel 3.16.5 I cannot
mount this filesystem read/write. I'm in the process of doing a
tar - mkfs.btrfs - untar recovery and staying on 3.16.5 for now.

[ 55.465584] parent transid verify failed on 51150848 wanted 272368 found 276401
[ 55.468415] parent transid verify failed on 918274048 wanted 273135 found 274590
[ 55.470915] parent transid verify failed on 508444672 wanted 274054 found 276617
[ 55.473758] parent transid verify failed on 18317623296 wanted 275876 found 278431
[ 55.476240] parent transid verify failed on 127254528 wanted 276488 found 276490
[ 55.479494] ------------[ cut here ]------------
[ 55.479499] WARNING: CPU: 1 PID: 1723 at fs/btrfs/extent-tree.c:876 btrfs_lookup_extent_info+0x44c/0x490()
[ 55.479500] Modules linked in:
[ 55.479502] CPU: 1 PID: 1723 Comm: ls Not tainted 3.16.5 #1
[ 55.479502] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[ 55.479503] 0000000000000000 0000000000000009 ffffffff816ff873 0000000000000000
[ 55.479504] ffffffff81078261 ffff8807f7084770 ffff8807ed8ca000 000000003dcf4000
[ 55.479506] ffff8807f7133de0 0000000000000000 ffffffff812be9bc 0000000000004000
[ 55.479507] Call Trace:
[ 55.479511] [] ? dump_stack+0x41/0x51
[ 55.479514] [] ? warn_slowpath_common+0x81/0xb0
[ 55.479515] [] ? btrfs_lookup_extent_info+0x44c/0x490
[ 55.479516] [] ? btrfs_alloc_free_block+0x2c8/0x450
[ 55.479519] [] ? update_ref_for_cow+0x1ff/0x3f0
[ 55.479520] [] ? __btrfs_cow_block+0x23a/0x5a0
[ 55.479522] [] ? btrfs_buffer_uptodate+0x6d/0x80
[ 55.479524] [] ? btrfs_cow_block+0x126/0x190
[ 55.479525] [] ? btrfs_search_slot+0x1fd/0xaa0
[ 55.479527] [] ? btrfs_truncate_inode_items+0x123/0x8e0
[ 55.479529] [] ? btrfs_evict_inode+0x32a/0x490
[ 55.479532] [] ? unlock_new_inode+0x3a/0x60
[ 55.479533] [] ? __inode_wait_for_writeback+0x65/0xb0
[ 55.479536] [] ? wake_atomic_t_function+0x30/0x30
[ 55.479537] [] ? evict+0xa6/0x160
[ 55.479539] [] ? btrfs_orphan_cleanup+0x1ed/0x430
[ 55.479540] [] ? btrfs_lookup_dentry+0x358/0x4c0
[ 55.479542] [] ? btrfs_lookup+0x9/0x30
[ 55.479543] [] ? lookup_real+0x14/0x50
[ 55.479545] [] ? __lookup_hash+0x32/0x50
[ 55.479546] [] ? lookup_slow+0x48/0xc0
[ 55.479547] [] ? path_lookupat+0x73c/0x770
[ 55.479550] [] ? posix_acl_xattr_get+0x40/0xb0
[ 55.479551] [] ? generic_getxattr+0x50/0x80
[ 55.479552] [] ? filename_lookup.isra.51+0x2e/0x90
[ 55.479554] [] ? user_path_at_empty+0x5f/0xb0
[ 55.479555] [] ? user_path_at_empty+0x69/0xb0
[ 55.479556] [] ? vfs_fstatat+0x40/0x90
[ 55.479557] [] ? SyS_newlstat+0x12/0x30
[ 55.479559] [] ? path_put+0xd/0x20
[ 55.479560] [] ? SyS_getxattr+0x57/0x80
[ 55.479562] [] ? system_call_fastpath+0x16/0x1b
[ 55.479563] ---[ end trace a8ad56fd476f7474 ]---
[ 55.479564] BTRFS: error (device sda2) in update_ref_for_cow:1018: errno=-30 Readonly filesystem
[ 55.479565] BTRFS info (device sda2): forced readonly
[ 55.479565] ------------[ cut here ]------------
[ 55.479567] WARNING: CPU: 1 PID: 1723 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x5a/0x140()
[ 55.479567] BTRFS: Transaction aborted (error -30)
[ 55.479568] Modules linked in:
[ 55.479569] CPU: 1 PID: 1723 Comm: ls Tainted: G W 3.16.5 #1
[ 55.479569] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[ 55.479570] 0000000000000000 0000000000000009 ffffffff816ff873 ffff8807f2dcf788
[ 55.479571] ffffffff81078261 00000000ffffffe2 ffff8807ed8ca000 ffff8807f7133de0
[ 55.479572] ffffffff8184d800 0000000000000488 ffffffff81078345 ffffffff8197afd8
[ 55.479573] Call Trace:
[ 55.479574] [] ? dump_stack+0x41/0x51
[ 55.479576] [] ? warn_slowpath_common+0x81/0xb0
[ 55.479578] [] ? warn_slowpath_fmt+0x45/0x50
[ 55.479579] [] ? __btrfs_abort_transaction+0x5a/0x140
[ 55.479580] [] ? __btrfs_cow_block+0x432/0x5a0
[ 55.479582] [] ? btrfs_buffer_uptodate+0x6d/0x80
[ 55.479583] [] ? btrfs_cow_block+0x126/0x190
[ 55.479584] [] ? btrfs_search_slot+0x1fd/0xaa0
[ 55.479586] [] ? btrfs_truncate_inode_items+0x123/0x8e0
[ 55.479587] [] ? btrfs_evict_inode+0x32a/0x490
[ 55.479588] [] ? unlock_new_inode+0x3a/0x60
[ 55.479590] [] ? __inode_wait_for_writeback+0x65/0xb0
[ 55.479591] [] ? wake_atomic_t_function+0x30/0x30
[ 55.479592] [] ? evict+0xa6/0x160
[ 55.479594] [] ? btrfs_orphan_cleanup+0x1ed/0x430
[ 55.479595] [] ? btrfs_lookup_dentry+0x358/0x4c0
[ 55.479596] [] ? btrfs_lookup+0x9/0x30
[ 55.479598] [] ? lookup_real+0x14/0x50
[ 55.479599] [] ? __lookup_hash+0x32/0x50
[ 55.479600] [] ? lookup_slow+0x48/0xc0
[ 55.479601] [] ? path_lookupat+0x73c/0x770
[ 55.479603] [] ? posix_acl_xattr_get+0x40/0xb0
[ 55.479605] [] ? generic_getxattr+0x50/0x80
[ 55.479606] [] ? filename_lookup.isra.51+0x2e/0x90
[ 55.479607] [] ? user_path_at_empty+0x5f/0xb0
[ 55.479608] [] ? user_path_at_empty+0x69/0xb0
[ 55.479609] [] ? vfs_fstatat+0x40/0x90
[ 55.479610] [] ? SyS_newlstat+0x12/0x30
[ 55.479611] [] ? path_put+0xd/0x20
[ 55.479613] [] ? SyS_getxattr+0x57/0x80
[ 55.479614] [] ? system_call_fastpath+0x16/0x1b
[ 55.479615] ---[ end trace a8ad56fd476f7475 ]---
[ 55.479620] BTRFS error (device sda2): Error removing orphan entry, stopping orphan cleanup
[ 55.479621] BTRFS critical (device sda2): could not do orphan cleanup -22
[ 83.454294] parent transid verify failed on 51150848 wanted 272368 found 276401
[ 83.454945] parent transid verify failed on 918274048 wanted 273135 found 274590
[ 83.455601] parent transid verify failed on 508444672 wanted 274054 found 276617
[ 83.456251] parent transid verify failed on 18317623296 wanted 275876 found 278431
[ 83.456897] parent transid verify failed on 127254528 wanted 276488 found 276490
[ 84.647964] parent transid verify failed on 51150848 wanted 272368 found 276401
[ 84.648612] parent transid verify failed on 918274048 wanted 273135 found 274590
[ 84.649267] parent transid verify failed on 508444672 wanted 274054 found 276617
[ 84.649913] parent transid verify failed on 18317623296 wanted 275876 found 278431
[ 84.650557] parent transid verify failed on 127254528 wanted 276488 found 276490

On 10/14/14 12:36 AM, Duncan wrote:
> Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted:
>
>> On Mon, Oct 13, 2014 at 4:27 PM, David Arendt wrote:
>>> From my own experience and based on what other people are saying, I
>>> think there is a random btrfs filesystem corruption problem in kernel
>>> 3.17 at least related to snapshots, therefore I decided to post using
>>> another subject to draw attention from people not concerned about btrfs
>>> send to it. More information can be found in the btrfs send posts.
>>>
>>> Did the filesystem you tried to balance contain snapshots? Read-only
>>> ones?
>> The filesystem contains numerous subvolumes and snapshots, many of which
>> are read-only. I'm managing many with snapper.
>>
>> The similarity of the transid verify errors made me think this issue is
>> related, and the root cause may have nothing to do with btrfs send.
>>
>> As far as I can tell these errors aren't having any effect on my data -
>> hopefully the system is catching the problems before there are actual
>> disk writes/etc.
> Summarizing what I've seen on the threads...
>
> 1) The bug seems to be read-only snapshot related. The connection to
> send is that send creates read-only snapshots, but people creating
> read-only snapshots for other purposes are now reporting the same
> problem, so it's not send, it's the read-only snapshots.
>
> 2) Writable snapshots haven't been implicated yet, and the working set
> from which the snapshots are taken doesn't seem to be affected, either.
> So in that sense it's not affecting ordinary usage, only the read-only
> snapshots themselves.
>
> 3) More problematic, however, is the fact that these apparently corrupted
> read-only snapshots often are not listed properly and can't be deleted,
> tho I'm not sure if that's /all/ the corrupted snapshots or only part of
> them. So while it may not affect ordinary operation in the short term,
> over time until there's a fix, people routinely doing read-only snapshots
> are going to be getting more and more of these undeletable snapshots, and
> depending on whether the eventual patch only prevents more or can
> actually fix the bad ones (possibly via btrfs check or the like),
> affected filesystems may ultimately have to be blown away and recreated
> with a fresh mkfs, in order to kill the currently undeletable snapshots.
>
> So the first thing to do would be to shut off whatever's making read-only
> snapshots, so you don't make the problem worse while it's being
> investigated.
> For those who can do that without too big an interruption
> to their normal routine (who don't depend on send/receive, for instance),
> just keep it off for the time being. For those who depend on read-only
> snapshots (send-receive for backup and the data is too valuable to not do
> the backups for a few days), consider switching back to 3.16-stable --
> from 3.16.3 at least, the patch for the compress bug is there, so that
> shouldn't be a problem.
>
> And if you're affected, be aware that until we have a fix, we don't know
> if it'll be possible to remove the affected and currently undeletable
> snapshots. If it's not, at some point you'll need to do a fresh
> mkfs.btrfs, to get rid of the damage. Since the bug doesn't appear to
> affect writable snapshots or the "head" from which snapshots are made,
> it's not urgent, and a full fix is likely to include a patch to detect
> and fix the problem as well, but until we know what the problem is we
> can't be sure of that, so be prepared to do that mkfs at some point, as
> at this point it's possible that's the only way you'll be able to kill
> the corrupted snapshots.
>
> 4) Total speculation on my part, but given the wanted transid (aka
> generation, in different contexts) is significantly lower than the found
> transid, and the fact that the problem appears to be limited to
> /read-only/ snapshots, my first suspicion is that something's getting
> updated that would normally apply to all snapshots, but the read-only
> nature of the snapshots is preventing the full update there. The transid
> of the block is updated, but the snapshot being read-only is preventing
> update of the pointer in that snapshot accordingly.
>
> What I do /not/ know is whether the bug is that something's getting
> updated that should NOT be, and it's simply the read-only snapshots
> letting us know about it since the writable snapshots are fully updated,
> even if that breaks the snapshot (breaking writable snapshots in a
> different and currently undetected way), or if instead, it's a legitimate
> update, like a balance simply moving the snapshot around but not
> affecting it otherwise, and the bug is that the read-only snapshots
> aren't allowing the legitimate update.
>
> Either way, this more or less developed over the weekend, and it's Monday
> now, so the devs should be on it. If it's anything like the 3.15/3.16
> compression bug, it'll take some time for them to properly trace it, and
> then to figure out an appropriate fix, but they will. Chances are we'll
> have at least some decent progress on a trace by Friday, and maybe even a
> good-to-go patch. =:^)
>
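
Point 4 in the quoted message rests on the wanted transid being lower than the found one in every report. For anyone who wants to check that against their own dmesg output, here is a minimal sketch in Python. It assumes the message format shown in the log above; the names `TRANSID_RE`, `transid_mismatches` and `SAMPLE` are made up for this illustration and are not part of btrfs-progs or any other tool.

```python
import re

# Assumption: the kernel prints the message in exactly the form seen in the
# log in this mail ("... failed on <block> wanted <N> found <M>").
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def transid_mismatches(log_text):
    """Return {block: (wanted, found)} with one entry per distinct block.

    Repeated reports for the same block (the log repeats them at every
    access) collapse into a single entry; the last report wins.
    """
    result = {}
    for match in TRANSID_RE.finditer(log_text):
        block, wanted, found = (int(group) for group in match.groups())
        result[block] = (wanted, found)
    return result


# Lines copied from the log in this mail, including one repeated report.
SAMPLE = """\
[ 55.465584] parent transid verify failed on 51150848 wanted 272368 found 276401
[ 55.468415] parent transid verify failed on 918274048 wanted 273135 found 274590
[ 55.470915] parent transid verify failed on 508444672 wanted 274054 found 276617
[ 55.473758] parent transid verify failed on 18317623296 wanted 275876 found 278431
[ 55.476240] parent transid verify failed on 127254528 wanted 276488 found 276490
[ 83.454294] parent transid verify failed on 51150848 wanted 272368 found 276401
"""

if __name__ == "__main__":
    for block, (wanted, found) in sorted(transid_mismatches(SAMPLE).items()):
        gap = found - wanted
        print(f"block {block}: wanted {wanted}, found {found}, gap {gap}")
```

To run it on a live system, the output of `dmesg` can be passed to `transid_mismatches()` in place of `SAMPLE`. The script only tabulates what the kernel reported; it says nothing about the cause of the mismatch.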