From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from aserp1040.oracle.com ([141.146.126.69]:27811 "EHLO aserp1040.oracle.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751396AbdITTXK
	(ORCPT ); Wed, 20 Sep 2017 15:23:10 -0400
Date: Wed, 20 Sep 2017 12:19:16 -0700
From: Liu Bo
To: dsterba@suse.cz, Paul Jones , "linux-btrfs@vger.kernel.org"
Subject: Re: kernel BUG at fs/btrfs/extent_io.c:1989
Message-ID: <20170920191916.GA10216@lim.localdomain>
Reply-To: bo.li.liu@oracle.com
References: <20170918170930.GC6357@dhcp-10-211-47-181.usdhcp.oraclecorp.com>
	<20170919150725.GF29043@twin.jikos.cz>
	<20170919161239.GA18597@dhcp-10-211-47-181.usdhcp.oraclecorp.com>
	<20170920125356.GI29043@twin.jikos.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20170920125356.GI29043@twin.jikos.cz>
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On Wed, Sep 20, 2017 at 02:53:57PM +0200, David Sterba wrote:
> On Tue, Sep 19, 2017 at 10:12:39AM -0600, Liu Bo wrote:
> > On Tue, Sep 19, 2017 at 05:07:25PM +0200, David Sterba wrote:
> > > On Tue, Sep 19, 2017 at 11:32:46AM +0000, Paul Jones wrote:
> > > > > This 'mirror 0' looks fishy (as mirror comes from
> > > > > btrfs_io_bio->mirror_num, which should be at least 1 if a raid1
> > > > > setup is in use).
> > > > >
> > > > > Not sure if 4.13.2-gentoo made any changes to btrfs, but can you
> > > > > please verify with the upstream kernel, say, v4.13?
> > > >
> > > > It's basically a vanilla kernel with a handful of unrelated patches.
> > > > The filesystem fell apart overnight: there were a few thousand
> > > > checksum errors and eventually it went read-only. I tried to remount
> > > > it, but got open_ctree failed. Btrfs check segfaulted, and lowmem
> > > > mode completed with so many errors that I gave up and will restore
> > > > from the backup.
> > > >
> > > > I think I know the problem now - the lvm cache was in writeback mode
> > > > (by accident), so during a defrag there would be gigabytes of
> > > > unwritten data in memory only, which was all lost when the system
> > > > crashed (motherboard failure). No wonder the filesystem didn't quite
> > > > survive.
> > >
> > > Yeah, the caching layer was my first suspicion, along with the lack of
> > > propagation of the barriers. Good that you were able to confirm that
> > > as the root cause.
> > >
> > > > I must say though, I'm seriously impressed by the data integrity of
> > > > BTRFS - there were nearly 10,000 checksum errors, 4 of which were
> > > > uncorrectable, and from what I could tell nearly all of the data was
> > > > still intact according to rsync checksums.
> > >
> > > Yay!
> >
> > But I still don't get why mirror_num is 0 - do you have an idea of how
> > the writeback cache could cause that?
>
> My first idea was that the cached blocks were zeroed, so we'd see the ino
> and mirror as 0. But this is not correct, as the blocks would not pass
> the checksum tests, so the blocks must be from some previous generation,
> i.e. the transid verify failure. And all the error reports appear after
> that, so I'm slightly suspicious about the way it's actually reported.
>
> btrfs_print_data_csum_error takes the mirror from either the io_bio or
> the compressed_bio structure, so there might be a case where the
> structures are merely initialized: if the transid check is ok, the
> structures are updated, and if the check fails we'd see the initial
> mirror number. All of that is just a hypothesis, I haven't checked with
> the code.

Thanks a lot for the input - you're right, mirror_num 0 should come from
the compressed read path, which doesn't record bbio->mirror_num but
instead uses the mirror passed in from the upper layer, and it's not
metadata as we don't yet compress metadata, so this all makes sense. I
think it also disables the ability to do read-repair from raid1 for
compressed data, and that's what caused the bug where it hits
BUG_ON(mirror_num == 0) in cleanup_io_failure(). The good news is that I
can reproduce it; I will send a patch and a testcase.
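
To make that failure mode concrete, here is a minimal user-space sketch.
It is purely illustrative - the names (demo_bio, end_read,
end_compressed_read, repair_bad_copy) are made up and this is not the
actual fs/btrfs code. The point is only that the regular read completion
reports the mirror it really used, while the compressed path forwards the
caller-supplied mirror, which is 0 on the first submission, so we get
both the "mirror 0" message and the tripped assertion in the repair path:

/*
 * Illustrative sketch only, NOT the fs/btrfs code: all names below are
 * made up for the example.
 */
#include <assert.h>
#include <stdio.h>

struct demo_bio {
	int mirror_num;	/* which copy the lower layer actually read, 1-based */
};

/* raid1 repair needs to know which copy was bad; 0 means "unknown" */
static void repair_bad_copy(int failed_mirror)
{
	assert(failed_mirror >= 1);	/* stands in for BUG_ON(mirror_num == 0) */
	printf("rewriting copy %d from the other mirror\n", failed_mirror);
}

/* regular read completion: the mirror that was really used is recorded */
static void end_read(struct demo_bio *bio, int csum_ok)
{
	if (!csum_ok) {
		printf("csum failed, mirror %d\n", bio->mirror_num);
		repair_bad_copy(bio->mirror_num);
	}
}

/* compressed read completion: only the caller-supplied mirror is forwarded */
static void end_compressed_read(int caller_mirror, int csum_ok)
{
	if (!csum_ok) {
		printf("csum failed, mirror %d\n", caller_mirror);
		repair_bad_copy(caller_mirror);	/* 0 on the first submission */
	}
}

int main(void)
{
	struct demo_bio bio = { .mirror_num = 1 };

	end_read(&bio, 0);		/* reports mirror 1, repair can proceed */
	end_compressed_read(0, 0);	/* reports mirror 0, assertion fires */
	return 0;
}

A mirror of 0 simply means "we no longer know which copy was bad", so the
repair code has nothing it can safely rewrite.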

> I don't have a theoretical explanation for the ino 0. The inode pointer
> that goes to btrfs_print_data_csum_error should be from a properly
> initialized inode, and we print the number using btrfs_ino. That will
> use the vfs i_ino value, and we should never get 0 out of that.

ino 0 comes from metadata read-repair; some cleanup may be needed to make
it less confusing.

thanks,
-liubo
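
P.S. For the archives, a sketch of the ino part as well - again purely
illustrative user-space C, not the actual fs/btrfs code. It only assumes
(and this is an assumption, not something verified against the source)
that a shared report helper reached from metadata read-repair has no
regular file inode to hand in, so the ino slot in the message falls back
to 0:

/* Illustrative only, NOT the fs/btrfs code; demo_inode is a made-up name. */
#include <stdio.h>

struct demo_inode {
	unsigned long long i_ino;	/* stand-in for the vfs i_ino value */
};

/* hypothetical report helper shared by the data and metadata paths */
static void report_csum_error(const struct demo_inode *inode,
			      unsigned long long logical, int mirror)
{
	/* data reads hand in a real inode; assume metadata repair has none */
	unsigned long long ino = inode ? inode->i_ino : 0;

	printf("csum failed ino %llu off %llu mirror %d\n", ino, logical, mirror);
}

int main(void)
{
	struct demo_inode file = { .i_ino = 257 };

	report_csum_error(&file, 4096, 1);	/* data read: a meaningful ino */
	report_csum_error(NULL, 30408704, 2);	/* metadata repair: ino 0 */
	return 0;
}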