From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-xfs-owner@vger.kernel.org>
Received: from userp1040.oracle.com ([156.151.31.81]:47541 "EHLO
        userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750713AbdISFiv (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 19 Sep 2017 01:38:51 -0400
Date: Mon, 18 Sep 2017 22:38:43 -0700
From: "Darrick J. Wong" <darrick.wong@oracle.com>
Subject: Re: xfs clones crash issue - illegal state 13 in block map
Message-ID: <20170919053843.GA7117@magnolia>
References: <CAOQ4uxgCanEpWx20QvvLvwBbjJiJ2U3oz-e1ZXrv3ZV2d3EPRQ@mail.gmail.com>
 <20170907161330.GA6540@magnolia>
 <CAOQ4uxibPmwRS3P5A-k2JQvCX_hq9Mb_cSOThe-qcVd0136PpQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAOQ4uxibPmwRS3P5A-k2JQvCX_hq9Mb_cSOThe-qcVd0136PpQ@mail.gmail.com>
Sender: linux-xfs-owner@vger.kernel.org
List-ID: <linux-xfs.vger.kernel.org>
List-Id: xfs
To: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>, linux-xfs <linux-xfs@vger.kernel.org>

On Fri, Sep 08, 2017 at 11:34:58AM +0300, Amir Goldstein wrote:
> On Thu, Sep 7, 2017 at 7:13 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Thu, Sep 07, 2017 at 03:58:56PM +0300, Amir Goldstein wrote:
> >> Hi guys,
> >>
> >> I am getting these errors often when running the crash tests
> >> with cloned files (generic/502 in my xfstests patches).
> >>
> >> Hitting these errors requires first fixing 2 other issues
> >> that shadow over this issue:
> >> "xfs: fix incorrect log_flushed on fsync" (in master)
> >> "xfs: fix leftover CoW extent after truncate"
> >> available on my tree based on Darrick's simple fix:
> >> https://github.com/amir73il/linux/commits/xfs-fsync
> >>
> >> I get the errors more often (1 out of 5) on a 100G fs on spinning disk.
> >> On a 10G fs on SSD they are less frequent.
> >> The log in this email was captured on patched stable 4.9.47 kernel,
> >> but I am getting the same errors on patched upstream kernel.
> >>
> >> I wasn't able to create a deterministic reproducer, so attaching
> >> the full log from a failed test along with an IO log that can be
> >> replayed on your disk to examine the outcome.
> >>
> >> Following is the output of fsx process #5, which is the process
> >> that wrote the problematic testfile5.mark0 to the log.
> >> This process performs only read,zero,fsync before creating
> >> the log mark.
> >> The file testfile5 was cloned from an origin 256K file before
> >> running fsx.
> >> Later, I used the random seed 35484 in this log for all
> >> processes and it seemed to increase the probability for failure.
> >>
> >> # /old/home/amir/src/xfstests-dev/ltp/fsx -N 100 -d -k -P
> >> /mnt/test/fsxtests -i /dev/mapper/logwrites-test -S 0 -j 5
> >> /mnt/scratch/testfile5
> >> Seed set to 35484
> >> file_size=262144
> >> 5: 1 read 0x3f959 thru 0x3ffff (0x6a7 bytes)
> >> 5: 2 zero from 0x3307e to 0x34f74, (0x1ef6 bytes)
> >> 5: 3 fsync
> >> 5: Dumped fsync buffer to testfile5.mark0
> >>
> >> In order to get to the crash state you need to get my
> >> xfstests replay-log patches and replay the attached log
> >> on a >= 100G scratch device:
> >>
> >> # ./src/log-writes/replay-log --log log.xfs.testfile5.mark0 --replay
> >> $SCRATCH_DEV --end-mark testfile5.mark0
> >> # mount $SCRATCH_DEV $SCRATCH_MNT
> >> # umount $SCRATCH_MNT
> >> # xfs_repair -n $SCRATCH_DEV
> >> Phase 1 - find and verify superblock...
> >> Phase 2 - using internal log
> >>         - zero log...
> >>         - scan filesystem freespace and inode maps...
> >>         - found root inode chunk
> >> Phase 3 - for each AG...
> >>         - scan (but don't clear) agi unlinked lists...
> >>         - process known inodes and perform inode discovery...
> >>         - agno = 0
> >>
> >> fatal error -- illegal state 13 in block map 376
> >>
> >> Can anyone provide some insight?
> >
> > Looks like I missed a couple of extent states in process_bmbt_reclist_int.
> >
> > What happens if you add the following (only compile tested) patch to
> > xfsprogs?
> 
> This is what happens:
> 
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
> data fork in regular inode 134 claims CoW block 376
> correcting nextents for inode 134
> bad data fork in inode 134
> would have cleared inode 134
>         - agno = 1
>         - agno = 2
>         - agno = 3
>         - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
>         - setting up duplicate extent list...
> unknown block state, ag 0, block 376
> unknown block state, ag 1, block 16
>         - check for inodes claiming duplicate blocks...
>         - agno = 0
>         - agno = 1
>         - agno = 2
> entry "testfile2" in shortform directory 128 references free inode 134
>         - agno = 3
> would have junked entry "testfile2" in directory inode 128
> imap claims in-use inode 134 is free, would correct imap
> Missing reverse-mapping record for (0/376) len 1 owner 134 off 19
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
>         - traversing filesystem ...
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> No modify flag set, skipping filesystem flush and exiting.

Hmm.

# xfs_db bugs.img
xfs_db> inode 134
xfs_db> bmap
data offset 0 startblock 24 (0/24) count 19 flag 0
data offset 19 startblock 376 (0/376) count 1 flag 0
data offset 25 startblock 8388624 (1/16) count 4 flag 1
data offset 29 startblock 53 (0/53) count 35 flag 0

Ok, so inode 134 thinks it owns block 376.

xfs_db> agf 0
xfs_db> addr refcntroot
xfs_db> p
...
recs[1-22] = [startblock,blockcount,refcount,cowflag] 
1:[24,1,11,0] 
...
22:[87,1,6,0]
xfs_db>

Ok, so the refcountbt doesn't know about block 376, so it's neither
shared nor a cow staging extent.

xfs_db> agf 0 
xfs_db> addr rmaproot
xfs_db> p
...
recs[1-38] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] 
1:[0,1,-3,0,0,0,0] 
...
38:[376,1,-9,0,0,0,0]
xfs_db> 

So it seems that the rmapbt is behind the times, since record 38 should
be [376,1,134,19,0,0,0].  Either that's because there's something wrong
with the regular deferred ops mechanism, or log recovery of deferred ops.
Let's have a look at the log...

# xfs_logprint bugs.img | grep x178 -B3
Oper (426): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
CUI:  #regs: 1  num_extents: 2  id: 0xffff8f0e842f74b8
(s: 0x2b, l: 1, f: 0x2) 
(s: 0x178, l: 1, f: 0x4) 

...ok, so we XFS_REFCOUNT_FREE_COW block 376, which removes it from from
the refcountbt...

----------------------------------------------------------------------------
Oper (427): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
BUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e81f39000
(s: 0x178, l: 1, own: 134, off: 19, f: 0x1) 

...now we map block 376 into the data fork of inode 134 at offset 19...

--
----------------------------------------------------------------------------
Oper (447): tid: 5c75a3f  len: 48  clientid: TRANS  flags: none
RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254000
(s: 0x178, l: 1, own: -9, off: 0, f: 0x8) 

...now we XFS_RMAP_EXTENT_FREE the CoW extent @ block 376 from the rmap,
which ought to remove the record...

--
----------------------------------------------------------------------------
Oper (453): tid: e0ecea6a  len: 48  clientid: TRANS  flags: none
RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254530
(s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 

...and XFS_RMAP_EXTENT_MAP_SHARED a new rmapping for block 376, which
should add the [376,1,134,19,0,0,0] that we'd expect to be there...

--
----------------------------------------------------------------------------
Oper (13): tid: 17c5790c  len: 48  clientid: TRANS  flags: none
RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
(s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 

...strange, another rmap add?  Did the RUIs fail to complete?

# xfs_logprint bugs.img | egrep '(0xffff8f0e48254530|0xffff925955341f20)'
RUI:  #regs: 1  num_extents: 1  id: 0xffff8f0e48254530
RUD:  #regs: 1                   id: 0xffff8f0e48254530
RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
RUD:  #regs: 1                   id: 0xffff925955341f20

...the first RUI is logged as operation 453 in tid e0ecea6a, and it has
an RUD at 460.  However, I don't see a COMMIT, so I gather that log
recovery tried to replay the RUI as operation 13 in tid 17c5790c.

----------------------------------------------------------------------------
Oper (13): tid: 17c5790c  len: 48  clientid: TRANS  flags: none
RUI:  #regs: 1  num_extents: 1  id: 0xffff925955341f20
(s: 0x178, l: 1, own: 134, off: 19, f: 0x2) 
----------------------------------------------------------------------------
Oper (14): tid: 17c5790c  len: 16  clientid: TRANS  flags: none
RUD:  #regs: 1                   id: 0xffff925955341f20
----------------------------------------------------------------------------
Oper (15): tid: 17c5790c  len: 16  clientid: TRANS  flags: none
RUD:  #regs: 1                   id: 0xffff8f0e48254000
----------------------------------------------------------------------------
Oper (16): tid: 17c5790c  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 3   start blkno: 40 (0x28)  len: 8  bmap size: 1  flags: 0x2000
Oper (17): tid: 17c5790c  len: 128  clientid: TRANS  flags: none
BUF DATA
 0 33424d52 31000000 ffffffff ffffffff        0 28000000  1000000 38000000 
 8 3674196c 564cfddf 7798e496 591b6410        0 42249385        0  1000000 
10 ffffffff fdffffff        0        0  1000000  2000000 ffffffff fbffffff 
18        0        0  3000000  2000000 ffffffff faffffff        0        0 

Oper (18): tid: 17c5790c  len: 128  clientid: TRANS  flags: none
BUF DATA
 0        0        0 4d010000  9000000        0 8a000000        0 2a000000 
 8 56010000  d000000 ffffffff f7ffffff        0        0 65010000 6b000000 
10 ffffffff f7ffffff        0        0 78010000  1000000        0 86000000 
18        0 13000000        0        0        0        0        0        0 

----------------------------------------------------------------------------
Oper (19): tid: 17c5790c  len: 0  clientid: TRANS  flags: COMMIT 

Ok, so those last two lines in operation 16 show that we've indeed added the
new record to the rmapbt... <shrug> <confused> <tired>??

Are we sure that the log recovery actions actually get checkpointed
correctly?  And just to check my own sanity, we generate a snapshot and
then recover it and run xfs_repair, right?

--D

> 
> >
> > (Normally I'd say send a metadump too for us mere mortals to work with,
> > though I'm about to plunge into weddingland so I likely won't be able to
> > do much until the 18th.)
> >
> 
> Attached (used xfs_metadump -ao)
> Soon we will all be gods with powers to replay history ;)
> 
> > ((Eric: If this doesn't turn out to be a totally garbage patch, feel
> > free to add it to xfsprogs.))
> >
> > --D
> >