All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] xfs: zero length symlinks are not valid
Date: Thu, 21 Jun 2018 07:46:18 -0400	[thread overview]
Message-ID: <20180621114618.GA2834@bfoster> (raw)
In-Reply-To: <20180620225918.GN19934@dastard>

On Thu, Jun 21, 2018 at 08:59:18AM +1000, Dave Chinner wrote:
> On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> > On Wed, Jun 20, 2018 at 09:22:42AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 19, 2018 at 12:28:40PM -0400, Brian Foster wrote:
> > > > On Tue, Jun 19, 2018 at 08:48:10AM -0700, Darrick J. Wong wrote:
> > > > > On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> > > > > > On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > > > > > Rather, I'm asking about the pre-existing code that we remove. The hunk
> > > > > > just above from xfs_inactive_symlink():
> > > > > > 
> > > > > > -	/*
> > > > > > -	 * Zero length symlinks _can_ exist.
> > > > > > -	 */
> > > > > > -	if (!pathlen) {
> > > > > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > -		return 0;
> > > > > > -	}
> > > > > > 
> > > > > > ... obviously suggests that there could have been a situation where we'd
> > > > > > see a zero-length symlink on entry to xfs_inactive_symlink(). This
> > > > > > patch, however, focuses on fixing the transient zero-length symlink
> > > > > > caused by xfs_inactive_symlink_rmt() (which comes after the above
> > > > > > check).
> > > 
> > > It can't happen through normal VFS paths, and I don't think it can
> > > happen through log recovery. That's why I replaced this with an ASSERT.
> > > 
> > > In the normal behaviour case, zero length symlinks were created
> > > /after/ this point in time, and we've always been unable to read
> > > such inodes because xfs_dinode_verify() has always rejected zero
> > > length symlink inodes.
> > > 
> > 
> > That was my initial understanding as well..
> > 
> > > However, log recovery doesn't trip over the transient inode state
> > > because it does not use xfs_dinode_verify() for inode recovery - it
> > > reads the buffer with xfs_inode_buf_ops, and that just checks the
> > > inode numbers and then recovers whatever is in the log over the top.
> > > It never checks for zero length symlinks on recovery, and it never
> > > goes through the dinode verifier (on read or write!) to catch this.
> > > 
> > > It's not until we then recover the remaining intent chain that we go
> > > through xfs_iget/xfs_iread/xfs_inactive/xfs_ifree, and that
> > > xfs_iget() call runs xfs_dinode_verify(). If we've already recovered
> > > any part of the remote symlink removal intent chain (and we must
> > > have to be replaying EFIs!) this should see a zero length symlink
> > > inode. AIUI, we have no evidence that the verifier has ever fired at
> > > this point in time, even when recovering known transient zero length
> > > states.
> > > 
> > 
> > Hmm, not sure we're talking about the same thing. I ran a quick
> > experiment to try and clear up my confusion here, at least. I injected a
> > shutdown at the point inactive creates the zero sized symlink and
> > created/removed a remote symlink. On a remount, I hit the following
> > during log recovery:
> > 
> > [  685.931834] XFS (dm-3): Starting recovery (logdev: internal)
> > [  685.993014] XFS (dm-3): Metadata corruption detected at xfs_dinode_verify+0x331/0x490 [xfs], inode 0x85 dinode
> > [  685.996287] XFS (dm-3): Unmount and run xfs_repair
> > [  685.996911] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> > ...
> > [  686.006647] Call Trace:
> > [  686.006922]  dump_stack+0x85/0xcb
> > [  686.007338]  xfs_iread+0xeb/0x220 [xfs]
> > [  686.007820]  xfs_iget+0x4bd/0x1100 [xfs]
> > [  686.008344]  xlog_recover_process_one_iunlink+0x4d/0x3c0 [xfs]
> > ...
> > 
> > That seems to show that the verifier can impede unlinked list processing
> > if the change is at least present in the log.
> 
> Which happens after we process intents - ah, the bmap intents only
> occur from swap extents, so the inode is not read via xfs_iget()
> in this case until unlinked list processing.
> 
> So, yeah, we are talking about the same thing - it's the second
> phase of recovery that reads inodes through xfs_iget() (for whatever
> reason) that will trip over the xfs_dinode_verifier.
> 
> > If I recreate that same dirty log state and mount with this patch
> > applied (note that the fs is created without this patch to simulate an
> > old kernel that has not changed i_mode in the same transaction that sets
> > di_size = 0) along with a hack to avoid the check in
> > xfs_dinode_verify(), I now hit the new assert and corruption error
> > that's been placed in xfs_inactive_symlink().
> > 
> > So to Darrick's point, that seems to show that this is a vector to the
> > pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> > seems to me that if we want to handle recovery from this state, we'd
> > still need to work around the verifier check and retain the initial
> > di_size == 0 check in xfs_inactive_symlink().
> 
> I think we should get rid of the transient state, not continue to
> work around it. Because the transient state only exists in a log
> recovery context, we can change the behaviour and not care about
> older kernels still having the problem because all that is needed to
> avoid the issue is a clean log when upgrading the kernel.
> 

Right... that sounds reasonable to me, but we still need to consider how
we handle a filesystem already in this state. BTW, was this a user
report or a manufactured corruption due to fuzzing/shutdown testing or
something?

It's probably an extremely rare real-world situation to encounter so
perhaps this approach is good enough on its own and we just punt. I
think we at least need to document that in the commit log, however.
IIRC, the filesystem still actually mounted during the test I ran the
other day, despite the verifier failure. Taking a look at that codepath,
we have this:

       /*
         * We can't read in the inode this bucket points to, or this inode
         * is messed up.  Just ditch this bucket of inodes.  We will lose
         * some inodes and space, but at least we won't hang.
         *
         * Call xlog_recover_clear_agi_bucket() to perform a transaction to
         * clear the inode pointer in the bucket.
         */
        xlog_recover_clear_agi_bucket(mp, agno, bucket);

... which explains why that occurs: we just nuke the AGI unlinked list
bucket after failing to read an inode from it. I guess that means we
leak some inodes, but those were already freed so that's not really a
data loss vector. This otherwise doesn't seem to affect the recovery or
mount sequence. The leaked inodes can be recovered by a repair at a
later point. Is that overall behavior consistent with what you've seen
on the corrupted image?

Given that additional context, I don't feel too strongly about needing
to specially handle the "zero length symlink already exists in the dirty
log due to xfs_inactive_symlink_rmt()" case. Chances are the mount that
reported the error already nuked the state in the first place, so users
shouldn't really end up "stuck" somewhere between needing a kernel fix
or an 'xfs_repair -L' run (but if that's the approach we take, please
just note the tradeoff in the commit log). Just my .02.

Brian

> > > i.e all the cases I've seen where repair complains about symlinks
> > > with "dfork format 2 and size 0" it is because the log is dirty and
> > > hasn't been replayed. Mounting the filesystem and running log
> > > recovery makes the problem go away, and this is what lead to me
> > > removing the zeor length handling - the verifier should already
> > > be firing if it is recovering an intent on a zero length symlink
> > > inode...
> > > 
> > 
> > I haven't grokked all the details here, but the behavior you describe
> > does sound like this might be a different scenario from the above.
> 
> No, we are talking about the same thing. The only difference is that
> I've been working from a bug report and corrupted fuzzer image, not
> a real world corruption vector. Hence I'm trying to work backwards
> from "here's a corrupt image that crashed" to "this is all the paths
> that can lead to that on disk state".
> 
> While writing my last reply, I suspected you were correct about the
> log recovery state, which is why I said:
> 
> > > That said, I'll try some more focussed testing with intentional
> > > corruption to see if it's just a case of my testing being (un)lucky
> > > and not triggering this issue...
> 
> As you've shown, I was being (un)lucky, so more work is needed on
> this one.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-06-21 11:46 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-18  5:57 [PATCH 0/2] xfs: symlink and inode writeback issues Dave Chinner
2018-06-18  5:57 ` [PATCH 1/2] xfs: zero length symlinks are not valid Dave Chinner
2018-06-18 13:24   ` Brian Foster
2018-06-18 22:42     ` Dave Chinner
2018-06-19 11:54       ` Brian Foster
2018-06-19 15:48         ` Darrick J. Wong
2018-06-19 16:28           ` Brian Foster
2018-06-19 23:22             ` Dave Chinner
2018-06-20 11:50               ` Brian Foster
2018-06-20 22:59                 ` Dave Chinner
2018-06-21 11:46                   ` Brian Foster [this message]
2018-06-21 22:31                     ` Dave Chinner
2018-06-21 22:53                       ` Darrick J. Wong
2018-06-22 10:44                         ` Brian Foster
2018-06-23 17:38                           ` Darrick J. Wong
2018-06-18  5:57 ` [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure Dave Chinner
2018-06-18 13:24   ` Brian Foster
2018-06-19  5:05   ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180621114618.GA2834@bfoster \
    --to=bfoster@redhat.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.