From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 16/25] xfs: scrub inodes
Date: Thu, 5 Oct 2017 18:13:39 +1100 [thread overview]
Message-ID: <20171005071339.GN3666@dastard> (raw)
In-Reply-To: <20171005052219.GE7122@magnolia>
On Wed, Oct 04, 2017 at 10:22:19PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 05, 2017 at 03:04:52PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:42:30PM -0700, Darrick J. Wong wrote:
> > > + error = xfs_iget(mp, NULL, sc->sm->sm_ino, XFS_IGET_UNTRUSTED,
> > > + 0, &ips);
> >
> > I think we also want XFS_IGET_DONTCACHE here, so we don't trash the
> > inode cache with inodes that we use once for scrub and never touch
> > again.
>
> I thought about adding this, but if we let the inodes fall out of the
> cache now then we'll just have to load them back in for the bmap checks,
> right?
Well, I'm looking at ensuring that we don't blow out the memory
side of things. We've still got the inode buffer in the buffer
cache, so I don't see why we should double cache these things
and then leave both cached copied hanging around after we've
finished with them. Leave the buffer around because we do a fair few
checks with it, but don't use excessive icache memory and trash the
working set if we can avoid it...
> > > +xfs_scrub_checkpoint_log(
> > > + struct xfs_mount *mp)
> > > +{
> > > + int error;
> > > +
> > > + error = _xfs_log_force(mp, XFS_LOG_SYNC, NULL);
> > > + if (error)
> > > + return error;
> > > + xfs_ail_push_all_sync(mp->m_ail);
> > > + return 0;
> > > +}
> >
> > Oooo, that's a nasty thing to do on busy systems with large dirty
> > logs. I hope this is a "last resort" kind of thing....
>
> It is; we only do this if the inobt says there's an inode there and the
> inode verifiers fail.
Ok, so why would pushing the log and the AIL make the verifier then
succeed? how likely is this to occur on a busy system?
> > > +/* Set us up with an inode. */
> >
> > What state are we trying to get the inode into here? We grab all the
> > various locks, but we can still have data changing via mmap pages
> > that are already faulted in and delalloc extents in the incore
> > extent list that aren't reflected on disk...
> >
> > A comment explaining what we expect here would be nice.
>
> /*
> * Grab total control of the inode metadata. It doesn't matter here if
> * the file data is still changing, we just want exclusive access to the
> * metadata.
> */
*nod*
> > > + /* Got the inode, lock it and we're ready to go. */
> > > + sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > + xfs_ilock(sc->ip, sc->ilock_flags);
> > > + error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> > > + if (error)
> > > + goto out_unlock;
> > > + sc->ilock_flags |= XFS_ILOCK_EXCL;
> > > + xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> >
> > Should the inode be joined to the transaction so that cancelling the
> > transaction unlocks the inode? Then the need for the ilock_flags
> > variable goes away....
>
> This is the confluence of two semi-icky things: first, some of the
> scrubbers (particularly the dir and parent pointer scrubbers) will need
> to drop the ILOCK for short periods of time; later on, repair will want
> to keep the inode locked across all the repair transactions, so it makes
> more sense to control the lock and unlock directly.
Ok, I'll pass on this for now, see how the rest of the code falls
out.
> > > + /* di_size */
> > > + isize = be64_to_cpu(dip->di_size);
> > > + if (isize & (1ULL << 63))
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> >
> > Should we be checking against the on disk format size, or the
> > mounted filesystem maximum size (i.e. mp->m_super->s_maxbytes)?
> > 32 or 64 bit systems are going to have different maximum valid file
> > sizes..
>
> It's perfectly valid to 'truncate -s $((2 ** 60) foofile' so the only
Ugh. We can't do IO past 16TB on 32 bit systems, so I'm kinda
surprised truncate doesn't have the same s_maxbytes restriction...
> thing we can really check for here is that the upper bit isn't set
> (because the VFS does not check, but barfs on, files with that large of
> a size).
xfs_max_file_offset() sets the max file offset to 2^63 - 1, so it
looks like the lack of checking in truncate is the problem here,
not the IO path.
> > Directories have a maximum bound size, too - the data space, leaf
> > space and freespace space, each of which are 32GB in size, IIRC.
> >
> > And symlinks have a different maximum size, too.
>
> Fair enough, I'll expand the i_size checks, though ISTR the verifiers
> now check that for us.
If they do, then just drop a comment in there to say what is checked
by the verifier.
> > > + if (!S_ISDIR(mode) && !S_ISREG(mode) && !S_ISLNK(mode) && isize != 0)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> >
> > > +
> > > + /* di_nblocks */
> > > + if (flags2 & XFS_DIFLAG2_REFLINK) {
> > > + ; /* nblocks can exceed dblocks */
> > > + } else if (flags & XFS_DIFLAG_REALTIME) {
> > > + if (be64_to_cpu(dip->di_nblocks) >=
> > > + mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> >
> > That doesn't seem right. the file can be on either the data or the
> > rt device, so the maximum file blocks is the size of one device or
> > the other, not both combined.
>
> di_nblocks is the sum of (data blocks + bmbt blocks + attr blocks),
> right?
Yeah, forgot it was more than just data extents.
> So in theory if you had a rt file with 1000 data blocks, 10 bmbt
> blocks to map the data blocks, and 100 attr blocks then di_nblocks has
> to be 1110.
Yup, but the additional metadata on the data device is not going to
be anywhere near the size of the data device.
/me shrugs
I can't think of an easy way to get a maximum block count, so I
guess that'll have to do...
> > > + if (nextents > fork_recs)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > + break;
> > > + case XFS_DINODE_FMT_BTREE:
> > > + if (nextents <= fork_recs)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > + break;
> > > + case XFS_DINODE_FMT_LOCAL:
> > > + case XFS_DINODE_FMT_DEV:
> > > + case XFS_DINODE_FMT_UUID:
> > > + default:
> > > + if (nextents != 0)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > + break;
> > > + }
> > > +
> > > + /* di_anextents */
> > > + nextents = be16_to_cpu(dip->di_anextents);
> > > + fork_recs = XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> > > + switch (dip->di_aformat) {
> > > + case XFS_DINODE_FMT_EXTENTS:
> > > + if (nextents > fork_recs)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > + break;
> > > + case XFS_DINODE_FMT_BTREE:
> > > + if (nextents <= fork_recs)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > + break;
> > > + case XFS_DINODE_FMT_LOCAL:
> > > + case XFS_DINODE_FMT_DEV:
> > > + case XFS_DINODE_FMT_UUID:
> > > + default:
> > > + if (nextents != 0)
> > > + xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > + break;
> > > + }
> >
> > Don't we need a check here first to see whether an attribute fork
> > exists or not?
>
> Do you mean the xfs_inode_fork, or something else?
SOmething else. :P
> XFS_DFORK_ASIZE returns zero if !XFS_DFORK_Q which in turn is based on
> di_forkoff so we're really only checking that di_aformat makes sense
> given the number of extents and the size of the attr fork area.
Right, but if XFS_DFORK_ASIZE == 0, the dip->di_aformat *must* be
XFS_DINODE_FMT_EXTENTS. That's the only valid configuration when
there is no attribute fork present.
If there is an attribute fork present, then it can be XFS_DINODE_FMT_LOCAL,
EXTENT or BTREE, and then the extent count needs checking.
XFS_DINODE_FMT_DEV and XFS_DINODE_FMT_UUID are both invalid for the
attribute fork.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2017-10-05 7:14 UTC|newest]
Thread overview: 91+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
2017-10-03 20:40 ` [PATCH 01/25] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
2017-10-03 20:41 ` [PATCH 02/25] xfs: dispatch metadata scrub subcommands Darrick J. Wong
2017-10-03 20:41 ` [PATCH 03/25] xfs: probe the scrub ioctl Darrick J. Wong
2017-10-03 23:32 ` Dave Chinner
2017-10-04 0:02 ` Darrick J. Wong
2017-10-04 1:56 ` Dave Chinner
2017-10-04 3:14 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 04/25] xfs: create helpers to record and deal with scrub problems Darrick J. Wong
2017-10-03 23:44 ` Dave Chinner
2017-10-04 0:56 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 05/25] xfs: create helpers to scrub a metadata btree Darrick J. Wong
2017-10-03 23:49 ` Dave Chinner
2017-10-04 0:13 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 06/25] xfs: scrub the shape of " Darrick J. Wong
2017-10-04 0:15 ` Dave Chinner
2017-10-04 3:51 ` Darrick J. Wong
2017-10-04 5:48 ` Dave Chinner
2017-10-04 17:48 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 07/25] xfs: scrub btree keys and records Darrick J. Wong
2017-10-04 20:52 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 08/25] xfs: create helpers to scan an allocation group Darrick J. Wong
2017-10-04 0:46 ` Dave Chinner
2017-10-04 3:58 ` Darrick J. Wong
2017-10-04 5:59 ` Dave Chinner
2017-10-04 17:51 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 09/25] xfs: scrub the backup superblocks Darrick J. Wong
2017-10-04 0:57 ` Dave Chinner
2017-10-04 4:06 ` Darrick J. Wong
2017-10-04 6:13 ` Dave Chinner
2017-10-04 17:56 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 10/25] xfs: scrub AGF and AGFL Darrick J. Wong
2017-10-04 1:31 ` Dave Chinner
2017-10-04 4:21 ` Darrick J. Wong
2017-10-04 6:28 ` Dave Chinner
2017-10-04 17:57 ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 11/25] xfs: scrub the AGI Darrick J. Wong
2017-10-04 1:43 ` Dave Chinner
2017-10-04 4:25 ` Darrick J. Wong
2017-10-04 6:43 ` Dave Chinner
2017-10-04 18:02 ` Darrick J. Wong
2017-10-04 22:16 ` Dave Chinner
2017-10-04 23:12 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 12/25] xfs: scrub free space btrees Darrick J. Wong
2017-10-05 0:59 ` Dave Chinner
2017-10-05 1:13 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 13/25] xfs: scrub inode btrees Darrick J. Wong
2017-10-05 2:08 ` Dave Chinner
2017-10-05 5:47 ` Darrick J. Wong
2017-10-05 7:22 ` Dave Chinner
2017-10-05 18:26 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 14/25] xfs: scrub rmap btrees Darrick J. Wong
2017-10-05 2:56 ` Dave Chinner
2017-10-05 5:02 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 15/25] xfs: scrub refcount btrees Darrick J. Wong
2017-10-05 2:59 ` Dave Chinner
2017-10-05 5:02 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 16/25] xfs: scrub inodes Darrick J. Wong
2017-10-05 4:04 ` Dave Chinner
2017-10-05 5:22 ` Darrick J. Wong
2017-10-05 7:13 ` Dave Chinner [this message]
2017-10-05 19:56 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 17/25] xfs: scrub inode block mappings Darrick J. Wong
2017-10-06 2:51 ` Dave Chinner
2017-10-06 17:00 ` Darrick J. Wong
2017-10-07 23:10 ` Dave Chinner
2017-10-08 3:54 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 18/25] xfs: scrub directory/attribute btrees Darrick J. Wong
2017-10-06 5:07 ` Dave Chinner
2017-10-06 18:30 ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 19/25] xfs: scrub directory metadata Darrick J. Wong
2017-10-06 7:07 ` Dave Chinner
2017-10-06 19:45 ` Darrick J. Wong
2017-10-06 22:16 ` Dave Chinner
2017-10-03 20:42 ` [PATCH 20/25] xfs: scrub directory freespace Darrick J. Wong
2017-10-09 1:44 ` Dave Chinner
2017-10-09 22:54 ` Darrick J. Wong
2017-10-03 20:43 ` [PATCH 21/25] xfs: scrub extended attributes Darrick J. Wong
2017-10-09 2:13 ` Dave Chinner
2017-10-09 21:14 ` Darrick J. Wong
2017-10-03 20:43 ` [PATCH 22/25] xfs: scrub symbolic links Darrick J. Wong
2017-10-09 2:17 ` Dave Chinner
2017-10-03 20:43 ` [PATCH 23/25] xfs: scrub parent pointers Darrick J. Wong
2017-10-03 20:43 ` [PATCH 24/25] xfs: scrub realtime bitmap/summary Darrick J. Wong
2017-10-09 2:28 ` Dave Chinner
2017-10-09 20:24 ` Darrick J. Wong
2017-10-03 20:43 ` [PATCH 25/25] xfs: scrub quota information Darrick J. Wong
2017-10-09 2:51 ` Dave Chinner
2017-10-09 20:03 ` Darrick J. Wong
2017-10-09 22:17 ` Dave Chinner
2017-10-09 23:08 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171005071339.GN3666@dastard \
--to=david@fromorbit.com \
--cc=darrick.wong@oracle.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).