* [PATCH repair: do not walk the unlinked inode list
@ 2011-11-09 8:37 Christoph Hellwig
2011-11-09 23:11 ` Dave Chinner
0 siblings, 1 reply; 3+ messages in thread
From: Christoph Hellwig @ 2011-11-09 8:37 UTC (permalink / raw)
To: xfs; +Cc: Stefan Pfetzing
Stefan Pfetzing reported a bug where xfs_repair got stuck eating 100% CPU in
phase3. We track it down to a loop in the unlinked inode list, apparently
caused by memory corruption on an iSCSI target.
I looked into tracking if we already saw a given unlinked inode, but given
that we keep walking even for inodes where we can't find an allocation btree
record that seems infeasible. On the other hand these inodes had their
final unlink and thus were dead even before the system went down. There
really is no point in adding them to the uncertain list and looking for
references to them later.
So the simplest fix seems to be to simply remove the unlinked inode list
walk and just clear it - when we rebuild the inode allocation btrees these
will simply be marked free.
Reported-by: Stefan Pfetzing <stefan.pfetzing@1und1.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: xfsprogs-dev/repair/phase3.c
===================================================================
--- xfsprogs-dev.orig/repair/phase3.c 2011-11-09 08:59:35.313604606 +0100
+++ xfsprogs-dev/repair/phase3.c 2011-11-09 09:03:44.800605237 +0100
@@ -28,80 +28,15 @@
#include "progress.h"
#include "prefetch.h"
-/*
- * walks an unlinked list, returns 1 on an error (bogus pointer) or
- * I/O error
- */
-int
-walk_unlinked_list(xfs_mount_t *mp, xfs_agnumber_t agno, xfs_agino_t start_ino)
-{
- xfs_buf_t *bp;
- xfs_dinode_t *dip;
- xfs_agino_t current_ino = start_ino;
- xfs_agblock_t agbno;
- int state;
-
- while (current_ino != NULLAGINO) {
- if (verify_aginum(mp, agno, current_ino))
- return(1);
- if ((bp = get_agino_buf(mp, agno, current_ino, &dip)) == NULL)
- return(1);
- /*
- * if this looks like a decent inode, then continue
- * following the unlinked pointers. If not, bail.
- */
- if (verify_dinode(mp, dip, agno, current_ino) == 0) {
- /*
- * check if the unlinked list points to an unknown
- * inode. if so, put it on the uncertain inode list
- * and set block map appropriately.
- */
- if (find_inode_rec(mp, agno, current_ino) == NULL) {
- add_aginode_uncertain(agno, current_ino, 1);
- agbno = XFS_AGINO_TO_AGBNO(mp, current_ino);
-
- pthread_mutex_lock(&ag_locks[agno]);
- state = get_bmap(agno, agbno);
- switch (state) {
- case XR_E_BAD_STATE:
- do_error(_(
- "bad state in block map %d\n"),
- state);
- break;
- default:
- /*
- * the block looks like inodes
- * so be conservative and try
- * to scavenge what's in there.
- * if what's there is completely
- * bogus, it'll show up later
- * and the inode will be trashed
- * anyway, hopefully without
- * losing too much other data
- */
- set_bmap(agno, agbno, XR_E_INO);
- break;
- }
- pthread_mutex_unlock(&ag_locks[agno]);
- }
- current_ino = be32_to_cpu(dip->di_next_unlinked);
- } else {
- current_ino = NULLAGINO;;
- }
- libxfs_putbuf(bp);
- }
-
- return(0);
-}
-
-void
-process_agi_unlinked(xfs_mount_t *mp, xfs_agnumber_t agno)
+static void
+process_agi_unlinked(
+ struct xfs_mount *mp,
+ xfs_agnumber_t agno)
{
- xfs_agnumber_t i;
- xfs_buf_t *bp;
- xfs_agi_t *agip;
- int err = 0;
- int agi_dirty = 0;
+ struct xfs_buf *bp;
+ struct xfs_agi *agip;
+ xfs_agnumber_t i;
+ int agi_dirty = 0;
bp = libxfs_readbuf(mp->m_dev,
XFS_AG_DADDR(mp, agno, XFS_AGI_DADDR(mp)),
@@ -112,28 +47,16 @@ process_agi_unlinked(xfs_mount_t *mp, xf
agip = XFS_BUF_TO_AGI(bp);
- ASSERT(no_modify || be32_to_cpu(agip->agi_seqno) == agno);
+ ASSERT(be32_to_cpu(agip->agi_seqno) == agno);
for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
- if (be32_to_cpu(agip->agi_unlinked[i]) != NULLAGINO) {
- err += walk_unlinked_list(mp, agno,
- be32_to_cpu(agip->agi_unlinked[i]));
- /*
- * clear the list
- */
- if (!no_modify) {
- agip->agi_unlinked[i] = cpu_to_be32(NULLAGINO);
- agi_dirty = 1;
- }
+ if (agip->agi_unlinked[i] != cpu_to_be32(NULLAGINO)) {
+ agip->agi_unlinked[i] = cpu_to_be32(NULLAGINO);
+ agi_dirty = 1;
}
}
- if (err)
- do_warn(_("error following ag %d unlinked list\n"), agno);
-
- ASSERT(agi_dirty == 0 || (agi_dirty && !no_modify));
-
- if (agi_dirty && !no_modify)
+ if (agi_dirty)
libxfs_writebuf(bp, 0);
else
libxfs_putbuf(bp);
@@ -209,14 +132,14 @@ phase3(xfs_mount_t *mp)
set_progress_msg(PROG_FMT_AGI_UNLINKED, (__uint64_t) glob_agcount);
- /*
- * first, let's look at the possibly bogus inodes
- */
+ /* first clear the agi unlinked AGI list */
+ if (!no_modify) {
+ for (i = 0; i < mp->m_sb.sb_agcount; i++)
+ process_agi_unlinked(mp, i);
+ }
+
+ /* now look at possibly bogus inodes */
for (i = 0; i < mp->m_sb.sb_agcount; i++) {
- /*
- * walk unlinked list to add more potential inodes to list
- */
- process_agi_unlinked(mp, i);
check_uncertain_aginodes(mp, i);
PROG_RPT_INC(prog_rpt_done[i], 1);
}
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH repair: do not walk the unlinked inode list
2011-11-09 8:37 [PATCH repair: do not walk the unlinked inode list Christoph Hellwig
@ 2011-11-09 23:11 ` Dave Chinner
2011-11-14 18:55 ` Christoph Hellwig
0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2011-11-09 23:11 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Stefan Pfetzing, xfs
On Wed, Nov 09, 2011 at 03:37:29AM -0500, Christoph Hellwig wrote:
> Stefan Pfetzing reported a bug where xfs_repair got stuck eating 100% CPU in
> phase3. We track it down to a loop in the unlinked inode list, apparently
> caused by memory corruption on an iSCSI target.
>
> I looked into tracking if we already saw a given unlinked inode, but given
> that we keep walking even for inodes where we can't find an allocation btree
> record that seems infeasible. On the other hand these inodes had their
> final unlink and thus were dead even before the system went down. There
> really is no point in adding them to the uncertain list and looking for
> references to them later.
You're making the assumption that log recovery has done the correct
thing any only replayed entire unlink transactions and hence the
filesystem is otherwise consistent (i.e that there are no other
references). I think that's a bad assumption - there's no guarantee
that the unlinked list only contains unreferenced inodes if there's
been corruption and/or log replay was not able to be run.
> So the simplest fix seems to be to simply remove the unlinked inode list
> walk and just clear it - when we rebuild the inode allocation btrees these
> will simply be marked free.
I also think there's more to it than that. The walk of the inode list
also marks all the blocks in the block map as containing inodes, and
all the blocks still used by those inodes as data/bmap/attr types.
This change removes that, so we're going to potentially lose that
state if all the inodes in a block are on the unlinked list.
Hence we'll end up with blocks containing inodes that are still
marked as used in the AGINO btree, but are marked as free space in
the block map. We'll also end up with data blocks that are otherwise
still used as not being marked as used, and that is especially
important for discovering multiply allocated blocks when a block has
been freed (e.g. just before unlink) and then immediately
reallocated and then the crash has left the state on disk
inconsistent....
IOWs, it seems to me that simply removing the walk has more
potential downsides in terms of error detection and tracking than it
provides in benefits. I suspect that just capping the number of
loops that can be executed is the simplest thing to do here. e.g.
allow it to loop for as many times as there are inodes allocated in
the AG or filesystem (e.g. agi->agi_count - agi->agi_free). Yes, it will
still spin for some time on this sort of corruption, but it won't
get stuck, and it won't add new holes into our block/inode usage
tracking...
The logical extension of this is that having a "unlinked inode
count" in the AGI would be really useful here. I'll add it to the
(growing) list of "things to add with CRC checking on-disk format
modifications".
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH repair: do not walk the unlinked inode list
2011-11-09 23:11 ` Dave Chinner
@ 2011-11-14 18:55 ` Christoph Hellwig
0 siblings, 0 replies; 3+ messages in thread
From: Christoph Hellwig @ 2011-11-14 18:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Stefan Pfetzing, xfs
On Thu, Nov 10, 2011 at 10:11:33AM +1100, Dave Chinner wrote:
> You're making the assumption that log recovery has done the correct
> thing any only replayed entire unlink transactions and hence the
> filesystem is otherwise consistent (i.e that there are no other
> references). I think that's a bad assumption - there's no guarantee
> that the unlinked list only contains unreferenced inodes if there's
> been corruption and/or log replay was not able to be run.
We add inodes to the uncertain list if any of the following applies
a) are found in an inode btree record reachable from the root in
phase2, but they are suspect based on certain factors - else
we add them to the inode tree directly.
b) are found on the unlinked inodes list in phase3
c) a directory found in an reachable inode btree record points to
them in phase3
so any inodes that either has a link pointing to it, or an inode
allocation btree record pointing to it will still be added to the
uncertain inode list if they aren't on the actual inode btree yet.
Then later in phase3 we move all uncertain inodes that appear fine
back into the main inode record tree.
> I also think there's more to it than that. The walk of the inode list
> also marks all the blocks in the block map as containing inodes, and
> all the blocks still used by those inodes as data/bmap/attr types.
> This change removes that, so we're going to potentially lose that
> state if all the inodes in a block are on the unlinked list.
We still do that walk if we have any genuine reference to the inode.
If we don't have any reference but the unlinked list they can be
considered free - we'd free every ressources assoicated with them
on log recovery anyway.
> Hence we'll end up with blocks containing inodes that are still
> marked as used in the AGINO btree, but are marked as free space in
> the block map.
They aren't. We completely rebuild both the inode allocation and
space allocation bitmaps from the information we gather in the earlier
repair phases, and they will be in sync.
> the AG or filesystem (e.g. agi->agi_count - agi->agi_free). Yes, it will
> still spin for some time on this sort of corruption, but it won't
> get stuck, and it won't add new holes into our block/inode usage
> tracking...
This would basically take forever with thinkgs like Arek's filesystem
with almost 11 million inodes in each AG.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2011-11-14 18:56 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-09 8:37 [PATCH repair: do not walk the unlinked inode list Christoph Hellwig
2011-11-09 23:11 ` Dave Chinner
2011-11-14 18:55 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox