From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: with ECARTIS (v1.0.0; list xfs); Mon, 01 Sep 2008 21:39:50 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
	by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m824dfum005288
	for ; Mon, 1 Sep 2008 21:39:42 -0700
Received: from [134.14.55.78] (redback.melbourne.sgi.com [134.14.55.78])
	by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA02593
	for ; Tue, 2 Sep 2008 14:41:06 +1000
Message-ID: <48BCC5B1.7080300@sgi.com>
Date: Tue, 02 Sep 2008 14:48:49 +1000
From: Lachlan McIlroy
Reply-To: lachlan@sgi.com
MIME-Version: 1.0
Subject: Filesystem corruption writing out unlinked inodes
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
List-Id: xfs
To: xfs@oss.sgi.com

I've been looking into a case of filesystem corruption and found that
we are flushing unlinked inodes after the inode cluster has been freed -
and potentially reallocated as something else. The case happens when we
unlink the last inode in a cluster and that triggers the cluster to be
released.

The code path of interest here is:

xfs_fs_clear_inode()
  ->xfs_inactive()
    ->xfs_ifree()
      ->xfs_ifree_cluster()
  ->xfs_reclaim()
    -> queues inode on deleted inodes list

... and later on

xfs_syncsub()
  ->xfs_finish_reclaim_all()
    ->xfs_finish_reclaim()
      ->xfs_iflush()

When the inode is unlinked it gets logged in a transaction, so
xfs_iflush() considers it dirty and writes it out, but by this time the
cluster has been reallocated. If the cluster is reallocated as user data
then the checks in xfs_imap_to_bp will complain because the inode magic
will be incorrect, but if the cluster is reallocated as another inode
cluster then these checks won't detect that.

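To make the failure mode concrete, here is a rough toy model of the late flush. It is only a sketch: every toy_* name, the struct layouts and the return codes are invented for illustration and are not the real XFS structures or functions. The only detail taken from XFS is the on-disk inode magic value (0x494e, "IN"), and the magic comparison stands in for the kind of check xfs_imap_to_bp does.

```c
/* Toy model only: the toy_* types and helpers are hypothetical, not XFS code. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_INODE_MAGIC    0x494e  /* "IN", the on-disk inode magic */
#define TOY_CLUSTER_INODES 4

struct toy_dinode {                /* one on-disk inode slot */
	uint16_t magic;
	uint32_t ino;              /* hypothetical: which inode owns the slot */
	uint32_t nlink;
};

struct toy_cluster {               /* disk blocks backing one inode cluster */
	struct toy_dinode slot[TOY_CLUSTER_INODES];
};

struct toy_incore {                /* in-core inode waiting on the reclaim list */
	uint32_t ino;
	uint32_t nlink;
	int slot;                  /* slot it used to occupy in the cluster */
	int dirty;                 /* set because the unlink was logged */
};

/* The freed blocks are reused for user data: arbitrary bytes, no inode magic. */
void toy_realloc_as_data(struct toy_cluster *c)
{
	memset(c, 0xab, sizeof(*c));
}

/* The freed blocks are reused for a brand new inode cluster. */
void toy_realloc_as_inodes(struct toy_cluster *c, uint32_t first_ino)
{
	for (int i = 0; i < TOY_CLUSTER_INODES; i++) {
		c->slot[i].magic = TOY_INODE_MAGIC;
		c->slot[i].ino = first_ino + (uint32_t)i;
		c->slot[i].nlink = 1;
	}
}

/*
 * Late flush of the old in-core inode.
 * Returns 1 if the slot was written, 0 if the inode was clean,
 * -1 if the magic check rejected the buffer.
 */
int toy_flush(struct toy_cluster *c, const struct toy_incore *ip)
{
	assert(ip->slot >= 0 && ip->slot < TOY_CLUSTER_INODES);
	struct toy_dinode *d = &c->slot[ip->slot];

	if (!ip->dirty)
		return 0;
	if (d->magic != TOY_INODE_MAGIC)
		return -1;         /* user data case: the check complains */

	/* New inode cluster case: the magic matches, so we silently clobber. */
	d->ino = ip->ino;
	d->nlink = ip->nlink;
	return 1;
}
```

In this model, flushing after toy_realloc_as_data() returns -1, while flushing after toy_realloc_as_inodes() returns 1 and overwrites a slot that now belongs to a different inode, which is the undetected corruption described above.
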
I modified xfs_iflush() to bail out if we try to flush an unlinked inode
(i.e. nlink == 0) and that avoids the corruption, but xfs_repair now has
problems with inodes marked as free but with non-zero nlink counts.

Do we really want to write out unlinked inodes? Seems a bit redundant.

Other options could be to delay the release of the inode cluster until
the inode has been flushed, or to move the flush into xfs_ifree() before
releasing the cluster. Looking at xfs_ifree_cluster(), it scans the
inodes in a cluster and tries to lock them and mark them stale - maybe
we can leverage this and avoid flushing staled inodes. If so we'd need
to tighten up the locking.

Does anyone have suggestions on which direction we should take?

Lachlan
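
For the "avoid flushing staled inodes" idea, a minimal sketch of the intended behaviour might look like the following. Again this is a toy model under stated assumptions: the toy_* names and the lock_held flag are hypothetical stand-ins, not the real inode flags or locking primitives. It only shows why a best-effort lock attempt while marking inodes stale leaves a window open.

```c
/* Toy model only: the toy_* names and fields are hypothetical, not XFS code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	bool dirty;       /* logged in a transaction, so a flush would write it */
	bool stale;       /* set when its cluster has been freed */
	bool lock_held;   /* someone else holds the inode lock right now */
};

/*
 * Free a cluster: try to lock each in-core inode and mark it stale.
 * Inodes whose lock cannot be taken are skipped.
 * Returns how many inodes were missed.
 */
size_t toy_free_cluster(struct toy_inode *inodes, size_t n)
{
	size_t missed = 0;

	for (size_t i = 0; i < n; i++) {
		if (inodes[i].lock_held) {
			missed++;          /* trylock failed: not marked stale */
			continue;
		}
		inodes[i].stale = true;
	}
	return missed;
}

/* Flush decision: write only inodes that are dirty and not stale. */
bool toy_should_flush(const struct toy_inode *ip)
{
	assert(ip != NULL);
	return ip->dirty && !ip->stale;
}
```

Any inode counted in the return value of toy_free_cluster() would still pass toy_should_flush() later, which is why the locking would need tightening before a stale check could be relied on.
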