* Review: get_bulkall() returns incorrect inode state
@ 2007-09-21 6:35 Vlad Apostolov
0 siblings, 0 replies; only message in thread
From: Vlad Apostolov @ 2007-09-21 6:35 UTC (permalink / raw)
To: xfs-dev; +Cc: xfs
[-- Attachment #1: Type: text/plain, Size: 2353 bytes --]
In the following scenario xfs_bulkstat() returns incorrect
stale inode state:
1. File_A is created and its inode synced to disk.
2. File_A is unlinked and doesn't exist anymore.
3. Filesystem sync is invoked.
4. File_B is created. File_B happens to reclaim File_A's inode.
5. xfs_bulkstat() is called and detects File_B but reports the
incorrect File_A inode state.
Explanation for the incorrect inode state is that inodes are not immediately
synced on file create for performance reasons. This leaves the on-disk inode
buffer uninitialized (or with old state from a previous generation inode)
and this is what xfs_bulkstat() would report.
Solutions:
One solution provided by the attached patch is to mark the on-disk
inode buffer "dirty" on unlink. When the inode is reclaimed
(by a new file create), xfs_bulkstat() would filter this inode by the
"dirty" mark. Once the inode is flushed to disk, the on-disk buffer "dirty"
mark is automatically removed and a following xfs_bulkstat() would return
the correct inode state.
A better solution would be if the inode is immediately synced at create.
This would make the on-disk inode buffer up to date and xfs_bulkstat()
should
return the correct inode state.
Disadvantages of the above solutions:
The first solution marks the inode "dirty" on unlink by setting the on-disk
di_nlink field to 0. Note that the in-core di_nlink has already been set to
0 and a corresponding transaction logged by xfs_droplink().
This is an exception from the rule that any on-disk inode buffer changes
has to be followed by a disk write (inode flush).
Synchronizing the in-core to on-disk di_nlink values in advance (before the
actual inode flush to disk) should be fine in this case because the inode
is already unlinked and it would never change its di_nlink again for this
inode generation.
The second solution would be a performance hit as it would require disk
write for each new file create to flush the inode to disk. Dave suggested
modifying the inode sync code to allow on-disk buffer updates without
disk writes. This doesn't seem to be very simple and would require some
time for a proper implementation and testing. I have been assigned to
do it in the near future.
In the mean time we have XFS users that urgently need a solution
and the patch bellow seems to fix the problem for them.
Regards,
Vlad
[-- Attachment #2: get_bulkall-appears-to-return-stale-information.patch --]
[-- Type: text/x-patch, Size: 2529 bytes --]
Index: linux-xfs1/fs/xfs/xfs_inode.c
===================================================================
--- linux-xfs1.orig/fs/xfs/xfs_inode.c
+++ linux-xfs1/fs/xfs/xfs_inode.c
@@ -1952,6 +1952,25 @@ xfs_iunlink(
bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
ASSERT(agi->agi_unlinked[bucket_index]);
ASSERT(be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino);
+ error = xfs_itobp(mp, tp, ip, &dip, &ibp, 0, 0);
+
+ if (error) {
+ return error;
+ }
+
+ /*
+ * Clear the on-disk di_nlink. This is to prevent xfs_bulkstat
+ * from picking up this inode when it is reclaimed (its incore state
+ * initialzed but not flushed to disk yet). The in-core di_nlink is
+ * already cleared in xfs_droplink() and a corresponding transaction
+ * logged. The hack here just synchronizes the in-core to on-disk
+ * di_nlink value in advance before the actual inode sync to disk.
+ * This is OK because the inode is already unlinked and would never
+ * change its di_nlink again for this inode generation.
+ * This is a temporary hack that would require a proper fix
+ * in the future.
+ */
+ dip->di_core.di_nlink = 0;
if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != NULLAGINO) {
/*
@@ -1960,10 +1979,6 @@ xfs_iunlink(
* Here we put the head pointer into our next pointer,
* and then we fall through to point the head at us.
*/
- error = xfs_itobp(mp, tp, ip, &dip, &ibp, 0, 0);
- if (error) {
- return error;
- }
ASSERT(be32_to_cpu(dip->di_next_unlinked) == NULLAGINO);
/* both on-disk, don't endian flip twice */
dip->di_next_unlinked = agi->agi_unlinked[bucket_index];
Index: linux-xfs1/fs/xfs/xfs_itable.c
===================================================================
--- linux-xfs1.orig/fs/xfs/xfs_itable.c
+++ linux-xfs1/fs/xfs/xfs_itable.c
@@ -290,8 +290,16 @@ xfs_bulkstat_use_dinode(
return 1;
dip = (xfs_dinode_t *)
xfs_buf_offset(bp, clustidx << mp->m_sb.sb_inodelog);
+ /*
+ * Check the buffer containing the on-disk inode for di_nlink == 0.
+ * This is to prevent xfs_bulkstat from picking up just reclaimed
+ * inodes that have their in-core state initialized but not flushed
+ * to disk yet. This is a temporary hack that would require a proper
+ * fix in the future.
+ */
if (be16_to_cpu(dip->di_core.di_magic) != XFS_DINODE_MAGIC ||
- !XFS_DINODE_GOOD_VERSION(dip->di_core.di_version))
+ !XFS_DINODE_GOOD_VERSION(dip->di_core.di_version) ||
+ !dip->di_core.di_nlink)
return 0;
if (flags & BULKSTAT_FG_QUICK) {
*dipp = dip;
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2007-09-21 6:34 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-21 6:35 Review: get_bulkall() returns incorrect inode state Vlad Apostolov
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.