* [PATCH] xfs: reinstate the iolock in xfs_readdir
@ 2013-12-03 21:29 Ben Myers
2013-12-03 22:55 ` Dave Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Ben Myers @ 2013-12-03 21:29 UTC (permalink / raw)
To: xfs
Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in
xfs_readdir because we might have to read the extent list in from disk. This
keeps other threads from reading from or writing to the extent list while it is
being read in and is still in a transitional state.
This has been associated with "Access to block zero" messages on directories
with large numbers of extents resulting from excessive filesytem fragmentation,
as well as extent list corruption. Unfortunately no test case at this point.
Signed-off-by: Ben Myers <bpm@sgi.com>
---
fs/xfs/xfs_dir2_readdir.c | 4 ++++
1 file changed, 4 insertions(+)
Index: b/fs/xfs/xfs_dir2_readdir.c
===================================================================
--- a/fs/xfs/xfs_dir2_readdir.c 2013-12-02 17:40:38.895185673 -0600
+++ b/fs/xfs/xfs_dir2_readdir.c 2013-12-02 17:40:49.025225554 -0600
@@ -674,6 +674,7 @@ xfs_readdir(
{
int rval; /* return value */
int v; /* type-checking value */
+ uint lock_mode;
trace_xfs_readdir(dp);
@@ -683,6 +684,7 @@ xfs_readdir(
ASSERT(S_ISDIR(dp->i_d.di_mode));
XFS_STATS_INC(xs_dir_getdents);
+ lock_mode = xfs_ilock_map_shared(dp);
if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
rval = xfs_dir2_sf_getdents(dp, ctx);
else if ((rval = xfs_dir2_isblock(NULL, dp, &v)))
@@ -691,5 +693,7 @@ xfs_readdir(
rval = xfs_dir2_block_getdents(dp, ctx);
else
rval = xfs_dir2_leaf_getdents(dp, ctx, bufsize);
+ xfs_iunlock_map_shared(dp, lock_mode);
+
return rval;
}
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir 2013-12-03 21:29 [PATCH] xfs: reinstate the iolock in xfs_readdir Ben Myers @ 2013-12-03 22:55 ` Dave Chinner 2013-12-03 23:14 ` Ben Myers 0 siblings, 1 reply; 6+ messages in thread From: Dave Chinner @ 2013-12-03 22:55 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote: > Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in > xfs_readdir because we might have to read the extent list in from disk. This > keeps other threads from reading from or writing to the extent list while it is > being read in and is still in a transitional state. > > This has been associated with "Access to block zero" messages on directories > with large numbers of extents resulting from excessive filesytem fragmentation, > as well as extent list corruption. Unfortunately no test case at this point. That commit was from 2007, so this is not a result of a recent change. As it is, access to directories is completely serialised by the VFS i_mutex. We don't need the iolock to read in the extent list for a directory, because by the time we get to readdir it is already guaranteed to be read in by this code: STATIC int xfs_dir_open( struct inode *inode, struct file *file) { struct xfs_inode *ip = XFS_I(inode); int mode; int error; error = xfs_file_open(inode, file); if (error) return error; /* * If there are any blocks, read-ahead block 0 as we're almost * certain to have the next operation be a read there. */ mode = xfs_ilock_map_shared(ip); if (ip->i_d.di_nextents > 0) xfs_dir3_data_readahead(NULL, ip, 0, -1); xfs_iunlock(ip, mode); return 0; } Which means that for a directory in block/leaf/node form the open of it prior to the first readdir() call will read the extent tree into memory and it will be pinned in memory for at least the life of the file descriptor that is open on the directory. If the directory is not in block form at open time, then the first modification that takes it to block form will correctly lock and initialise the extent tree, and concurrent readers will block on the i_mutex at the VFS, not on the XFS iolock. Now, if something is accessing the directory without going through the VFS, there might be an issue, but all read accesses are via a file descriptor of some kind and so should always have the extent list initialised and read in correctly before reads occur due to the code in xfs_dir_open() and serialisation via the i_mutex at the VFS level. So I don't see why holding the iolock during readdir is necessary, nor how not holding it could lead to corrupt extent trees being seen during directory operations. I need more information to understand the issue being seen.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir 2013-12-03 22:55 ` Dave Chinner @ 2013-12-03 23:14 ` Ben Myers 2013-12-04 0:10 ` Dave Chinner 0 siblings, 1 reply; 6+ messages in thread From: Ben Myers @ 2013-12-03 23:14 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Wed, Dec 04, 2013 at 09:55:20AM +1100, Dave Chinner wrote: > On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote: > > Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in > > xfs_readdir because we might have to read the extent list in from disk. This > > keeps other threads from reading from or writing to the extent list while it is > > being read in and is still in a transitional state. > > > > This has been associated with "Access to block zero" messages on directories > > with large numbers of extents resulting from excessive filesytem fragmentation, > > as well as extent list corruption. Unfortunately no test case at this point. > > That commit was from 2007, so this is not a result of a recent > change. As it is, access to directories is completely > serialised by the VFS i_mutex. We don't need the iolock to read in > the extent list for a directory, because by the time we get to > readdir it is already guaranteed to be read in by this code: Gah. Typo in my subject line. That's the ilock, not the iolock. Whenever we touch the extent list we want to have the ilock at some level. > STATIC int > xfs_dir_open( > struct inode *inode, > struct file *file) > { > struct xfs_inode *ip = XFS_I(inode); > int mode; > int error; > > error = xfs_file_open(inode, file); > if (error) > return error; > > /* > * If there are any blocks, read-ahead block 0 as we're almost > * certain to have the next operation be a read there. > */ > mode = xfs_ilock_map_shared(ip); > if (ip->i_d.di_nextents > 0) > xfs_dir3_data_readahead(NULL, ip, 0, -1); > xfs_iunlock(ip, mode); > return 0; > } > > Which means that for a directory in block/leaf/node form the open of > it prior to the first readdir() call will read the extent tree into > memory and it will be pinned in memory for at least the life of the > file descriptor that is open on the directory. That readahead _may_ read some of the blocks from disk but it doesn't copy them into the incore extent list. That's what we need to protect from other readers who don't take the i_mutex. Will see if I can find an example. > If the directory is not in block form at open time, then the > first modification that takes it to block form will correctly lock > and initialise the extent tree, and concurrent readers will block on > the i_mutex at the VFS, not on the XFS iolock. > > Now, if something is accessing the directory without going through > the VFS, there might be an issue, but all read accesses are via a > file descriptor of some kind and so should always have the extent > list initialised and read in correctly before reads occur due to the > code in xfs_dir_open() and serialisation via the i_mutex at the VFS > level. > > So I don't see why holding the iolock during readdir is necessary, > nor how not holding it could lead to corrupt extent trees being seen > during directory operations. I need more information to understand > the issue being seen.... Sorry for the confusion. bbl. ;) -Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir 2013-12-03 23:14 ` Ben Myers @ 2013-12-04 0:10 ` Dave Chinner 2013-12-04 13:08 ` Christoph Hellwig 0 siblings, 1 reply; 6+ messages in thread From: Dave Chinner @ 2013-12-04 0:10 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Tue, Dec 03, 2013 at 05:14:15PM -0600, Ben Myers wrote: > On Wed, Dec 04, 2013 at 09:55:20AM +1100, Dave Chinner wrote: > > On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote: > > > Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in > > > xfs_readdir because we might have to read the extent list in from disk. This > > > keeps other threads from reading from or writing to the extent list while it is > > > being read in and is still in a transitional state. > > > > > > This has been associated with "Access to block zero" messages on directories > > > with large numbers of extents resulting from excessive filesytem fragmentation, > > > as well as extent list corruption. Unfortunately no test case at this point. > > > > That commit was from 2007, so this is not a result of a recent > > change. As it is, access to directories is completely > > serialised by the VFS i_mutex. We don't need the iolock to read in > > the extent list for a directory, because by the time we get to > > readdir it is already guaranteed to be read in by this code: > > Gah. Typo in my subject line. That's the ilock, not the iolock. Whenever we > touch the extent list we want to have the ilock at some level. I just copied what you said. ;) > > STATIC int > > xfs_dir_open( > > struct inode *inode, > > struct file *file) > > { > > struct xfs_inode *ip = XFS_I(inode); > > int mode; > > int error; > > > > error = xfs_file_open(inode, file); > > if (error) > > return error; > > > > /* > > * If there are any blocks, read-ahead block 0 as we're almost > > * certain to have the next operation be a read there. > > */ > > mode = xfs_ilock_map_shared(ip); > > if (ip->i_d.di_nextents > 0) > > xfs_dir3_data_readahead(NULL, ip, 0, -1); > > xfs_iunlock(ip, mode); > > return 0; > > } > > > > Which means that for a directory in block/leaf/node form the open of > > it prior to the first readdir() call will read the extent tree into > > memory and it will be pinned in memory for at least the life of the > > file descriptor that is open on the directory. > > That readahead _may_ read some of the blocks from disk but it doesn't copy them > into the incore extent list. Readahead reads data blocks based on their file offset. How do we find the data block to read from the file offset? It looks it up in the extent list via xfs_bmapi_read(), which reads the entire extent map from disk and populates the incore extent tree in it's entirity if it is not present. i.e. xfs_dir3_data_readahead xfs_da_reada_buf xfs_dabuf_map if (mappedbno == -1) xfs_bmapi_read if (!XFS_IFEXTENTS) xfs_iread_extents set XFS_IFEXTENTS xfs_bmap_read_extents <reads extent tree from disk> <populate incore extent list> <map offset 0 to filesystem block> mappedbno = filesystem block from xfs_bmapi_read xfs_buf_reada(mappedbno) (reads directory data from disk) Hence xfs_dir_open() has the correct locking and ensures that we don't need to use the extent list lock for any read operations through the VFS because all read operations are serialised against modifications to the directory the i_mutex at the VFS. > That's what we need to protect from other readers > who don't take the i_mutex. Will see if I can find an example. The only way I can see there being a problem is if the directory extent tree is being modified without the directory i_mutex being held. AFAICT, that can't happen from the VFS, so maybe there's something that is coming from the ioctl interfaces - the only thing I can see there is xfs_attrmulti_attr_set() causing a directory inode data fork to move from extent to btree format as an attribute fork is added, but that wouldn't cause problems with existing directories with large extent counts. Also, nothing in xfstests/xfsprogs uses XFS_IOC_ATTRMULTI_BY_HANDLE or the libhandle functions that call this ioctl, so it would seem to me that the only thing that might use this function to set attributes on files is DMF..... Hmmm, that makes me wonder about whether this is a CXFS problem, because it has hooks deep in the XFS code below the VFS. Last time I looked, the MDS completely bypassed the VFS for directory modifications driven by clients. That would explain why nobody has seen this problem on mainline kernels for the best part of 6 years, and also explain why the ilock is necessary across the readdir operation to prevent the extent list problems from being seen.... Ben - if this is a DMF or CXFS problem, just say so. Some of us know an awful lot about how those products work, so there's no point wasting my time trying to dance around why a change needs to be made without mentioning those products. The simple fact is that if we ever want to do concurrent directory read operations, we have to take the ilock in readdir() to ensure we can serialise correctly against modifications because the i_mutex can't be used to do that. So, really, I'm not against the fix you proposed - I'm just trying to understand why it is necessary right now.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir 2013-12-04 0:10 ` Dave Chinner @ 2013-12-04 13:08 ` Christoph Hellwig 2013-12-05 0:12 ` Ben Myers 0 siblings, 1 reply; 6+ messages in thread From: Christoph Hellwig @ 2013-12-04 13:08 UTC (permalink / raw) To: Dave Chinner; +Cc: Ben Myers, xfs On Wed, Dec 04, 2013 at 11:10:30AM +1100, Dave Chinner wrote: > The simple fact is that if we ever want to do concurrent directory > read operations, we have to take the ilock in readdir() to ensure we > can serialise correctly against modifications because the i_mutex > can't be used to do that. So, really, I'm not against the fix you > proposed - I'm just trying to understand why it is necessary right > now.... Some comments from the person who authered that old comment that removed the ilock: - as far as I can tell that was not intentional. I'm usually pretty good at recording such things in the changelog if I do it intentionally. - relying on the open function to read in the extent list seems potentially dangerous. We generally try to make sure all the functions using it read it in if needed including the proper locking. If we want to avoid that for some reason like in the writeback path we at least comment it and put asserts in. - like Dave pointed out I think things should just work for mainline modulo maybe the attr by handle ioctl, but it seems by accident. Figuring out what issues Ben sees would be useful. - instead of putting the ilock back at the highest level we might want to push it down to the dir2 data code and only cover the actual critical region where we need it. - xfs_iread_extents really needs an assert to make sure the ilock is held. - xfs_bmapi_read probably should have an assert as well to make sure the ilock is held in some way _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir 2013-12-04 13:08 ` Christoph Hellwig @ 2013-12-05 0:12 ` Ben Myers 0 siblings, 0 replies; 6+ messages in thread From: Ben Myers @ 2013-12-05 0:12 UTC (permalink / raw) To: Christoph Hellwig, Dave Chinner; +Cc: xfs Hey Fellas, On Wed, Dec 04, 2013 at 05:08:48AM -0800, Christoph Hellwig wrote: > On Wed, Dec 04, 2013 at 11:10:30AM +1100, Dave Chinner wrote: > > The simple fact is that if we ever want to do concurrent directory > > read operations, we have to take the ilock in readdir() to ensure we > > can serialise correctly against modifications because the i_mutex > > can't be used to do that. So, really, I'm not against the fix you > > proposed - I'm just trying to understand why it is necessary right > > now.... I understand. ;) Dave, you are correct in guessing that this bug was found with CXFS. No DMF on this one though. Unfortunately it is not reproduceable at will. > Some comments from the person who authered that old comment that removed > the ilock: > > - as far as I can tell that was not intentional. I'm usually pretty > good at recording such things in the changelog if I do it > intentionally. That was my impression too. > - relying on the open function to read in the extent list seems > potentially dangerous. ...because open doesn't take i_mutex, I think. So today even though xfs_dir_open does take the ilock to read in the extent list (D'oh, I'd missed that it reads it in), there is no exclusion with readdir. > We generally try to make sure all the > functions using it read it in if needed including the proper > locking. If we want to avoid that for some reason like in the > writeback path we at least comment it and put asserts in. > - like Dave pointed out I think things should just work for mainline > modulo maybe the attr by handle ioctl, but it seems by accident. > Figuring out what issues Ben sees would be useful. - Logs indicating 'Access to block zero' on directories. - Forced shutdowns: In xfs_free_ag_extent, XFS_WANT_CORRUPTED_GOTO when freeing an extent we find a section of it already in the freespace trees - Log replay failures xlog_recover_finish xlog_recover_process_efis xlog_recover_process_efi xfs_free_extent xfs_free_ag_extent that's the same thing - A report that removal of files triggers the forced shutdown. There was a similar bug related to reading in the extent list for regular files which is not necessary mainline. That had XFS_WANT_CORRUPTED_GOTO in xfs_bmap_add_extent_delay_real, having found an extent in the bmbt that was overlapping the one it was trying to convert. Maybe that is not related to this readdir business. We found that files with the corruption generally had very large extent lists with small extents, and that xfs_repair wasn't cleaning up this type of corruption. > - instead of putting the ilock back at the highest level we might want > to push it down to the dir2 data code and only cover the actual > critical region where we need it. I agree it's worth considering but I'm not sure that it can be pushed much further down. > - xfs_iread_extents really needs an assert to make sure the ilock > is held. > - xfs_bmapi_read probably should have an assert as well to make > sure the ilock is held in some way I agree with adding the asserts, and I think the ilock should be reinstated mainline in readdir. We do seem to have some reports on the list of freespace btree corruption and XFS_WANT_CORRUPTED_GOTO that are not resolved. Anyway, the open and 'attr by handle' paths do seem suspect. Looks like we're ok in fallocate because xfs doesn't allow that on directories. I'll keep poking around for extent list readers that don't take the i_mutex... -Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2013-12-05 0:12 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-03 21:29 [PATCH] xfs: reinstate the iolock in xfs_readdir Ben Myers 2013-12-03 22:55 ` Dave Chinner 2013-12-03 23:14 ` Ben Myers 2013-12-04 0:10 ` Dave Chinner 2013-12-04 13:08 ` Christoph Hellwig 2013-12-05 0:12 ` Ben Myers
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox