public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: reinstate the iolock in xfs_readdir
@ 2013-12-03 21:29 Ben Myers
  2013-12-03 22:55 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Myers @ 2013-12-03 21:29 UTC (permalink / raw)
  To: xfs

Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in
xfs_readdir because we might have to read the extent list in from disk.  This
keeps other threads from reading from or writing to the extent list while it is
being read in and is still in a transitional state.

This has been associated with "Access to block zero" messages on directories
with large numbers of extents resulting from excessive filesytem fragmentation,
as well as extent list corruption.  Unfortunately no test case at this point.

Signed-off-by: Ben Myers <bpm@sgi.com>

---
 fs/xfs/xfs_dir2_readdir.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: b/fs/xfs/xfs_dir2_readdir.c
===================================================================
--- a/fs/xfs/xfs_dir2_readdir.c	2013-12-02 17:40:38.895185673 -0600
+++ b/fs/xfs/xfs_dir2_readdir.c	2013-12-02 17:40:49.025225554 -0600
@@ -674,6 +674,7 @@ xfs_readdir(
 {
 	int		rval;		/* return value */
 	int		v;		/* type-checking value */
+	uint		lock_mode;
 
 	trace_xfs_readdir(dp);
 
@@ -683,6 +684,7 @@ xfs_readdir(
 	ASSERT(S_ISDIR(dp->i_d.di_mode));
 	XFS_STATS_INC(xs_dir_getdents);
 
+	lock_mode = xfs_ilock_map_shared(dp);
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_getdents(dp, ctx);
 	else if ((rval = xfs_dir2_isblock(NULL, dp, &v)))
@@ -691,5 +693,7 @@ xfs_readdir(
 		rval = xfs_dir2_block_getdents(dp, ctx);
 	else
 		rval = xfs_dir2_leaf_getdents(dp, ctx, bufsize);
+	xfs_iunlock_map_shared(dp, lock_mode);
+
 	return rval;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir
  2013-12-03 21:29 [PATCH] xfs: reinstate the iolock in xfs_readdir Ben Myers
@ 2013-12-03 22:55 ` Dave Chinner
  2013-12-03 23:14   ` Ben Myers
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2013-12-03 22:55 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote:
> Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in
> xfs_readdir because we might have to read the extent list in from disk.  This
> keeps other threads from reading from or writing to the extent list while it is
> being read in and is still in a transitional state.
>
> This has been associated with "Access to block zero" messages on directories
> with large numbers of extents resulting from excessive filesytem fragmentation,
> as well as extent list corruption.  Unfortunately no test case at this point.

That commit was from 2007, so this is not a result of a recent
change. As it is, access to directories is completely
serialised by the VFS i_mutex. We don't need the iolock to read in
the extent list for a directory, because by the time we get to
readdir it is already guaranteed to be read in by this code:

STATIC int
xfs_dir_open(
        struct inode    *inode,
        struct file     *file)
{
        struct xfs_inode *ip = XFS_I(inode);
        int             mode;
        int             error;

        error = xfs_file_open(inode, file);
        if (error)
                return error;

        /*
         * If there are any blocks, read-ahead block 0 as we're almost
         * certain to have the next operation be a read there.
         */
        mode = xfs_ilock_map_shared(ip);
        if (ip->i_d.di_nextents > 0)
                xfs_dir3_data_readahead(NULL, ip, 0, -1);
        xfs_iunlock(ip, mode);
        return 0;
}

Which means that for a directory in block/leaf/node form the open of
it prior to the first readdir() call will read the extent tree into
memory and it will be pinned in memory for at least the life of the
file descriptor that is open on the directory.

If the directory is not in block form at open time, then the
first modification that takes it to block form will correctly lock
and initialise the extent tree, and concurrent readers will block on
the i_mutex at the VFS, not on the XFS iolock.

Now, if something is accessing the directory without going through
the VFS, there might be an issue, but all read accesses are via a
file descriptor of some kind and so should always have the extent
list initialised and read in correctly before reads occur due to the
code in xfs_dir_open() and serialisation via the i_mutex at the VFS
level.

So I don't see why holding the iolock during readdir is necessary,
nor how not holding it could lead to corrupt extent trees being seen
during directory operations. I need more information to understand
the issue being seen....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir
  2013-12-03 22:55 ` Dave Chinner
@ 2013-12-03 23:14   ` Ben Myers
  2013-12-04  0:10     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Myers @ 2013-12-03 23:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Dec 04, 2013 at 09:55:20AM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote:
> > Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in
> > xfs_readdir because we might have to read the extent list in from disk.  This
> > keeps other threads from reading from or writing to the extent list while it is
> > being read in and is still in a transitional state.
> >
> > This has been associated with "Access to block zero" messages on directories
> > with large numbers of extents resulting from excessive filesytem fragmentation,
> > as well as extent list corruption.  Unfortunately no test case at this point.
> 
> That commit was from 2007, so this is not a result of a recent
> change. As it is, access to directories is completely
> serialised by the VFS i_mutex. We don't need the iolock to read in
> the extent list for a directory, because by the time we get to
> readdir it is already guaranteed to be read in by this code:

Gah.  Typo in my subject line.  That's the ilock, not the iolock.  Whenever we
touch the extent list we want to have the ilock at some level.

> STATIC int
> xfs_dir_open(
>         struct inode    *inode,
>         struct file     *file)
> {
>         struct xfs_inode *ip = XFS_I(inode);
>         int             mode;
>         int             error;
> 
>         error = xfs_file_open(inode, file);
>         if (error)
>                 return error;
> 
>         /*
>          * If there are any blocks, read-ahead block 0 as we're almost
>          * certain to have the next operation be a read there.
>          */
>         mode = xfs_ilock_map_shared(ip);
>         if (ip->i_d.di_nextents > 0)
>                 xfs_dir3_data_readahead(NULL, ip, 0, -1);
>         xfs_iunlock(ip, mode);
>         return 0;
> }
> 
> Which means that for a directory in block/leaf/node form the open of
> it prior to the first readdir() call will read the extent tree into
> memory and it will be pinned in memory for at least the life of the
> file descriptor that is open on the directory.

That readahead _may_ read some of the blocks from disk but it doesn't copy them
into the incore extent list.  That's what we need to protect from other readers
who don't take the i_mutex.  Will see if I can find an example.

> If the directory is not in block form at open time, then the
> first modification that takes it to block form will correctly lock
> and initialise the extent tree, and concurrent readers will block on
> the i_mutex at the VFS, not on the XFS iolock.
> 
> Now, if something is accessing the directory without going through
> the VFS, there might be an issue, but all read accesses are via a
> file descriptor of some kind and so should always have the extent
> list initialised and read in correctly before reads occur due to the
> code in xfs_dir_open() and serialisation via the i_mutex at the VFS
> level.
> 
> So I don't see why holding the iolock during readdir is necessary,
> nor how not holding it could lead to corrupt extent trees being seen
> during directory operations. I need more information to understand
> the issue being seen....

Sorry for the confusion.  bbl.  ;)

-Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir
  2013-12-03 23:14   ` Ben Myers
@ 2013-12-04  0:10     ` Dave Chinner
  2013-12-04 13:08       ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2013-12-04  0:10 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Tue, Dec 03, 2013 at 05:14:15PM -0600, Ben Myers wrote:
> On Wed, Dec 04, 2013 at 09:55:20AM +1100, Dave Chinner wrote:
> > On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote:
> > > Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in
> > > xfs_readdir because we might have to read the extent list in from disk.  This
> > > keeps other threads from reading from or writing to the extent list while it is
> > > being read in and is still in a transitional state.
> > >
> > > This has been associated with "Access to block zero" messages on directories
> > > with large numbers of extents resulting from excessive filesytem fragmentation,
> > > as well as extent list corruption.  Unfortunately no test case at this point.
> > 
> > That commit was from 2007, so this is not a result of a recent
> > change. As it is, access to directories is completely
> > serialised by the VFS i_mutex. We don't need the iolock to read in
> > the extent list for a directory, because by the time we get to
> > readdir it is already guaranteed to be read in by this code:
> 
> Gah.  Typo in my subject line.  That's the ilock, not the iolock.  Whenever we
> touch the extent list we want to have the ilock at some level.

I just copied what you said. ;)

> > STATIC int
> > xfs_dir_open(
> >         struct inode    *inode,
> >         struct file     *file)
> > {
> >         struct xfs_inode *ip = XFS_I(inode);
> >         int             mode;
> >         int             error;
> > 
> >         error = xfs_file_open(inode, file);
> >         if (error)
> >                 return error;
> > 
> >         /*
> >          * If there are any blocks, read-ahead block 0 as we're almost
> >          * certain to have the next operation be a read there.
> >          */
> >         mode = xfs_ilock_map_shared(ip);
> >         if (ip->i_d.di_nextents > 0)
> >                 xfs_dir3_data_readahead(NULL, ip, 0, -1);
> >         xfs_iunlock(ip, mode);
> >         return 0;
> > }
> > 
> > Which means that for a directory in block/leaf/node form the open of
> > it prior to the first readdir() call will read the extent tree into
> > memory and it will be pinned in memory for at least the life of the
> > file descriptor that is open on the directory.
> 
> That readahead _may_ read some of the blocks from disk but it doesn't copy them
> into the incore extent list.

Readahead reads data blocks based on their file offset. How do we
find the data block to read from the file offset? It looks it up in
the extent list via xfs_bmapi_read(), which reads the entire extent
map from disk and populates the incore extent tree in it's entirity
if it is not present.  i.e.

xfs_dir3_data_readahead
  xfs_da_reada_buf
    xfs_dabuf_map
      if (mappedbno == -1)
	xfs_bmapi_read
	  if (!XFS_IFEXTENTS)
	    xfs_iread_extents
	      set XFS_IFEXTENTS
	      xfs_bmap_read_extents
		<reads extent tree from disk>
		<populate incore extent list>
	  <map offset 0 to filesystem block>
      mappedbno = filesystem block from xfs_bmapi_read
    xfs_buf_reada(mappedbno)
      (reads directory data from disk)

Hence xfs_dir_open() has the correct locking and ensures that we
don't need to use the extent list lock for any read operations
through the VFS because all read operations are serialised against
modifications to the directory the i_mutex at the VFS.

> That's what we need to protect from other readers
> who don't take the i_mutex.  Will see if I can find an example.

The only way I can see there being a problem is if the directory
extent tree is being modified without the directory i_mutex being
held. AFAICT, that can't happen from the VFS, so maybe there's
something that is coming from the ioctl interfaces - the only thing
I can see there is xfs_attrmulti_attr_set() causing a directory
inode data fork to move from extent to btree format as an attribute
fork is added, but that wouldn't cause problems with existing
directories with large extent counts.

Also, nothing in xfstests/xfsprogs uses XFS_IOC_ATTRMULTI_BY_HANDLE
or the libhandle functions that call this ioctl, so it would seem to
me that the only thing that might use this function to set
attributes on files is DMF.....

Hmmm, that makes me wonder about whether this is a CXFS problem,
because it has hooks deep in the XFS code below the VFS. Last time I
looked, the MDS completely bypassed the VFS for directory
modifications driven by clients. That would explain why nobody has
seen this problem on mainline kernels for the best part of 6 years,
and also explain why the ilock is necessary across the readdir
operation to prevent the extent list problems from being seen....

Ben - if this is a DMF or CXFS problem, just say so. Some of us know
an awful lot about how those products work, so there's no point
wasting my time trying to dance around why a change needs to be made
without mentioning those products.

The simple fact is that if we ever want to do concurrent directory
read operations, we have to take the ilock in readdir() to ensure we
can serialise correctly against modifications because the i_mutex
can't be used to do that. So, really, I'm not against the fix you
proposed - I'm just trying to understand why it is necessary right
now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir
  2013-12-04  0:10     ` Dave Chinner
@ 2013-12-04 13:08       ` Christoph Hellwig
  2013-12-05  0:12         ` Ben Myers
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2013-12-04 13:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ben Myers, xfs

On Wed, Dec 04, 2013 at 11:10:30AM +1100, Dave Chinner wrote:
> The simple fact is that if we ever want to do concurrent directory
> read operations, we have to take the ilock in readdir() to ensure we
> can serialise correctly against modifications because the i_mutex
> can't be used to do that. So, really, I'm not against the fix you
> proposed - I'm just trying to understand why it is necessary right
> now....

Some comments from the person who authered that old comment that removed
the ilock:

 - as far as I can tell that was not intentional.  I'm usually pretty
   good at recording such things in the changelog if I do it
   intentionally.
 - relying on the open function to read in the extent list seems
   potentially dangerous.  We generally try to make sure all the
   functions using it read it in if needed including the proper
   locking.  If we want to avoid that for some reason like in the
   writeback path we at least comment it and put asserts in.
 - like Dave pointed out I think things should just work for mainline
   modulo maybe the attr by handle ioctl, but it seems by accident.
   Figuring out what issues Ben sees would be useful.
 - instead of putting the ilock back at the highest level we might want
   to push it down to the dir2 data code and only cover the actual
   critical region where we need it.
 - xfs_iread_extents really needs an assert to make sure the ilock
   is held.
 - xfs_bmapi_read probably should have an assert as well to make
   sure the ilock is held in some way

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: reinstate the iolock in xfs_readdir
  2013-12-04 13:08       ` Christoph Hellwig
@ 2013-12-05  0:12         ` Ben Myers
  0 siblings, 0 replies; 6+ messages in thread
From: Ben Myers @ 2013-12-05  0:12 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner; +Cc: xfs

Hey Fellas,

On Wed, Dec 04, 2013 at 05:08:48AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 04, 2013 at 11:10:30AM +1100, Dave Chinner wrote:
> > The simple fact is that if we ever want to do concurrent directory
> > read operations, we have to take the ilock in readdir() to ensure we
> > can serialise correctly against modifications because the i_mutex
> > can't be used to do that. So, really, I'm not against the fix you
> > proposed - I'm just trying to understand why it is necessary right
> > now....

I understand.  ;)

Dave, you are correct in guessing that this bug was found with CXFS.  No DMF on
this one though.  Unfortunately it is not reproduceable at will.

> Some comments from the person who authered that old comment that removed
> the ilock:
> 
>  - as far as I can tell that was not intentional.  I'm usually pretty
>    good at recording such things in the changelog if I do it
>    intentionally.

That was my impression too.

>  - relying on the open function to read in the extent list seems
>    potentially dangerous.

...because open doesn't take i_mutex, I think.  So today even though
xfs_dir_open does take the ilock to read in the extent list (D'oh, I'd missed
that it reads it in), there is no exclusion with readdir.

>    We generally try to make sure all the
>    functions using it read it in if needed including the proper
>    locking.  If we want to avoid that for some reason like in the
>    writeback path we at least comment it and put asserts in.

>  - like Dave pointed out I think things should just work for mainline
>    modulo maybe the attr by handle ioctl, but it seems by accident.
>    Figuring out what issues Ben sees would be useful.

- Logs indicating 'Access to block zero' on directories.
- Forced shutdowns:
	In xfs_free_ag_extent, XFS_WANT_CORRUPTED_GOTO when freeing an extent
	we find a section of it already in the freespace trees
- Log replay failures
	xlog_recover_finish
	  xlog_recover_process_efis
	    xlog_recover_process_efi
	      xfs_free_extent
	        xfs_free_ag_extent	that's the same thing
- A report that removal of files triggers the forced shutdown.

There was a similar bug related to reading in the extent list for regular files
which is not necessary mainline.  That had XFS_WANT_CORRUPTED_GOTO in
xfs_bmap_add_extent_delay_real, having found an extent in the bmbt that was
overlapping the one it was trying to convert.  Maybe that is not related to
this readdir business.

We found that files with the corruption generally had very large extent lists
with small extents, and that xfs_repair wasn't cleaning up this type of
corruption.

>  - instead of putting the ilock back at the highest level we might want
>    to push it down to the dir2 data code and only cover the actual
>    critical region where we need it.

I agree it's worth considering but I'm not sure that it can be pushed much
further down.  

>  - xfs_iread_extents really needs an assert to make sure the ilock
>    is held.
>  - xfs_bmapi_read probably should have an assert as well to make
>    sure the ilock is held in some way

I agree with adding the asserts, and I think the ilock should be reinstated
mainline in readdir.  We do seem to have some reports on the list of freespace
btree corruption and XFS_WANT_CORRUPTED_GOTO that are not resolved.

Anyway, the open and 'attr by handle' paths do seem suspect.  Looks like we're
ok in fallocate because xfs doesn't allow that on directories.  I'll keep
poking around for extent list readers that don't take the i_mutex...

-Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-12-05  0:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-03 21:29 [PATCH] xfs: reinstate the iolock in xfs_readdir Ben Myers
2013-12-03 22:55 ` Dave Chinner
2013-12-03 23:14   ` Ben Myers
2013-12-04  0:10     ` Dave Chinner
2013-12-04 13:08       ` Christoph Hellwig
2013-12-05  0:12         ` Ben Myers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox