public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-05 16:00 Jan Kara
  2012-03-05 16:01 ` [PATCH 11/19] xfs: Convert to new freezing code Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2012-03-05 16:00 UTC (permalink / raw)
  To: LKML
  Cc: Jan Kara, J. Bruce Fields, cluster-devel, ocfs2-devel,
	KONISHI Ryusuke, OGAWA Hirofumi, sandeen, linux-nilfs,
	Miklos Szeredi, Christoph Hellwig, Anton Altaparmakov, linux-ext4,
	fuse-devel, Mark Fasheh, xfs, Ben Myers, Joel Becker, dchinner,
	Steven Whitehouse, Chris Mason, linux-nfs, Alex Elder,
	Theodore Ts'o, linux-ntfs-dev, Kamal Mostafa, Al Viro,
	linux-fsdevel, David S. Miller, linux-btrfs

  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder@kernel.org>
CC: Anton Altaparmakov <anton@tuxera.com>
CC: Ben Myers <bpm@sgi.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: cluster-devel@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: fuse-devel@lists.sourceforge.net
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Joel Becker <jlbec@evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-btrfs@vger.kernel.org
CC: linux-ext4@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-nilfs@vger.kernel.org
CC: linux-ntfs-dev@lists.sourceforge.net
CC: Mark Fasheh <mfasheh@suse.com>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: ocfs2-devel@oss.oracle.com
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: xfs@oss.sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-05 16:00 [PATCH 00/19] Fix filesystem freezing deadlocks Jan Kara
@ 2012-03-05 16:01 ` Jan Kara
  2012-03-08 23:20   ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

Generic code now blocks all writers from standard write paths. So we block all
writers coming from ioctl and replace blocking of transactions on frozen
filesystem with a debugging check. As a bonus, we get a protection of ioctl
against racing remount read-only. We also convert xfs_file_aio_write() to a
non-racy freeze protection.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c    |   10 ++++++--
 fs/xfs/xfs_ioctl.c   |   55 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_ioctl32.c |   12 ++++++++++
 fs/xfs/xfs_iomap.c   |    1 -
 fs/xfs/xfs_mount.c   |    2 +-
 fs/xfs/xfs_mount.h   |    3 --
 fs/xfs/xfs_sync.c    |    2 +-
 fs/xfs/xfs_trans.c   |    2 +-
 8 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7e5bc87..57dd20e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -874,10 +874,12 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		ret = -EIO;
+		goto out;
+	}
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
@@ -896,6 +898,8 @@ xfs_file_aio_write(
 			ret = err;
 	}
 
+out:
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 76f3ca5..7890105 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -367,9 +367,15 @@ xfs_fssetdm_by_handle(
 	if (copy_from_user(&dmhreq, arg, sizeof(xfs_fsop_setdm_handlereq_t)))
 		return -XFS_ERROR(EFAULT);
 
+	error = mnt_want_write_file(parfilp);
+	if (error)
+		return error;
+
 	dentry = xfs_handlereq_to_dentry(parfilp, &dmhreq.hreq);
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry)) {
+		mnt_drop_write_file(parfilp);
 		return PTR_ERR(dentry);
+	}
 
 	if (IS_IMMUTABLE(dentry->d_inode) || IS_APPEND(dentry->d_inode)) {
 		error = -XFS_ERROR(EPERM);
@@ -385,6 +391,7 @@ xfs_fssetdm_by_handle(
 				 fsd.fsd_dmstate);
 
  out:
+	mnt_drop_write_file(parfilp);
 	dput(dentry);
 	return error;
 }
@@ -631,7 +638,11 @@ xfs_ioc_space(
 	if (ioflags & IO_INVIS)
 		attr_flags |= XFS_ATTR_DMI;
 
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
 	error = xfs_change_file_space(ip, cmd, bf, filp->f_pos, attr_flags);
+	mnt_drop_write_file(filp);
 	return -error;
 }
 
@@ -1160,6 +1171,7 @@ xfs_ioc_fssetxattr(
 {
 	struct fsxattr		fa;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&fa, arg, sizeof(fa)))
 		return -EFAULT;
@@ -1168,7 +1180,12 @@ xfs_ioc_fssetxattr(
 	if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
 		mask |= FSX_NONBLOCK;
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1193,6 +1210,7 @@ xfs_ioc_setxflags(
 	struct fsxattr		fa;
 	unsigned int		flags;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&flags, arg, sizeof(flags)))
 		return -EFAULT;
@@ -1207,7 +1225,12 @@ xfs_ioc_setxflags(
 		mask |= FSX_NONBLOCK;
 	fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1382,8 +1405,13 @@ xfs_file_ioctl(
 		if (copy_from_user(&dmi, arg, sizeof(dmi)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask,
 				dmi.fsd_dmstate);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1431,7 +1459,11 @@ xfs_file_ioctl(
 
 		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1460,9 +1492,14 @@ xfs_file_ioctl(
 		if (copy_from_user(&inout, arg, sizeof(inout)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		/* input parameter is passed in resblks field of structure */
 		in = inout.resblks;
 		error = xfs_reserve_blocks(mp, &in, &inout);
+		mnt_drop_write_file(filp);
 		if (error)
 			return -error;
 
@@ -1493,7 +1530,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1503,7 +1544,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_log(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1513,7 +1558,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index f9ccb7b..2012369 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -602,7 +602,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_data_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSGROWFSRT_32: {
@@ -610,7 +614,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_rt_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 #endif
@@ -629,7 +637,11 @@ xfs_file_compat_ioctl(
 				   offsetof(struct xfs_swapext, sx_stat)) ||
 		    xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSBULKSTAT_32:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..7ab98d9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,7 +679,6 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
 		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
 		tp->t_flags |= XFS_TRANS_RESERVE;
 		error = xfs_trans_reserve(tp, resblks,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..bead529 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1541,7 +1541,7 @@ xfs_unmountfs(
 int
 xfs_fs_writable(xfs_mount_t *mp)
 {
-	return !(xfs_test_for_freeze(mp) || XFS_FORCED_SHUTDOWN(mp) ||
+	return !(mp->m_super->s_writers.frozen || XFS_FORCED_SHUTDOWN(mp) ||
 		(mp->m_flags & XFS_MOUNT_RDONLY));
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 19f69e2..f0afeb9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -310,9 +310,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_REMOTE_REQ	0x0010	/* shutdown came from remote cell */
 #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
-#define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l)	vfs_check_frozen((mp)->m_super, (l))
-
 /*
  * Flags for xfs_mountfs
  */
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 40b75ee..c44d687 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -498,7 +498,7 @@ xfs_sync_worker(
 
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
-		if (mp->m_super->s_frozen == SB_UNFROZEN &&
+		if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
 			error = xfs_fs_log_dummy(mp);
 		else
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 329b06a..6468a2a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,7 +577,6 @@ xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type)
 {
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
 	return _xfs_trans_alloc(mp, type, KM_SLEEP);
 }
 
@@ -589,6 +588,7 @@ _xfs_trans_alloc(
 {
 	xfs_trans_t	*tp;
 
+	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-05 16:01 ` [PATCH 11/19] xfs: Convert to new freezing code Jan Kara
@ 2012-03-08 23:20   ` Dave Chinner
  2012-03-09  8:23     ` Jan Kara
  2012-03-09 14:22     ` Jan Kara
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2012-03-08 23:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, dchinner, linux-fsdevel

On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> Generic code now blocks all writers from standard write paths. So we block all
> writers coming from ioctl and replace blocking of transactions on frozen
> filesystem with a debugging check. As a bonus, we get a protection of ioctl
> against racing remount read-only. We also convert xfs_file_aio_write() to a
> non-racy freeze protection.
....
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 329b06a..6468a2a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -577,7 +577,6 @@ xfs_trans_alloc(
>  	xfs_mount_t	*mp,
>  	uint		type)
>  {
> -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
>  }

So what is there to stop internal XFS threads from starting
transactions when the filesystem is frozen? Previously this
SB_FREEZE_TRANS would guarantee even internal fucntions would get
stopped, but now there's nothing?

I do beleive that ext4 has the same problem (the issue reported with
the lazy inode init background thread), and I can see that any other
filesystem that can make modifications via internal triggers will
see the same problem - freeze doesn't block them any more...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-08 23:20   ` Dave Chinner
@ 2012-03-09  8:23     ` Jan Kara
  2012-03-09 14:22     ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-03-09  8:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> >  	xfs_mount_t	*mp,
> >  	uint		type)
> >  {
> > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> >  }
> 
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal fucntions would get
> stopped, but now there's nothing?
> 
> I do beleive that ext4 has the same problem (the issue reported with
> the lazy inode init background thread), and I can see that any other
> filesystem that can make modifications via internal triggers will
> see the same problem - freeze doesn't block them any more...
  Yes, I'll have to audit filesystem internal threads. I forgot about these
in the first round. Most of them shouldn't have anything to do because the
filesystem is clean but there are exceptions as ext4 lazyinit has shown.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-08 23:20   ` Dave Chinner
  2012-03-09  8:23     ` Jan Kara
@ 2012-03-09 14:22     ` Jan Kara
  2012-03-11 22:45       ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kara @ 2012-03-09 14:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> >  	xfs_mount_t	*mp,
> >  	uint		type)
> >  {
> > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> >  }
> 
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal fucntions would get
> stopped, but now there's nothing?
  I've checked XFS now and I didn't find any work that would be done on a
clean filesystem. I found:
_xfs_mru_cache_reap() - doesn't seem to do any IO
xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
  reads can happen on a frozen filesystem
xfs_reclaim_worker() - everything is clean so no IO should happen
xfs_flush_worker() - everything is clean so no IO should happen
xfs_sync_worker() - again the same if I understand the code right

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-09 14:22     ` Jan Kara
@ 2012-03-11 22:45       ` Dave Chinner
  2012-03-12 17:55         ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2012-03-11 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, dchinner, linux-fsdevel

On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > Generic code now blocks all writers from standard write paths. So we block all
> > > writers coming from ioctl and replace blocking of transactions on frozen
> > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > non-racy freeze protection.
> > ....
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 329b06a..6468a2a 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > >  	xfs_mount_t	*mp,
> > >  	uint		type)
> > >  {
> > > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > >  }
> > 
> > So what is there to stop internal XFS threads from starting
> > transactions when the filesystem is frozen? Previously this
> > SB_FREEZE_TRANS would guarantee even internal fucntions would get
> > stopped, but now there's nothing?
>   I've checked XFS now and I didn't find any work that would be done on a
> clean filesystem. I found:
> _xfs_mru_cache_reap() - doesn't seem to do any IO

Causes iput() on inodes, which if dropping the last reference to the
inode can cause them to enter reclaim and hence have transactions
run on them to either truncate blocks beyonds EOF or to do the
second phase of an unlink(). So it will definitely break the freeze
model.

> xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
>   reads can happen on a frozen filesystem
> xfs_reclaim_worker() - everything is clean so no IO should happen

It will do work as other things push inodes into reclaim. e.g.
filestreams inode expiry. And it will still run to reclaim clean
inodes...

> xfs_flush_worker() - everything is clean so no IO should happen
> xfs_sync_worker() - again the same if I understand the code right

xfs_sync_worker() will always trigger a log force, so if there is
anything that has dirtied the journal, it will trigger IO. We have
no protection against the journal being dirtied anymore, so no
guarantees can be given here.

Basically, your patchset creates a "shell" around the outside of the
filesystem that catches all the external modifications that can
occur through the VFS and ioctls. The "shell" is now the only layer
of defense because the patchset removes the layer of protection that
prevents internal modifications from occurring.

For example, the XFS patch added a bunch of protection to ioctls on
XFS that only modify file metadata and so never were protected
against data freezes - they were all implicitly protected by the inner
transaction subsystem freeze. All of the above cases were protected
by the inner layer of defense, which is now gone.

Not to mention the shrinkers. The inode cache shrinkers can
cause inodes to enter reclaim which can trigger EOF truncation just
like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
IO, and dquots will be dirtied by EOF truncation. Same with buffers,
and the buffer cache shrinker.

I'm sure other filesystems have just as complex or even more complex
internal paths that trigger dirtying and IO that the "freeze shell"
model does not prevent. I don't think auditing is good enough - the
shell model is, IMO, too easy to break and too hard to test for
breakage...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-11 22:45       ` Dave Chinner
@ 2012-03-12 17:55         ` Jan Kara
  2012-03-12 23:48           ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2012-03-12 17:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > > Generic code now blocks all writers from standard write paths. So we block all
> > > > writers coming from ioctl and replace blocking of transactions on frozen
> > > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > > non-racy freeze protection.
> > > ....
> > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > index 329b06a..6468a2a 100644
> > > > --- a/fs/xfs/xfs_trans.c
> > > > +++ b/fs/xfs/xfs_trans.c
> > > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > > >  	xfs_mount_t	*mp,
> > > >  	uint		type)
> > > >  {
> > > > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > > >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > > >  }
> > > 
> > > So what is there to stop internal XFS threads from starting
> > > transactions when the filesystem is frozen? Previously this
> > > SB_FREEZE_TRANS would guarantee even internal fucntions would get
> > > stopped, but now there's nothing?
> >   I've checked XFS now and I didn't find any work that would be done on a
> > clean filesystem. I found:
> > _xfs_mru_cache_reap() - doesn't seem to do any IO
> 
> Causes iput() on inodes, which if dropping the last reference to the
> inode can cause them to enter reclaim and hence have transactions
> run on them to either truncate blocks beyonds EOF or to do the
> second phase of an unlink(). So it will definitely break the freeze
> model.
  OK, I've missed this. I'll fix it.

> > xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
> >   reads can happen on a frozen filesystem
> > xfs_reclaim_worker() - everything is clean so no IO should happen
> 
> It will do work as other things push inodes into reclaim. e.g.
> filestreams inode expiry. And it will still run to reclaim clean
> inodes...
  Sure. But reclaim will start a transaction only if inode is dirty. And
inode must not be dirty on a frozen filesystem.
 
> > xfs_flush_worker() - everything is clean so no IO should happen
> > xfs_sync_worker() - again the same if I understand the code right
> 
> xfs_sync_worker() will always trigger a log force, so if there is
> anything that has dirtied the journal, it will trigger IO. We have
> no protection against the journal being dirtied anymore, so no
> guarantees can be given here.
  Yes, it would be a bug if something dirtied a journal while the
filesystem is frozen. To catch issues like this, I've added WARN_ON in
xfs_trans_alloc() to actually warn when transaction would be started on a
frozen filesystem. My testing never triggered it with the latest patch set.

> Basically, your patchset creates a "shell" around the outside of the
> filesystem that catches all the external modifications that can
> occur through the VFS and ioctls. The "shell" is now the only layer
> of defense because the patchset removes the layer of protection that
> prevents internal modifications from occurring.
  Yes. I'd just note that the "shell" is there already to provide reliable
remounting read only. I just had to change some properties of the shell to
be usable for freezing as well. But the shell has to be maintained
regardless of how we decide to do freezing.

Also believe me that I've initially tried to fix the freezing without the
external shell. But it just isn't enough to prevent dirty inodes from being
created (e.g. from directory operations) while sync during freezing is
running.

Sure I could keep the freezing mechanism on transaction start. But it
seemed like a cleaner approach to me to protect all the places properly
with the generic mechanism than having two mechanisms and unclear
(especially locking) interactions between them.

But in the end, if you guys really feel strongly about it and decide that
XFS wants to keep it's mechanism of freezing on transaction start, then I
won't stop you. Although it would mean XFS would have to have three
counters to protect freezing while other filesystems will need only two.
Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
but I don't know XFS internals enough to really argue...

> For example, the XFS patch added a bunch of protection to ioctls on
> XFS that only modify file metadata and so never were protected
> against data freezes - they were all implicitly protected by the inner
> transaction subsystem freeze. All of the above cases were protected
> by the inner layer of defense, which is now gone.
  Yes, but all these ioctls are currently buggy with respect to remounting
read only (not the whole filesystem, just one mountpoint of many where
filesystem may be mounted). So I'd argue that these particular changes are
actually bug fixes.

> Not to mention the shrinkers. The inode cache shrinkers can
> cause inodes to enter reclaim which can trigger EOF truncation just
> like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
> IO, and dquots will be dirtied by EOF truncation. Same with buffers,
> and the buffer cache shrinker.
  OK, now I see that even iput() of a clean inode can cause truncation to
happen for XFS. Thanks for the pointer. I'll see what we could do about
this.

> I'm sure other filesystems have just as complex or even more complex
> internal paths that trigger dirtying and IO that the "freeze shell"
> model does not prevent. I don't think auditing is good enough - the
> shell model is, IMO, too easy to break and too hard to test for
> breakage...
  Well, actually I don't know filesystem which would have as complex
interactions as XFS has. But maybe there are some. In either case I don't
think maintaining the "shell" is that hard. You have to maintain the one
facing to userspace anyway due to remounting. For the internal filesystem
threads, you have to be aware when you can cause write and be protected
against freezing in that case. I don't think it's that complex, although I
agree that the current test coverage will be small (freezing isn't used
very often) so bugs can go unnoticed for a long time.

Maybe we could add a test in xfs_trans_alloc() checking that fs is
protected against freezing? That would increase the test coverage a lot.
Although the test should be probably enabled only for XFS_DEBUG because it
will be an expensive one.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-12 17:55         ` Jan Kara
@ 2012-03-12 23:48           ` Dave Chinner
  2012-03-13 21:30             ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2012-03-12 23:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, dchinner, linux-fsdevel

On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > It will do work as other things push inodes into reclaim. e.g.
> > filestreams inode expiry. And it will still run to reclaim clean
> > inodes...
>   Sure. But reclaim will start a transaction only if inode is dirty. And
> inode must not be dirty on a frozen filesystem.

That's what I've been trying to tell you - even clean inodes can be
dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
enough guarantee.  And not just through the shrinkers - just closing
a file descriptor can trigger truncation as well via .release. See
xfs_release:

	/* If this is a read-only mount, don't do this (would generate I/O) */
	if (mp->m_flags & XFS_MOUNT_RDONLY)
		goto out;

> > > xfs_flush_worker() - everything is clean so no IO should happen
> > > xfs_sync_worker() - again the same if I understand the code right
> > 
> > xfs_sync_worker() will always trigger a log force, so if there is
> > anything that has dirtied the journal, it will trigger IO. We have
> > no protection against the journal being dirtied anymore, so no
> > guarantees can be given here.
>   Yes, it would be a bug if something dirtied a journal while the
> filesystem is frozen. To catch issues like this, I've added WARN_ON in
> xfs_trans_alloc() to actually warn when transaction would be started on a
> frozen filesystem. My testing never triggered it with the latest patch set.

Just hold a file descriptor open for a large write over the freeze
(don't get it caught in the freeze), and then close it after the
filesystem is frozen. Perhaps even drop the slab caches.

> > Basically, your patchset creates a "shell" around the outside of the
> > filesystem that catches all the external modifications that can
> > occur through the VFS and ioctls. The "shell" is now the only layer
> > of defense because the patchset removes the layer of protection that
> > prevents internal modifications from occurring.
>   Yes. I'd just note that the "shell" is there already to provide reliable
> remounting read only.

But it doesn't provide reliable read-only behaviour - internal
filesystem triggers still need to check and disarm to make read-only
mounts reliable. Indeed, XFS has many internal checks for read only
mounts, and the places we are talking about here are similar to the
places where you've removed freeze protection from by gutting the
transaction level freezing.

> I just had to change some properties of the shell to
> be usable for freezing as well. But the shell has to be maintained
> regardless of how we decide to do freezing.
> 
> Also believe me that I've initially tried to fix the freezing without the
> external shell. But it just isn't enough to prevent dirty inodes from being
> created (e.g. from directory operations) while sync during freezing is
> running.

Sure - the old mechanism was "freeze data" which only protected data
modification paths, followed by "freeze trans" which protected
metadata modification paths (like directory operations). You started
by removing the metadata modification path protection - it is any
wonder that those operations continued to dirty inodes until you
expanded the "freeze data" shell to mean "freeze data and metadata"?

> Sure I could keep the freezing mechanism on transaction start. But it
> seemed like a cleaner approach to me to protect all the places properly
> with the generic mechanism than having two mechanisms and unclear
> (especially locking) interactions between them.

Once again, that's a view that does not take into account that
modifcations don't all come through the VFS.

It's like the hard shelli/soft center security model - it protects
well from external attackers, but once they get inside, there's no
protection. Indeed, there is zero protection from inside jobs, and
thats where multiple layers of defense are needed.

Your freeze changes provide us with a hard outer shell, but it's got
a really, really soft center. We can't use the outer shell defenses
deep inside the shell because of the locking orders that have been
defined (freeze protection must be outermost), so we need another
layer....

> But in the end, if you guys really feel strongly about it and decide that
> XFS wants to keep it's mechanism of freezing on transaction start, then I
> won't stop you. Although it would mean XFS would have to have three
> counters to protect freezing while other filesystems will need only two.

There is two counters only to avoid lockdep problems because the
different entry points to the outer shell have different locking
orders. I fail to see why adding a third counter to provide an
innermost freeze lock that retains the current level of protection
is a big deal because all that it really needs to work is a
different lock order.

Also, ext4 could retain it's current outer/inner protection, so I
don't see that this behaviour is XFS specific.....

> Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
> but I don't know XFS internals enough to really argue...

... especially given the reported ext4 regression. I have much less
confidence than you that the outer shell is sufficient for ext4 or
any other complex filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-12 23:48           ` Dave Chinner
@ 2012-03-13 21:30             ` Jan Kara
  2012-03-14  3:00               ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2012-03-13 21:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > > It will do work as other things push inodes into reclaim. e.g.
> > > filestreams inode expiry. And it will still run to reclaim clean
> > > inodes...
> >   Sure. But reclaim will start a transaction only if inode is dirty. And
> > inode must not be dirty on a frozen filesystem.
> 
> That's what I've been trying to tell you - even clean inodes can be
> dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
> enough guarantee.  And not just through the shrinkers - just closing
> a file descriptor can trigger truncation as well via .release. See
> xfs_release:
> 
> 	/* If this is a read-only mount, don't do this (would generate I/O) */
> 	if (mp->m_flags & XFS_MOUNT_RDONLY)
> 		goto out;
  Yes, I've noticed this while writing the email but forgot to update the
beginning. Sorry for confusion.

> > > Basically, your patchset creates a "shell" around the outside of the
> > > filesystem that catches all the external modifications that can
> > > occur through the VFS and ioctls. The "shell" is now the only layer
> > > of defense because the patchset removes the layer of protection that
> > > prevents internal modifications from occurring.
> >   Yes. I'd just note that the "shell" is there already to provide reliable
> > remounting read only.
> 
> But it doesn't provide reliable read-only behaviour - internal
> filesystem triggers still need to check and disarm to make read-only
> mounts reliable. Indeed, XFS has many internal checks for read only
> mounts, and the places we are talking about here are similar to the
> places where you've removed freeze protection from by gutting the
> transaction level freezing.
  And indeed your read-only checks are racy the same way freezing was...
But that's a different story. You have actually a good point that freezing
protection and read-only mount protection should be at the same places.

> > I just had to change some properties of the shell to
> > be usable for freezing as well. But the shell has to be maintained
> > regardless of how we decide to do freezing.
> > 
> > Also believe me that I've initially tried to fix the freezing without the
> > external shell. But it just isn't enough to prevent dirty inodes from being
> > created (e.g. from directory operations) while sync during freezing is
> > running.
> 
> Sure - the old mechanism was "freeze data" which only protected data
> modification paths, followed by "freeze trans" which protected
> metadata modification paths (like directory operations). You started
> by removing the metadata modification path protection - it is any
> wonder that those operations continued to dirty inodes until you
> expanded the "freeze data" shell to mean "freeze data and metadata"?
  That's just not true. Remember older versions of this patch set. Old
model is doing:
   freeze data
   sync
   freeze metadata (via freezing transations)
and it is broken mainly because while sync is running, new metadata are
dirtied and thus you end up with dirty metadata on a frozen filesystem.
Forcing the journal commit in ->freeze_fs() pushes changes to disk but
inodes still remain dirty and actually, it can be too late for flusher
thread anyway because it can be already hanging trying to start a
transaction.

Then there are also other less severe problems like that mark_inode_dirty()
could race with freezing in such a way that it was added to dirty list
after filesystem was fully frozen. 

> > Sure I could keep the freezing mechanism on transaction start. But it
> > seemed like a cleaner approach to me to protect all the places properly
> > with the generic mechanism than having two mechanisms and unclear
> > (especially locking) interactions between them.
> 
> Once again, that's a view that does not take into account that
> modifcations don't all come through the VFS.
  "all the places" in my sentence meant "including the internal sources of
modification"...

> It's like the hard shelli/soft center security model - it protects
> well from external attackers, but once they get inside, there's no
> protection. Indeed, there is zero protection from inside jobs, and
> thats where multiple layers of defense are needed.
> 
> Your freeze changes provide us with a hard outer shell, but it's got
> a really, really soft center. We can't use the outer shell defenses
> deep inside the shell because of the locking orders that have been
> defined (freeze protection must be outermost), so we need another
> layer....
  I think there are two different issues. One is your complaint that
extending the shell to cover also internal sources of fs modification will
be too fragile if not done implicitely on transaction start.

Another is a question whether we could reasonably fend off internal
modification using the current two counters. The MRU thread or ->release
can be easily handled with them (->release can be handled by the counter
ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
outer counter won't work. The inner one with lock ranking just below
mmap_sem might work - depends on whether there are any XFS locks under
mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
issues might be nasty enough to warrant another counter.

This is connected with another somewhat related issue: Do we want shrinkers
to block on frozen filesystem? That could result in unrelated processes
blocking on frozen filesystem. I'd find it nicer if we just skipped
the writes if the filesystem is frozen as it seems to me
xfs_free_eofblocks() is somewhat optional. Is it true?

> > But in the end, if you guys really feel strongly about it and decide that
> > XFS wants to keep it's mechanism of freezing on transaction start, then I
> > won't stop you. Although it would mean XFS would have to have three
> > counters to protect freezing while other filesystems will need only two.
> 
> There is two counters only to avoid lockdep problems because the
> different entry points to the outer shell have different locking
> orders. I fail to see why adding a third counter to provide an
> innermost freeze lock that retains the current level of protection
> is a big deal because all that it really needs to work is a
> different lock order.
  The two counters are needed to avoid deadlocks, not just because of
lockdep... The third counter can be added but per-cpu counters are not
exactly cheap (in terms of memory) so I don't want to add it without a good
reason. Lock ordering constraints in shrinkers might be this good reason.

Also fitting the XFS transaction system to freezing was a bit ugly (because
you need to log a dummy transaction on a frozen filesystem and you have
things like xfs_trans_dup() which requires us to bump the counter even
though the filesystem is already in freezing process) so I'd be happier if
we could avoid it...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-13 21:30             ` Jan Kara
@ 2012-03-14  3:00               ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2012-03-14  3:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, linux-fsdevel

On Tue, Mar 13, 2012 at 10:30:20PM +0100, Jan Kara wrote:
> On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> > On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > I just had to change some properties of the shell to
> > > be usable for freezing as well. But the shell has to be maintained
> > > regardless of how we decide to do freezing.
> > > 
> > > Also believe me that I've initially tried to fix the freezing without the
> > > external shell. But it just isn't enough to prevent dirty inodes from being
> > > created (e.g. from directory operations) while sync during freezing is
> > > running.
> > 
> > Sure - the old mechanism was "freeze data" which only protected data
> > modification paths, followed by "freeze trans" which protected
> > metadata modification paths (like directory operations). You started
> > by removing the metadata modification path protection - it is any
> > wonder that those operations continued to dirty inodes until you
> > expanded the "freeze data" shell to mean "freeze data and metadata"?
>   That's just not true. Remember older versions of this patch set. Old
> model is doing:
>    freeze data
>    sync
>    freeze metadata (via freezing transations)
> and it is broken mainly because while sync is running, new metadata are
> dirtied and thus you end up with dirty metadata on a frozen filesystem.

But that exactly the issue that the filesystem .freeze_fs
method is supposed to handle.  It's supposed to finish the
freeze process by finishing all outstanding metadata operations and
cleaning all remaining dirty metadata. It still has to guarantee
this regardless of the model used prior to this.

> Forcing the journal commit in ->freeze_fs() pushes changes to disk but
> inodes still remain dirty and actually, it can be too late for flusher
> thread anyway because it can be already hanging trying to start a
> transaction.

If the changes are already in the journal, then there is no new
transaction work needed to clean the VFS state.  Indeed, .freezefs
needs to write back all dirty metadata objects that it is tracking
internally in the journal so that a snapshot of the frozen image is
consistent and can be mounted read-only without running recovery...

> > It's like the hard shelli/soft center security model - it protects
> > well from external attackers, but once they get inside, there's no
> > protection. Indeed, there is zero protection from inside jobs, and
> > thats where multiple layers of defense are needed.
> > 
> > Your freeze changes provide us with a hard outer shell, but it's got
> > a really, really soft center. We can't use the outer shell defenses
> > deep inside the shell because of the locking orders that have been
> > defined (freeze protection must be outermost), so we need another
> > layer....
>   I think there are two different issues. One is your complaint that
> extending the shell to cover also internal sources of fs modification will
> be too fragile if not done implicitely on transaction start.

Most definitely. It means that we have to have freeze behaviour in
mind whether doing any async/background work. That's usually the
furthest think from people's minds when implementing stuff like
background inode initialisation...

> Another is a question whether we could reasonably fend off internal
> modification using the current two counters. The MRU thread or ->release
> can be easily handled with them (->release can be handled by the counter
> ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
> outer counter won't work. The inner one with lock ranking just below
> mmap_sem might work - depends on whether there are any XFS locks under
> mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
> issues might be nasty enough to warrant another counter.

Might be nasty? There's no question about it in my mind that it
*would* be nasty....

> This is connected with another somewhat related issue: Do we want shrinkers
> to block on frozen filesystem? That could result in unrelated processes
> blocking on frozen filesystem. I'd find it nicer if we just skipped
> the writes if the filesystem is frozen as it seems to me
> xfs_free_eofblocks() is somewhat optional. Is it true?

Not really. It needs to be done on rw filesytems[*].

Besides, this isn't really an XFS specific problem - the shrinkers
need to operate on frozen filesystems. e.g. run a directory
traversal of a frozen filesystem, and if you stop the shrinkers from
running then the dcache/icache growth will run the system out of
memory.

Sure, blocking a shrinker can be considered antisocial, but if you
can't free memory because it will dirty inodes on a frozen
filesystem then it is better to block the shrinker until the
filesytem is unfrozen than it is to allow the shrinker to fail and
eventually OOM the machine. Freezing filesysetms is supposed to be a
short term, workload-invisible operation. Kicking the OOM killer
because you can't free inodes is far more antisocial than having
those applications pause while the fs is frozen.

[*] Right now we are discussing adding a new background/async worker
that that does speculative preallocation removal to make it occur
faster than it currently does. So skipping it is not really an
option and this is exactly the sort of "new feature needs to
consider how to deal with freeze" problem I'm talking about.

> > > But in the end, if you guys really feel strongly about it and decide that
> > > XFS wants to keep it's mechanism of freezing on transaction start, then I
> > > won't stop you. Although it would mean XFS would have to have three
> > > counters to protect freezing while other filesystems will need only two.
> > 
> > There is two counters only to avoid lockdep problems because the
> > different entry points to the outer shell have different locking
> > orders. I fail to see why adding a third counter to provide an
> > innermost freeze lock that retains the current level of protection
> > is a big deal because all that it really needs to work is a
> > different lock order.
>   The two counters are needed to avoid deadlocks, not just because of
> lockdep... The third counter can be added but per-cpu counters are not
> exactly cheap (in terms of memory) so I don't want to add it without a good
> reason. Lock ordering constraints in shrinkers might be this good reason.

An additional percpu counter is far, far cheaper than the cost of
always needing to consider how not to break filesystem freezing for
the forseeable future.

> Also fitting the XFS transaction system to freezing was a bit ugly (because
> you need to log a dummy transaction on a frozen filesystem

Sure - that's to ensure the log is dirty so that open-but-unlinked
inodes are handled correctly by recovery if the system crashes while the
filesystem is frozen. But the last patch in your series would prevent
freezing filesystems while there are open-but-unlinked files
present, so all that grot could be removed.

Indeed, when that patch goes to mainline, the XFS code for
handling remount,ro and freeze can be made identical....

> and you have
> things like xfs_trans_dup() which requires us to bump the counter even
> though the filesystem is already in freezing process) so I'd be happier if
> we could avoid it...

The only additional infrastructure that you need to provide is a
"start_nested" operation for xfs_trans_dup() that increments the
percpu counter. I don't think that's a difficult problem to solve.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-28 23:43 [PATCH 00/19 v4] Fix filesystem freezing deadlocks Jan Kara
@ 2012-03-28 23:43 ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-03-28 23:43 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, xfs, Ben Myers,
	Al Viro, dchinner

Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c    |   10 ++++++--
 fs/xfs/xfs_ioctl.c   |   55 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_ioctl32.c |   12 ++++++++++
 fs/xfs/xfs_iomap.c   |    4 +-
 fs/xfs/xfs_mount.c   |    2 +-
 fs/xfs/xfs_mount.h   |    3 --
 fs/xfs/xfs_sync.c    |    2 +-
 fs/xfs/xfs_trans.c   |   17 ++++++++++++--
 fs/xfs/xfs_trans.h   |    2 +
 9 files changed, 91 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7e5bc87..57dd20e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -874,10 +874,12 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		ret = -EIO;
+		goto out;
+	}
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
@@ -896,6 +898,8 @@ xfs_file_aio_write(
 			ret = err;
 	}
 
+out:
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 76f3ca5..7890105 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -367,9 +367,15 @@ xfs_fssetdm_by_handle(
 	if (copy_from_user(&dmhreq, arg, sizeof(xfs_fsop_setdm_handlereq_t)))
 		return -XFS_ERROR(EFAULT);
 
+	error = mnt_want_write_file(parfilp);
+	if (error)
+		return error;
+
 	dentry = xfs_handlereq_to_dentry(parfilp, &dmhreq.hreq);
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry)) {
+		mnt_drop_write_file(parfilp);
 		return PTR_ERR(dentry);
+	}
 
 	if (IS_IMMUTABLE(dentry->d_inode) || IS_APPEND(dentry->d_inode)) {
 		error = -XFS_ERROR(EPERM);
@@ -385,6 +391,7 @@ xfs_fssetdm_by_handle(
 				 fsd.fsd_dmstate);
 
  out:
+	mnt_drop_write_file(parfilp);
 	dput(dentry);
 	return error;
 }
@@ -631,7 +638,11 @@ xfs_ioc_space(
 	if (ioflags & IO_INVIS)
 		attr_flags |= XFS_ATTR_DMI;
 
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
 	error = xfs_change_file_space(ip, cmd, bf, filp->f_pos, attr_flags);
+	mnt_drop_write_file(filp);
 	return -error;
 }
 
@@ -1160,6 +1171,7 @@ xfs_ioc_fssetxattr(
 {
 	struct fsxattr		fa;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&fa, arg, sizeof(fa)))
 		return -EFAULT;
@@ -1168,7 +1180,12 @@ xfs_ioc_fssetxattr(
 	if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
 		mask |= FSX_NONBLOCK;
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1193,6 +1210,7 @@ xfs_ioc_setxflags(
 	struct fsxattr		fa;
 	unsigned int		flags;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&flags, arg, sizeof(flags)))
 		return -EFAULT;
@@ -1207,7 +1225,12 @@ xfs_ioc_setxflags(
 		mask |= FSX_NONBLOCK;
 	fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1382,8 +1405,13 @@ xfs_file_ioctl(
 		if (copy_from_user(&dmi, arg, sizeof(dmi)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask,
 				dmi.fsd_dmstate);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1431,7 +1459,11 @@ xfs_file_ioctl(
 
 		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1460,9 +1492,14 @@ xfs_file_ioctl(
 		if (copy_from_user(&inout, arg, sizeof(inout)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		/* input parameter is passed in resblks field of structure */
 		in = inout.resblks;
 		error = xfs_reserve_blocks(mp, &in, &inout);
+		mnt_drop_write_file(filp);
 		if (error)
 			return -error;
 
@@ -1493,7 +1530,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1503,7 +1544,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_log(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1513,7 +1558,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index f9ccb7b..2012369 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -602,7 +602,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_data_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSGROWFSRT_32: {
@@ -610,7 +614,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_rt_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 #endif
@@ -629,7 +637,11 @@ xfs_file_compat_ioctl(
 				   offsetof(struct xfs_swapext, sx_stat)) ||
 		    xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSBULKSTAT_32:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..2c35710 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,9 +679,9 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
+		sb_start_intwrite(mp->m_super);
 		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
-		tp->t_flags |= XFS_TRANS_RESERVE;
+		tp->t_flags |= XFS_TRANS_RESERVE | XFS_TRANS_FREEZE_PROT;
 		error = xfs_trans_reserve(tp, resblks,
 				XFS_WRITE_LOG_RES(mp), 0,
 				XFS_TRANS_PERM_LOG_RES,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..bead529 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1541,7 +1541,7 @@ xfs_unmountfs(
 int
 xfs_fs_writable(xfs_mount_t *mp)
 {
-	return !(xfs_test_for_freeze(mp) || XFS_FORCED_SHUTDOWN(mp) ||
+	return !(mp->m_super->s_writers.frozen || XFS_FORCED_SHUTDOWN(mp) ||
 		(mp->m_flags & XFS_MOUNT_RDONLY));
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 19f69e2..f0afeb9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -310,9 +310,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_REMOTE_REQ	0x0010	/* shutdown came from remote cell */
 #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
-#define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l)	vfs_check_frozen((mp)->m_super, (l))
-
 /*
  * Flags for xfs_mountfs
  */
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 40b75ee..c44d687 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -498,7 +498,7 @@ xfs_sync_worker(
 
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
-		if (mp->m_super->s_frozen == SB_UNFROZEN &&
+		if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
 			error = xfs_fs_log_dummy(mp);
 		else
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 329b06a..12339b7 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,8 +577,12 @@ xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type)
 {
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-	return _xfs_trans_alloc(mp, type, KM_SLEEP);
+	xfs_trans_t     *tp;
+
+	sb_start_intwrite(mp->m_super);
+	tp = _xfs_trans_alloc(mp, type, KM_SLEEP);
+	tp->t_flags |= XFS_TRANS_FREEZE_PROT;
+	return tp;
 }
 
 xfs_trans_t *
@@ -589,6 +593,7 @@ _xfs_trans_alloc(
 {
 	xfs_trans_t	*tp;
 
+	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
@@ -612,6 +617,8 @@ xfs_trans_free(
 	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	atomic_dec(&tp->t_mountp->m_active_trans);
+	if (tp->t_flags & XFS_TRANS_FREEZE_PROT)
+		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
 	kmem_zone_free(xfs_trans_zone, tp);
 }
@@ -644,7 +651,11 @@ xfs_trans_dup(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 	ASSERT(tp->t_ticket != NULL);
 
-	ntp->t_flags = XFS_TRANS_PERM_LOG_RES | (tp->t_flags & XFS_TRANS_RESERVE);
+	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
+		       (tp->t_flags & XFS_TRANS_RESERVE) |
+		       (tp->t_flags & XFS_TRANS_FREEZE_PROT);
+	/* We gave our writer reference to the new transaction */
+	tp->t_flags &= ~XFS_TRANS_FREEZE_PROT;
 	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
 	ntp->t_blk_res = tp->t_blk_res - tp->t_blk_res_used;
 	tp->t_blk_res = tp->t_blk_res_used;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f611870..e42d94e 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -179,6 +179,8 @@ struct xfs_log_item_desc {
 #define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
 #define XFS_TRANS_DQ_DIRTY	0x10	/* at least one dquot in trx dirty */
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
+#define XFS_TRANS_FREEZE_PROT	0x40	/* Transaction has elevated writer
+					   count in superblock */
 
 /*
  * Values for call flags parameter.
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-03-28 23:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-05 16:00 [PATCH 00/19] Fix filesystem freezing deadlocks Jan Kara
2012-03-05 16:01 ` [PATCH 11/19] xfs: Convert to new freezing code Jan Kara
2012-03-08 23:20   ` Dave Chinner
2012-03-09  8:23     ` Jan Kara
2012-03-09 14:22     ` Jan Kara
2012-03-11 22:45       ` Dave Chinner
2012-03-12 17:55         ` Jan Kara
2012-03-12 23:48           ` Dave Chinner
2012-03-13 21:30             ` Jan Kara
2012-03-14  3:00               ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2012-03-28 23:43 [PATCH 00/19 v4] Fix filesystem freezing deadlocks Jan Kara
2012-03-28 23:43 ` [PATCH 11/19] xfs: Convert to new freezing code Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox