cluster-devel.redhat.com archive mirror
* [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks
@ 2012-04-16 16:13 Jan Kara
  2012-04-16 16:13 ` [Cluster-devel] [PATCH 05/27] gfs2: Push file_update_time() into gfs2_page_mkwrite() Jan Kara
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Jan Kara @ 2012-04-16 16:13 UTC
  To: cluster-devel.redhat.com

  Hello,

  here is the fifth iteration of my patches to improve filesystem freezing.
No serious changes since last time. Mostly I rebased the patches and merged this
series with the series moving file_update_time() to ->page_mkwrite() to simplify
testing and merging.

Filesystem freezing is currently racy and thus we can end up with dirty data on
a frozen filesystem (see the changelog of patch 13 for a detailed description of
the race). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved the
filesystems' file_update_time() call to the ->page_mkwrite callback (patches
01-07) and put freeze handling in mnt_want_write() / mnt_drop_write(). That
however required some code shuffling and changes to kern_path_create() (see
patches 09-12). I think the result is OK but opinions may differ ;). Another
advantage of this change is that all filesystems get freeze protection almost
for free - even ext2 can handle freezing well now.
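
To illustrate the idea of freeze protection at write entry points, here is a
minimal userspace sketch. It is not the kernel code from these patches: the
model_* names and the struct are hypothetical, and where the kernel would put
the caller to sleep this model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a superblock's freeze state. */
struct model_sb {
	bool frozen;   /* true once a freeze has completed */
	long writers;  /* callers currently inside a write section */
};

/* Enter a write section. The kernel would sleep here instead of failing. */
bool model_start_write(struct model_sb *sb)
{
	if (sb->frozen)
		return false;
	sb->writers++;
	return true;
}

/* Leave a write section entered with model_start_write(). */
void model_end_write(struct model_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* Freeze only when no writer is active. The kernel would wait for them. */
bool model_freeze(struct model_sb *sb)
{
	if (sb->frozen || sb->writers > 0)
		return false;
	sb->frozen = true;
	return true;
}

/* Thaw the filesystem so that new write sections may start again. */
void model_thaw(struct model_sb *sb)
{
	sb->frozen = false;
}
```

Because every path that can dirty an inode is bracketed by the start/end pair,
a freeze that succeeds in this model cannot overlap with an active writer,
which is the property the series is after.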

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files, which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such a filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this one is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on a frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs), so I'm
reasonably confident this could finally be the right solution.

Changes since v4:
  * added a couple of Acked-by's
  * added some comments & doc update
  * added patches from series "Push file_update_time() into .page_mkwrite"
    since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
    filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder@kernel.org>
CC: Anton Altaparmakov <anton@tuxera.com>
CC: Ben Myers <bpm@sgi.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: cluster-devel at redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: fuse-devel at lists.sourceforge.net
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Joel Becker <jlbec@evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-btrfs at vger.kernel.org
CC: linux-ext4 at vger.kernel.org
CC: linux-nfs at vger.kernel.org
CC: linux-nilfs at vger.kernel.org
CC: linux-ntfs-dev at lists.sourceforge.net
CC: Mark Fasheh <mfasheh@suse.com>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: ocfs2-devel at oss.oracle.com
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: xfs at oss.sgi.com




* [Cluster-devel] [PATCH 05/27] gfs2: Push file_update_time() into gfs2_page_mkwrite()
  2012-04-16 16:13 [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks Jan Kara
@ 2012-04-16 16:13 ` Jan Kara
  2012-04-16 16:13 ` [Cluster-devel] [PATCH 20/27] gfs2: Convert to new freezing mechanism Jan Kara
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2012-04-16 16:13 UTC
  To: cluster-devel.redhat.com

CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel at redhat.com
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/gfs2/file.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a3d2c9e..151d667 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -376,6 +376,9 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 */
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
+	/* Update file times before taking page lock */
+	file_update_time(vma->vm_file);
+
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
-- 
1.7.1




* [Cluster-devel] [PATCH 20/27] gfs2: Convert to new freezing mechanism
  2012-04-16 16:13 [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks Jan Kara
  2012-04-16 16:13 ` [Cluster-devel] [PATCH 05/27] gfs2: Push file_update_time() into gfs2_page_mkwrite() Jan Kara
@ 2012-04-16 16:13 ` Jan Kara
  2012-04-16 16:16 ` [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks Jan Kara
       [not found] ` <C9A0E2F0-ED57-40D6-9F75-5D72D45D21F0@dilger.ca>
  3 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2012-04-16 16:13 UTC
  To: cluster-devel.redhat.com

It is enough to update gfs2_page_mkwrite() to use the new freeze protection.
The rest is handled by the generic code.

CC: cluster-devel at redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/gfs2/file.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 151d667..87f5aa8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -370,11 +370,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 
-	/* Wait if fs is frozen. This is racy so we check again later on
-	 * and retry if the fs has been frozen after the page lock has
-	 * been acquired
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_pagefault(inode->i_sb);
 
 	/* Update file times before taking page lock */
 	file_update_time(vma->vm_file);
@@ -458,14 +454,9 @@ out:
 	gfs2_holder_uninit(&gh);
 	if (ret == 0) {
 		set_page_dirty(page);
-		/* This check must be post dropping of transaction lock */
-		if (inode->i_sb->s_frozen == SB_UNFROZEN) {
-			wait_on_page_writeback(page);
-		} else {
-			ret = -EAGAIN;
-			unlock_page(page);
-		}
+		wait_on_page_writeback(page);
 	}
+	sb_end_pagefault(inode->i_sb);
 	return block_page_mkwrite_return(ret);
 }
 
-- 
1.7.1
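
The cover letter mentions two-stage freezing to handle both mmapped writes and
other IO. The sketch below is a userspace model of that staging, not code from
the series. The stage names, the struct, and the staged_* helpers are all
hypothetical, and the ordering (ordinary writes blocked first, page faults
second) is an assumption of this example. As in the earlier model, a false
return stands in for the caller sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stage names; a higher value blocks more kinds of entry. */
enum stage { STAGE_UNFROZEN = 0, STAGE_WRITE = 1, STAGE_PAGEFAULT = 2 };

struct staged_sb {
	enum stage frozen;  /* highest stage currently blocked */
	long active[3];     /* active sections per stage; index 0 is unused */
};

/* Enter a section of the given stage unless that stage is already blocked. */
bool staged_start(struct staged_sb *sb, enum stage level)
{
	if (sb->frozen >= level)
		return false;  /* the kernel would sleep until thaw */
	sb->active[level]++;
	return true;
}

/* Leave a section entered with staged_start(). */
void staged_end(struct staged_sb *sb, enum stage level)
{
	assert(sb->active[level] > 0);
	sb->active[level]--;
}

/* Advance the freeze by one stage, but only once that stage has drained. */
bool staged_freeze_next(struct staged_sb *sb)
{
	int next;

	if (sb->frozen == STAGE_PAGEFAULT)
		return false;  /* already fully frozen in this model */
	next = sb->frozen + 1;
	if (sb->active[next] > 0)
		return false;  /* the kernel would wait for these to finish */
	sb->frozen = (enum stage)next;
	return true;
}
```

In this model a page fault section, like the one gfs2_page_mkwrite() now opens
with sb_start_pagefault(), can still be entered after the first stage has been
frozen, and is refused only once the second stage is reached.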




* [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks
  2012-04-16 16:13 [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks Jan Kara
  2012-04-16 16:13 ` [Cluster-devel] [PATCH 05/27] gfs2: Push file_update_time() into gfs2_page_mkwrite() Jan Kara
  2012-04-16 16:13 ` [Cluster-devel] [PATCH 20/27] gfs2: Convert to new freezing mechanism Jan Kara
@ 2012-04-16 16:16 ` Jan Kara
       [not found] ` <C9A0E2F0-ED57-40D6-9F75-5D72D45D21F0@dilger.ca>
  3 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2012-04-16 16:16 UTC
  To: cluster-devel.redhat.com

  The subject should have been [PATCH 00/27]... Sorry for the mistake.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR




* [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks
       [not found] ` <C9A0E2F0-ED57-40D6-9F75-5D72D45D21F0@dilger.ca>
@ 2012-04-17  9:32   ` Jan Kara
  2012-04-17 19:34     ` Joel Becker
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2012-04-17  9:32 UTC
  To: cluster-devel.redhat.com

On Mon 16-04-12 15:02:50, Andreas Dilger wrote:
> On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> > Another potential contention point might be patch 19. In that patch
> > we make freeze_super() refuse to freeze the filesystem when there
> > are open but unlinked files which may be impractical in some cases.
> > The main reason for this is the problem with handling of file deletion
> > from fput() called with mmap_sem held (e.g. from munmap(2)), and
> > then there's the fact that we cannot really force such filesystem
> > into a consistent state... But if people think that freezing with
> > open but unlinked files should happen, then I have some possible
> > solutions in mind (maybe as a separate patchset since this is
> > large enough).
> 
> Looking at a desktop system, I think it is very typical that there
> are open-unlinked files present, so I don't know if this is really
> an acceptable solution.  It isn't clear from your comments whether
> this is a blanket refusal for all open-unlinked files, or only in
> some particular cases...
  Thanks for looking at this. It is currently a blanket refusal. And I
agree it's problematic. There are two problems with open but unlinked
files.

One is that some old filesystems cannot get into a consistent state in the
presence of open but unlinked files, but for the filesystems we really care
about - xfs, ext4, ext3, btrfs, or even ocfs2, gfs2 - that is not a real issue
(these filesystems will delete those inodes on the next read-write mount).

The other problem is what should happen when you put the last inode
reference on a frozen filesystem. Two possibilities I see are:

a) block the iput() call - that is inconvenient because it can be
called in various contexts. I think we could possibly use the same level of
freeze protection as for page fault (this has changed since I originally
thought about this and that would make things simpler) but I'm not
completely sure.

b) let the iput() finish but have the filesystem keep the inode on its orphan
list (or its equivalent), so that the inode is deleted after the filesystem is
thawed. The advantage of this is that we don't have to block iput(), the
disadvantage is that we need filesystem support and not all filesystems
can do this.
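
Option b) can be sketched as a small userspace model. This is only an
illustration under stated assumptions: the orphan_* names, the fixed-size
array, and the counters are hypothetical, and a real filesystem would keep its
orphan list on disk rather than in a struct like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ORPHAN_MAX 16  /* arbitrary capacity of this model */

struct orphan_sb {
	bool frozen;
	unsigned long orphans[ORPHAN_MAX];  /* inode numbers awaiting deletion */
	size_t nr_orphans;
	size_t nr_deleted;                  /* deletions actually carried out */
};

/*
 * Final iput() of an unlinked inode: delete it right away when the
 * filesystem is not frozen, otherwise remember it for later.
 * Returns false only when the model's orphan array is full.
 */
bool orphan_iput_final(struct orphan_sb *sb, unsigned long ino)
{
	if (!sb->frozen) {
		sb->nr_deleted++;
		return true;
	}
	if (sb->nr_orphans == ORPHAN_MAX)
		return false;
	sb->orphans[sb->nr_orphans++] = ino;
	return true;
}

/* Thaw and carry out the deferred deletions. Returns how many there were. */
size_t orphan_thaw(struct orphan_sb *sb)
{
	size_t processed = sb->nr_orphans;

	sb->frozen = false;
	sb->nr_deleted += processed;
	sb->nr_orphans = 0;
	return processed;
}
```

The point of the model is that the final iput() never blocks: while frozen it
only records work, and nothing is modified until the thaw.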

Any thoughts?

								Honza
> 
> lsof | grep deleted
> nautilus  25393  adilger   19r      REG           253,0      340     253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
> nautilus  25393  adilger   20r      REG           253,0    32768     253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
> gnome-ter 25623  adilger   22u      REG            0,18    17841    2717846 /tmp/vtePIRJCW (deleted)
> gnome-ter 25623  adilger   23u      REG            0,18     5568    2717847 /tmp/vteDCSJCW (deleted)
> gnome-ter 25623  adilger   29u      REG            0,18      480    2728484 /tmp/vte6C1TCW (deleted)
  
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR




* [Cluster-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks
  2012-04-17  9:32   ` Jan Kara
@ 2012-04-17 19:34     ` Joel Becker
  0 siblings, 0 replies; 6+ messages in thread
From: Joel Becker @ 2012-04-17 19:34 UTC
  To: cluster-devel.redhat.com

On Tue, Apr 17, 2012 at 11:32:46AM +0200, Jan Kara wrote:
> On Mon 16-04-12 15:02:50, Andreas Dilger wrote:
> > On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> > > Another potential contention point might be patch 19. In that patch
> > > we make freeze_super() refuse to freeze the filesystem when there
> > > are open but unlinked files which may be impractical in some cases.
> > > The main reason for this is the problem with handling of file deletion
> > > from fput() called with mmap_sem held (e.g. from munmap(2)), and
> > > then there's the fact that we cannot really force such filesystem
> > > into a consistent state... But if people think that freezing with
> > > open but unlinked files should happen, then I have some possible
> > > solutions in mind (maybe as a separate patchset since this is
> > > large enough).
> > 
> > Looking at a desktop system, I think it is very typical that there
> > are open-unlinked files present, so I don't know if this is really
> > an acceptable solution.  It isn't clear from your comments whether
> > this is a blanket refusal for all open-unlinked files, or only in
> > some particular cases...
>   Thanks for looking at this. It is currently a blanket refusal. And I
> agree it's problematic. There are two problems with open but unlinked
> files.

	Let me add my name to the chorus of "we have to handle freezing
with open+unlinked, we cannot assume they don't exist."

> One is that some old filesystems cannot get in a consistent state in
> presence of open but unlinked files but for filesystems we really care
> about - xfs, ext4, ext3, btrfs, or even ocfs2, gfs2 - that is not a real
> issue (these filesystems will delete those inodes on next mount read-write).

	Others have pointed out that we can flag the safe filesystems.
I'd even be willing to say you can't freeze the unsafe filesystems.

> The other problem is with what should happen when you put last inode
> reference on a frozen filesystem. Two possibilities I see are:
> 
> a) block the iput() call - that is inconvenient because it can be
> called in various contexts. I think we could possibly use the same level of
> freeze protection as for page fault (this has changed since I originally
> thought about this and that would make things simpler) but I'm not
> completely sure.

	Given that frozen filesystems can stay that way for a while,
couldn't that lead to a million frozen df(1)s?  It's like your average
NFS network failure.

> b) let the iput finish but filesystem will keep inode on its orphan list
> (or its equivalent) and the inode will be deleted after the filesystem is
> thawed. The advantage of this is we don't have to block iput(), the
> disadvantage is we have to have filesystem support and not all filesystems
> can do this.

	Perhaps we handle iput() like unlinked.  If the filesystem can
handle it, we allow it, otherwise we block.

Joel

> 
> Any thoughts?
> 
> 								Honza
> > 
> > lsof | grep deleted
> > nautilus  25393  adilger   19r      REG           253,0      340     253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
> > nautilus  25393  adilger   20r      REG           253,0    32768     253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
> > gnome-ter 25623  adilger   22u      REG            0,18    17841    2717846 /tmp/vtePIRJCW (deleted)
> > gnome-ter 25623  adilger   23u      REG            0,18     5568    2717847 /tmp/vteDCSJCW (deleted)
> > gnome-ter 25623  adilger   29u      REG            0,18      480    2728484 /tmp/vte6C1TCW (deleted)
>   
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

-- 

"The first requisite of a good citizen in this republic of ours
 is that he shall be able and willing to pull his weight."
	- Theodore Roosevelt

			http://www.jlbec.org/
			jlbec at evilplan.org


