[patch 089/114] jbd2: Fix return value of jbd2_journal_start

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch 089/114] jbd2: Fix return value of jbd2_journal_start_commit()
       [not found] ` <20090314011649.GA26170@kroah.com>
@ 2009-03-14  1:11   ` Greg KH
  2009-03-14  1:11   ` [patch 090/114] Revert "ext4: wait on all pending commits in ext4_sync_fs()" Greg KH
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2009-03-14  1:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
	Chuck Ebbert, Domenico Andreoli, Willy Tarreau,
	Rodrigo Rubira Branco, Jake Edge, Eugene Teo, torvalds, akpm,
	alan, linux-ext4, Jan Kara, Eric Sandeen

[-- Attachment #1: jbd2-fix-return-value-of-jbd2_journal_start_commit.patch --]
[-- Type: text/plain, Size: 3646 bytes --]


2.6.28-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Jan Kara <jack@suse.cz>

(cherry picked from commit c88ccea3143975294f5a52097546bcbb75975f52)

The function jbd2_journal_start_commit() returns 1 if either a
transaction is committing or the function has queued a transaction
commit. But it returns 0 if we raced with somebody queueing the
transaction commit as well. This resulted in ext4_sync_fs() not
functioning correctly (description from Arthur Jones):

   In the case of a data=ordered umount with pending long symlinks
   which are delayed due to a long list of other I/O on the backing
   block device, this causes the buffer associated with the long
   symlinks to not be moved to the inode dirty list in the second
   phase of fsync_super.  Then, before they can be dirtied again,
   kjournald exits, seeing the UMOUNT flag and the dirty pages are
   never written to the backing block device, causing long symlink
   corruption and exposing new or previously freed block data to
   userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes jbd2_journal_start_commit() to always return 1 when
there's a transaction committing or queued for commit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: Eric Sandeen <sandeen@redhat.com>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/jbd2/journal.c |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -430,7 +430,7 @@ int __jbd2_log_space_left(journal_t *jou
 }
 
 /*
- * Called under j_state_lock.  Returns true if a transaction was started.
+ * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
 int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 {
@@ -498,7 +498,8 @@ int jbd2_journal_force_commit_nested(jou
 
 /*
  * Start a commit of the current running transaction (if any).  Returns true
- * if a transaction was started, and fills its tid in at *ptid
+ * if a transaction is going to be committed (or is currently already
+ * committing), and fills its tid in at *ptid
  */
 int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
 {
@@ -508,15 +509,19 @@ int jbd2_journal_start_commit(journal_t 
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		ret = __jbd2_log_start_commit(journal, tid);
-		if (ret && ptid)
+		__jbd2_log_start_commit(journal, tid);
+		/* There's a running transaction and we've just made sure
+		 * it's commit has been scheduled. */
+		if (ptid)
 			*ptid = tid;
-	} else if (journal->j_committing_transaction && ptid) {
+		ret = 1;
+	} else if (journal->j_committing_transaction) {
 		/*
 		 * If ext3_write_super() recently started a commit, then we
 		 * have to wait for completion of that transaction
 		 */
-		*ptid = journal->j_committing_transaction->t_tid;
+		if (ptid)
+			*ptid = journal->j_committing_transaction->t_tid;
 		ret = 1;
 	}
 	spin_unlock(&journal->j_state_lock);



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [patch 090/114] Revert "ext4: wait on all pending commits in ext4_sync_fs()"
       [not found] ` <20090314011649.GA26170@kroah.com>
  2009-03-14  1:11   ` [patch 089/114] jbd2: Fix return value of jbd2_journal_start_commit() Greg KH
@ 2009-03-14  1:11   ` Greg KH
  2009-03-14  1:11   ` [patch 091/114] jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() Greg KH
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2009-03-14  1:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
	Chuck Ebbert, Domenico Andreoli, Willy Tarreau,
	Rodrigo Rubira Branco, Jake Edge, Eugene Teo, torvalds, akpm,
	alan, linux-ext4, Jan Kara, Eric Sandeen

[-- Attachment #1: revert-ext4-wait-on-all-pending-commits-in-ext4_sync_fs.patch --]
[-- Type: text/plain, Size: 1580 bytes --]


2.6.28-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Jan Kara <jack@suse.cz>

(cherry picked from commit 9eddacf9e9c03578ef2c07c9534423e823d677f8)

This undoes commit 14ce0cb411c88681ab8f3a4c9caa7f42e97a3184.

Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore
ext4_force_commit() can unnecessarily create sync transaction which is
expensive so it's worthwhile to remove it when we can.

http://bugzilla.kernel.org/show_bug.cgi?id=12224

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/ext4/super.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2914,15 +2914,15 @@ static void ext4_write_super(struct supe
 
 static int ext4_sync_fs(struct super_block *sb, int wait)
 {
-	int ret = 0;
+	tid_t target;
 
 	trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
 	sb->s_dirt = 0;
-	if (wait)
-		ret = ext4_force_commit(sb);
-	else
-		jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
-	return ret;
+	if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
+		if (wait)
+			jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
+	}
+	return 0;
 }
 
 /*



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [patch 091/114] jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
       [not found] ` <20090314011649.GA26170@kroah.com>
  2009-03-14  1:11   ` [patch 089/114] jbd2: Fix return value of jbd2_journal_start_commit() Greg KH
  2009-03-14  1:11   ` [patch 090/114] Revert "ext4: wait on all pending commits in ext4_sync_fs()" Greg KH
@ 2009-03-14  1:11   ` Greg KH
  2009-03-14  1:11   ` [patch 097/114] ext4: Add fallback for find_group_flex Greg KH
  2009-03-14  1:11   ` [patch 098/114] ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin() Greg KH
  4 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2009-03-14  1:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
	Chuck Ebbert, Domenico Andreoli, Willy Tarreau,
	Rodrigo Rubira Branco, Jake Edge, Eugene Teo, torvalds, akpm,
	alan, Dan Carpenter, mfasheh, Jan Kara, linux-ext4, ocfs2-devel,
	Joel Becker

[-- Attachment #1: jbd2-avoid-possible-null-dereference-in-jbd2_journal_begin_ordered_truncate.patch --]
[-- Type: text/plain, Size: 5673 bytes --]


2.6.28-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Jan Kara <jack@suse.cz>

(cherry picked from commit 7f5aa215088b817add9c71914b83650bdd49f8a9)

If we race with commit code setting i_transaction to NULL, we could
possibly dereference it.  Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have.  So we have to
change the prototype of the function so that filesystem passes us the
journal pointer.  Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.

Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspitious code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/ext4/inode.c       |    6 ++++--
 fs/jbd2/transaction.c |   42 +++++++++++++++++++++++++++++++-----------
 fs/ocfs2/journal.h    |    6 ++++--
 include/linux/jbd2.h  |    3 ++-
 4 files changed, 41 insertions(+), 16 deletions(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -46,8 +46,10 @@
 static inline int ext4_begin_ordered_truncate(struct inode *inode,
 					      loff_t new_size)
 {
-	return jbd2_journal_begin_ordered_truncate(&EXT4_I(inode)->jinode,
-						   new_size);
+	return jbd2_journal_begin_ordered_truncate(
+					EXT4_SB(inode->i_sb)->s_journal,
+					&EXT4_I(inode)->jinode,
+					new_size);
 }
 
 static void ext4_invalidatepage(struct page *page, unsigned long offset);
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2050,26 +2050,46 @@ done:
 }
 
 /*
- * This function must be called when inode is journaled in ordered mode
- * before truncation happens. It starts writeout of truncated part in
- * case it is in the committing transaction so that we stand to ordered
- * mode consistency guarantees.
+ * File truncate and transaction commit interact with each other in a
+ * non-trivial way.  If a transaction writing data block A is
+ * committing, we cannot discard the data by truncate until we have
+ * written them.  Otherwise if we crashed after the transaction with
+ * write has committed but before the transaction with truncate has
+ * committed, we could see stale data in block A.  This function is a
+ * helper to solve this problem.  It starts writeout of the truncated
+ * part in case it is in the committing transaction.
+ *
+ * Filesystem code must call this function when inode is journaled in
+ * ordered mode before truncation happens and after the inode has been
+ * placed on orphan list with the new inode size. The second condition
+ * avoids the race that someone writes new data and we start
+ * committing the transaction after this function has been called but
+ * before a transaction for truncate is started (and furthermore it
+ * allows us to optimize the case where the addition to orphan list
+ * happens in the same transaction as write --- we don't have to write
+ * any data in such case).
  */
-int jbd2_journal_begin_ordered_truncate(struct jbd2_inode *inode,
+int jbd2_journal_begin_ordered_truncate(journal_t *journal,
+					struct jbd2_inode *jinode,
 					loff_t new_size)
 {
-	journal_t *journal;
-	transaction_t *commit_trans;
+	transaction_t *inode_trans, *commit_trans;
 	int ret = 0;
 
-	if (!inode->i_transaction && !inode->i_next_transaction)
+	/* This is a quick check to avoid locking if not necessary */
+	if (!jinode->i_transaction)
 		goto out;
-	journal = inode->i_transaction->t_journal;
+	/* Locks are here just to force reading of recent values, it is
+	 * enough that the transaction was not committing before we started
+	 * a transaction adding the inode to orphan list */
 	spin_lock(&journal->j_state_lock);
 	commit_trans = journal->j_committing_transaction;
 	spin_unlock(&journal->j_state_lock);
-	if (inode->i_transaction == commit_trans) {
-		ret = filemap_fdatawrite_range(inode->i_vfs_inode->i_mapping,
+	spin_lock(&journal->j_list_lock);
+	inode_trans = jinode->i_transaction;
+	spin_unlock(&journal->j_list_lock);
+	if (inode_trans == commit_trans) {
+		ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
 			new_size, LLONG_MAX);
 		if (ret)
 			jbd2_journal_abort(journal, ret);
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -445,8 +445,10 @@ static inline int ocfs2_jbd2_file_inode(
 static inline int ocfs2_begin_ordered_truncate(struct inode *inode,
 					       loff_t new_size)
 {
-	return jbd2_journal_begin_ordered_truncate(&OCFS2_I(inode)->ip_jinode,
-						   new_size);
+	return jbd2_journal_begin_ordered_truncate(
+				OCFS2_SB(inode->i_sb)->journal->j_journal,
+				&OCFS2_I(inode)->ip_jinode,
+				new_size);
 }
 
 #endif /* OCFS2_JOURNAL_H */
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1087,7 +1087,8 @@ extern int	   jbd2_journal_clear_err  (j
 extern int	   jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
 extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
-extern int	   jbd2_journal_begin_ordered_truncate(struct jbd2_inode *inode, loff_t new_size);
+extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
+				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
 extern void	   jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [patch 097/114] ext4: Add fallback for find_group_flex
       [not found] ` <20090314011649.GA26170@kroah.com>
                     ` (2 preceding siblings ...)
  2009-03-14  1:11   ` [patch 091/114] jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() Greg KH
@ 2009-03-14  1:11   ` Greg KH
  2009-03-14  1:11   ` [patch 098/114] ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin() Greg KH
  4 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2009-03-14  1:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
	Chuck Ebbert, Domenico Andreoli, Willy Tarreau,
	Rodrigo Rubira Branco, Jake Edge, Eugene Teo, torvalds, akpm,
	alan, Ext4 Developers List

[-- Attachment #1: ext4-add-fallback-for-find_group_flex.patch --]
[-- Type: text/plain, Size: 1459 bytes --]

2.6.28-stable review patch.  If anyone has any objections, please let us know.

------------------

From: "Theodore Ts'o" <tytso@mit.edu>

(cherry picked from commit 05bf9e839d9de4e8a094274a0a2fd07beb47eaf1)

This is a workaround for find_group_flex() which badly needs to be
replaced.  One of its problems (besides ignoring the Orlov algorithm)
is that it is a bit hyperactive about returning failure under
suspicious circumstances.  This can lead to spurious ENOSPC failures
even when there are inodes still available.

Work around this for now by retrying the search using
find_group_other() if find_group_flex() returns -1.  If
find_group_other() succeeds when find_group_flex() has failed, log a
warning message.

A better block/inode allocator that will fix this problem for real has
been queued up for the next merge window.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/ext4/ialloc.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -703,6 +703,13 @@ struct inode *ext4_new_inode(handle_t *h
 
 	if (sbi->s_log_groups_per_flex) {
 		ret2 = find_group_flex(sb, dir, &group);
+		if (ret2 == -1) {
+			ret2 = find_group_other(sb, dir, &group);
+			if (ret2 == 0 && printk_ratelimit())
+				printk(KERN_NOTICE "ext4: find_group_flex "
+				       "failed, fallback succeeded dir %lu\n",
+				       dir->i_ino);
+		}
 		goto got_group;
 	}
 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [patch 098/114] ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin()
       [not found] ` <20090314011649.GA26170@kroah.com>
                     ` (3 preceding siblings ...)
  2009-03-14  1:11   ` [patch 097/114] ext4: Add fallback for find_group_flex Greg KH
@ 2009-03-14  1:11   ` Greg KH
  4 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2009-03-14  1:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
	Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
	Chuck Ebbert, Domenico Andreoli, Willy Tarreau,
	Rodrigo Rubira Branco, Jake Edge, Eugene Teo, torvalds, akpm,
	alan, Ext4 Developers List, Jan Kara

[-- Attachment #1: ext4-fix-deadlock-in-ext4_write_begin-and-ext4_da_write_begin.patch --]
[-- Type: text/plain, Size: 1697 bytes --]


2.6.28-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Jan Kara <jack@suse.cz>

(cherry picked from commit ebd3610b110bbb18ea6f9f2aeed1e1068c537227)

Functions ext4_write_begin() and ext4_da_write_begin() call
grab_cache_page_write_begin() without AOP_FLAG_NOFS. Thus it
can happen that page reclaim is triggered in that function
and it recurses back into the filesystem (or some other filesystem).
But this can lead to various problems as a transaction is already
started at that point. Add the necessary flag.

http://bugzilla.kernel.org/show_bug.cgi?id=11688

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/ext4/inode.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1347,6 +1347,10 @@ retry:
 		goto out;
 	}
 
+	/* We cannot recurse into the filesystem as the transaction is already
+	 * started */
+	flags |= AOP_FLAG_NOFS;
+
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page) {
 		ext4_journal_stop(handle);
@@ -1356,7 +1360,7 @@ retry:
 	*pagep = page;
 
 	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
-							ext4_get_block);
+				ext4_get_block);
 
 	if (!ret && ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
@@ -2603,6 +2607,9 @@ retry:
 		ret = PTR_ERR(handle);
 		goto out;
 	}
+	/* We cannot recurse into the filesystem as the transaction is already
+	 * started */
+	flags |= AOP_FLAG_NOFS;
 
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page) {



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2009-03-14  1:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20090314010937.416083662@mini.kroah.org>
     [not found] ` <20090314011649.GA26170@kroah.com>
2009-03-14  1:11   ` [patch 089/114] jbd2: Fix return value of jbd2_journal_start_commit() Greg KH
2009-03-14  1:11   ` [patch 090/114] Revert "ext4: wait on all pending commits in ext4_sync_fs()" Greg KH
2009-03-14  1:11   ` [patch 091/114] jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() Greg KH
2009-03-14  1:11   ` [patch 097/114] ext4: Add fallback for find_group_flex Greg KH
2009-03-14  1:11   ` [patch 098/114] ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin() Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).