linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Theodore Ts'o <tytso@mit.edu>
To: Ben Hutchings <ben@decadent.org.uk>
Cc: George Barnett <gbarnett@atlassian.com>,
	linux-ext4@vger.kernel.org,
	Debian kernel maintainers <debian-kernel@lists.debian.org>
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
Date: Sun, 17 Mar 2013 22:54:01 -0400	[thread overview]
Message-ID: <20130318025401.GA12611@thunk.org> (raw)
In-Reply-To: <1363412062.3937.196.camel@deadeye.wl.decadent.org.uk>

On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote:
> > We use debian for a number of machines in our storage infrastructure
> > and we have recently been seeing a number of "hangs". We primary
> > notice this by seeing nfsd processes locking up and then a hung task
> > killer going wild. We finally managed to get a trace last night - its
> > pasted below: 

Thanks for reporting this.  I thought we had fixed this in 3.0.
Before then, when we had a tid wrap, it would result in kjournald
spinning forever.  I suspect this was your "spontaneous reboots" that
you mentioned you mentioned when you were using 2.6.39 --- did you
have a hardware or softward watchdog timer enabled by any chance?

Since we didn't have a good way of reproducing the problem at the
time, I didn't realize that the problem had not been fully fixed;
since while jbd2_log_start_commit() would no longer cause kjournald to
spin forwever, a subsequent call to jbd2_log_wait_commit() with a
stale transaction id would wait for a very long time (possibly until
the heat death of the universe :-)

I think a patch like this should fix things; I've run a stress test
with a hack to increment the transaction id by 1 << 24 after each
commit, to more quickly cause an tid wrap, and the regression tests
seem to be passing without complaint.

						- Ted

>From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 17 Mar 2013 22:24:46 -0400
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
 wraparound

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
---
 fs/ext4/fsync.c      |  3 +--
 fs/ext4/inode.c      |  3 +--
 fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 3278e64..e0ba8a4 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier) {
 		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
 		if (!ret)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b6fab7c..de4b58d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode)
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
 
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ed10991..886ec2f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 }
 
 /*
+ * When this function returns the transaction corresponding to tid
+ * will be completed.  If the transaction has currently running, start
+ * committing that transaction before waiting for it to complete.  If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return jbd2_log_wait_commit(journal, tid);
+}
+EXPORT_SYMBOL(jbd2_complete_transaction);
+
+/*
  * Log buffer allocation routines:
  */
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..f028975 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
 
-- 
1.7.12.rc0.22.gcdd159b




  reply	other threads:[~2013-03-18  2:54 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <B2EC601CDDA242189A46599B31EA6AD3@atlassian.com>
2013-03-16  5:34 ` jbd2 tid wrap seen on NFS server Ben Hutchings
2013-03-18  2:54   ` Theodore Ts'o [this message]
2013-03-18  4:24     ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound George Barnett
2013-03-18  4:53       ` Ben Hutchings
2013-03-18 14:31         ` Theodore Ts'o
2013-03-18 14:34           ` Theodore Ts'o
2013-03-21 20:46     ` Jan Kara
2013-03-21 21:09       ` Theodore Ts'o
2013-03-21 22:41         ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130318025401.GA12611@thunk.org \
    --to=tytso@mit.edu \
    --cc=ben@decadent.org.uk \
    --cc=debian-kernel@lists.debian.org \
    --cc=gbarnett@atlassian.com \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).