jbd2 tid wrap seen on NFS server

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* jbd2 tid wrap seen on NFS server
       [not found] <B2EC601CDDA242189A46599B31EA6AD3@atlassian.com>
@ 2013-03-16  5:34 ` Ben Hutchings
  2013-03-18  2:54   ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Ben Hutchings @ 2013-03-16  5:34 UTC (permalink / raw)
  To: George Barnett; +Cc: linux-ext4, Debian kernel maintainers

[-- Attachment #1: Type: text/plain, Size: 3615 bytes --]

On Sat, 2013-03-16 at 16:12 +1100, George Barnett wrote:
> Hi, 
> 
> We use debian for a number of machines in our storage infrastructure
> and we have recently been seeing a number of "hangs". We primary
> notice this by seeing nfsd processes locking up and then a hung task
> killer going wild. We finally managed to get a trace last night - its
> pasted below: 
> 
> We did not see this crash under 2.6.39 back port however this kernel
> spontaneously rebooted at ~200 days uptime (we had about 3/4 of our
> infra reboot in a few weeks. It was not a good time for our ops
> teams). 
> 
> I would be grateful if anybody who could help me narrow this down
> would jump in and help with requests for further info, or provide
> further advice. 

There was some discussion on the ext4 development list in December about
why this warning can appear and how that should be dealt.  It didn't
seem like this actually got resolved though.  I'm cc'ing that list in
the hope of either reviving that discussion or finding out what the fix
is.

Ben.

> [11309697.466397] ------------[ cut here ]------------ 
> [11309697.466556] WARNING: at /build/buildd-linux_3.2.23-1~bpo60+2-amd64-oLufer/linux-3.2.23/fs/jbd2/journal.c:507 __jbd2_log_start_commit+0x7e/0x8c [jbd2]()
> [11309697.466660] Hardware name: X8DT6
> [11309697.466728] JBD2: bad log_start_commit: 2205591757 2205591757 14613566 0
> [11309697.466798] Modules linked in: netconsole autofs4 8021q garp bridge stp nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding tcp_htcp ext4 jbd2 crc16 configfs loop ohci_hcd tpm_tis tpm i7core_edac i2c_i801 snd_pcm snd_timer snd soundcore edac_core i2c_core ioatdma tpm_bios snd_page_alloc coretemp crc32c_intel psmouse pcspkr joydev evdev acpi_cpufreq mperf processor serio_raw button thermal_sys ext3 jbd mbcache usbhid hid sd_mod ses enclosure crc_t10dif uhci_hcd ahci libahci libata igb ehci_hcd e1000e usbcore dca megaraid_sas usb_common scsi_mod [last unloaded: netconsole]
> [11309697.470190] Pid: 62, comm: kswapd0 Not tainted 3.2.0-0.bpo.3-amd64 #1
> [11309697.470261] Call Trace:
> [11309697.470329] [<ffffffff810498a8>] ? warn_slowpath_common+0x78/0x8c
> [11309697.470399] [<ffffffff8104995a>] ? warn_slowpath_fmt+0x45/0x4a
> [11309697.470471] [<ffffffffa01cabad>] ? __jbd2_log_start_commit+0x7e/0x8c [jbd2]
> [11309697.470558] [<ffffffffa01cac83>] ? jbd2_log_start_commit+0x21/0x2f [jbd2]
> [11309697.470634] [<ffffffffa02dee7a>] ? ext4_evict_inode+0x86/0x2d1 [ext4]
> [11309697.470707] [<ffffffff81119626>] ? evict+0x9a/0x14e
> [11309697.470775] [<ffffffff811198b4>] ? dispose_list+0x35/0x3f
> [11309697.470844] [<ffffffff81119b87>] ? prune_icache_sb+0x2c9/0x2d8
> [11309697.470915] [<ffffffff811081b0>] ? prune_super+0xd6/0x147
> [11309697.470987] [<ffffffff810cb9e2>] ? shrink_slab+0x1a3/0x266
> [11309697.471056] [<ffffffff810cd937>] ? balance_pgdat+0x335/0x625
> [11309697.471126] [<ffffffff810cdf31>] ? kswapd+0x30a/0x325
> [11309697.471196] [<ffffffff81063815>] ? wake_up_bit+0x20/0x20
> [11309697.471265] [<ffffffff810cdc27>] ? balance_pgdat+0x625/0x625
> [11309697.471334] [<ffffffff810cdc27>] ? balance_pgdat+0x625/0x625
> [11309697.471403] [<ffffffff810633d9>] ? kthread+0x7a/0x82
> [11309697.471472] [<ffffffff8136d3f4>] ? kernel_thread_helper+0x4/0x10
> [11309697.471543] [<ffffffff8106335f>] ? kthread_worker_fn+0x147/0x147
> [11309697.471613] [<ffffffff8136d3f0>] ? gs_change+0x13/0x13
> [11309697.471680] ---[ end trace 56d2be5ea52d0917 ]---

-- 
Ben Hutchings
It is easier to change the specification to fit the program than vice versa.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-16  5:34 ` jbd2 tid wrap seen on NFS server Ben Hutchings
@ 2013-03-18  2:54   ` Theodore Ts'o
  2013-03-18  4:24     ` George Barnett
  2013-03-21 20:46     ` Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Theodore Ts'o @ 2013-03-18  2:54 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: George Barnett, linux-ext4, Debian kernel maintainers

On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote:
> > We use debian for a number of machines in our storage infrastructure
> > and we have recently been seeing a number of "hangs". We primary
> > notice this by seeing nfsd processes locking up and then a hung task
> > killer going wild. We finally managed to get a trace last night - its
> > pasted below: 

Thanks for reporting this.  I thought we had fixed this in 3.0.
Before then, when we had a tid wrap, it would result in kjournald
spinning forever.  I suspect this was your "spontaneous reboots" that
you mentioned you mentioned when you were using 2.6.39 --- did you
have a hardware or softward watchdog timer enabled by any chance?

Since we didn't have a good way of reproducing the problem at the
time, I didn't realize that the problem had not been fully fixed;
since while jbd2_log_start_commit() would no longer cause kjournald to
spin forwever, a subsequent call to jbd2_log_wait_commit() with a
stale transaction id would wait for a very long time (possibly until
the heat death of the universe :-)

I think a patch like this should fix things; I've run a stress test
with a hack to increment the transaction id by 1 << 24 after each
commit, to more quickly cause an tid wrap, and the regression tests
seem to be passing without complaint.

						- Ted

>From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 17 Mar 2013 22:24:46 -0400
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
 wraparound

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
---
 fs/ext4/fsync.c      |  3 +--
 fs/ext4/inode.c      |  3 +--
 fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 3278e64..e0ba8a4 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier) {
 		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
 		if (!ret)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b6fab7c..de4b58d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode)
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
 
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ed10991..886ec2f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 }
 
 /*
+ * When this function returns the transaction corresponding to tid
+ * will be completed.  If the transaction has currently running, start
+ * committing that transaction before waiting for it to complete.  If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return jbd2_log_wait_commit(journal, tid);
+}
+EXPORT_SYMBOL(jbd2_complete_transaction);
+
+/*
  * Log buffer allocation routines:
  */
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..f028975 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
 
-- 
1.7.12.rc0.22.gcdd159b




^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-18  2:54   ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o
@ 2013-03-18  4:24     ` George Barnett
  2013-03-18  4:53       ` Ben Hutchings
  2013-03-21 20:46     ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: George Barnett @ 2013-03-18  4:24 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ben Hutchings, linux-ext4, Debian kernel maintainers

[-- Attachment #1: Type: text/plain, Size: 2107 bytes --]

On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote:
> Thanks for reporting this. I thought we had fixed this in 3.0.
> Before then, when we had a tid wrap, it would result in kjournald
> spinning forever. I suspect this was your "spontaneous reboots" that
> you mentioned you mentioned when you were using 2.6.39 --- did you
> have a hardware or softward watchdog timer enabled by any chance?
> 
> 

Thank you for your prompt attention on this.  It's greatly appreciated!

We believe our previous spontaneous reboots were caused by https://bugzilla.kernel.org/show_bug.cgi?id=16991 which was resolved by our move to a 3.2 kernel (we were on a 2.6.38-bpo kernel ^1).  We do not presently use any watchdogs.
> Since we didn't have a good way of reproducing the problem at the
> time, I didn't realize that the problem had not been fully fixed;
> since while jbd2_log_start_commit() would no longer cause kjournald to
> spin forwever, a subsequent call to jbd2_log_wait_commit() with a
> stale transaction id would wait for a very long time (possibly until
> the heat death of the universe :-)
> 
> 

This would mirror what we've seen, although our ops guys haven't been waiting around for any universes to die :)
> I think a patch like this should fix things; I've run a stress test
> with a hack to increment the transaction id by 1 << 24 after each
> commit, to more quickly cause an tid wrap, and the regression tests
> seem to be passing without complaint.
> 
> 

Excellent news.  Again, thank you for your help in this regard.

@Ben - could you let me know what your preferred course of action would be here?  As I'm sure you can understand, I do not wish to maintain a forked kernel from Debian upstream.  Is this something you would be prepared to integrate into the 3.2 BPO kernels?

Best regards,

George

1. We moved to 2.6.38 in order to get access to the packet steering patches which were put into the kernel in ~.33 or .34 from memory.  This gave us quite a nice performance bump to our storage speed and hence we didn't want to lose it by going back to .32 to get the 200d uptime bug fix.

[-- Attachment #2: Type: text/html, Size: 2981 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-18  4:24     ` George Barnett
@ 2013-03-18  4:53       ` Ben Hutchings
  2013-03-18 14:31         ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Ben Hutchings @ 2013-03-18  4:53 UTC (permalink / raw)
  To: George Barnett; +Cc: Theodore Ts'o, linux-ext4, Debian kernel maintainers


[-- Attachment #1.1: Type: text/plain, Size: 1346 bytes --]

On Mon, 2013-03-18 at 15:24 +1100, George Barnett wrote:
> On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote:
[...]

> > I think a patch like this should fix things; I've run a stress test
> > with a hack to increment the transaction id by 1 << 24 after each
> > commit, to more quickly cause an tid wrap, and the regression tests
> > seem to be passing without complaint.
> Excellent news.  Again, thank you for your help in this regard.
> 
> 
> @Ben - could you let me know what your preferred course of action
> would be here?  As I'm sure you can understand, I do not wish to
> maintain a forked kernel from Debian upstream.  Is this something you
> would be prepared to integrate into the 3.2 BPO kernels?
[...]

We need you to verify that this fix works first.  If it does, it should
get included in the various 3.x.y stable branches and in Debian kernel
packages.

Ted might prefer you to test this against current mainline (3.9-rc3),
though you may find it easier to test against the version you already
have.  In the latter case you'll need a slightly modified version of the
patch (attached) and instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>.

Ben.

-- 
Ben Hutchings
Never attribute to conspiracy what can adequately be explained by stupidity.

[-- Attachment #1.2: ext4-jbd2-don-t-wait-forever-for-stale-tid-caused-by-wraparound.patch --]
[-- Type: text/x-patch, Size: 4527 bytes --]

From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 17 Mar 2013 22:24:46 -0400
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
 wraparound

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 fs/ext4/fsync.c      |  3 +--
 fs/ext4/inode.c      |  3 +--
 fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -260,8 +260,7 @@ int ext4_sync_file(struct file *file, lo
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier)
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
  out:
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -146,8 +146,7 @@ void ext4_evict_inode(struct inode *inod
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
 
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -663,6 +663,37 @@ int jbd2_log_wait_commit(journal_t *jour
 }
 
 /*
+ * When this function returns the transaction corresponding to tid
+ * will be completed.  If the transaction has currently running, start
+ * committing that transaction before waiting for it to complete.  If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return jbd2_log_wait_commit(journal, tid);
+}
+EXPORT_SYMBOL(jbd2_complete_transaction);
+
+/*
  * Log buffer allocation routines:
  */
 
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1165,6 +1165,7 @@ int __jbd2_log_start_commit(journal_t *j
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-18  4:53       ` Ben Hutchings
@ 2013-03-18 14:31         ` Theodore Ts'o
  2013-03-18 14:34           ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-03-18 14:31 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: George Barnett, linux-ext4, Debian kernel maintainers

On Mon, Mar 18, 2013 at 04:53:15AM +0000, Ben Hutchings wrote:
> 
> We need you to verify that this fix works first.  If it does, it should
> get included in the various 3.x.y stable branches and in Debian kernel
> packages.

I suspect it will be hard for George to verify this, since it requires
a tid wrap, which by definition takes a long time.

Also, this will probably not hit the stable kernels until after the
next merge window, since it's already post -rc3 and I really want to
make sure this gets a lot of testing and review.

I'll also note that I managed to trigger a BUG when incrementing by a
factor of (1 << 24), but we don't see a BUG_ON when incrementing by
((1 << 24) + 1).  (Neither of these testing changes were in the patch
that I sent out; so the patch is "safe" in that I very much doubt it
will make things worse --- those changes were to stress test the patch
so that I wouldn't have to wait several months until the tid wrapped
to test whether we had finally fixed all of the potential problems.)
So there is something we probably do want to look at a bit more
closely before we formally push this fix into mainline.

As far as the Debian servers are concerned, I'm pretty sure the patch
should be safe in that it won't make things worse than they were
before --- however, if you are looking for the lowest risk approach,
it's probably better to simply schedule downtime every few months and
force a reboot at a time that is minimizes developer inconvenience.
You can use "dumpe2fs -h /dev/XXX" to get the current sequence number
of the journal, if you measure the sequence number separated by 24 or
48 hours, you should be able to calculate when the the sequence number
will have incremented by 2**31, and thus calculate the frequency of
scheduled reboots for your workload.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-18 14:31         ` Theodore Ts'o
@ 2013-03-18 14:34           ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2013-03-18 14:34 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: George Barnett, linux-ext4, Debian kernel maintainers

One other observeration; problems relating to tid wrap have been
around for a very long time.  It's an issue for both ext3 and ext4, so
Red Hat and SuSE have been shipping QA'ed enterprise kernels with this
issue for over a decade.  This is one of the reasons why I don't think
it's a good idea to rush a fix into mainline, the stable 3.2.x
kernels, or the Debian kernel, until we really make sure we have any
proposed bug fixes fully tested and verified.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-18  2:54   ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o
  2013-03-18  4:24     ` George Barnett
@ 2013-03-21 20:46     ` Jan Kara
  2013-03-21 21:09       ` Theodore Ts'o
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2013-03-21 20:46 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ben Hutchings, George Barnett, linux-ext4,
	Debian kernel maintainers

On Sun 17-03-13 22:54:01, Ted Tso wrote:
> On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote:
> > > We use debian for a number of machines in our storage infrastructure
> > > and we have recently been seeing a number of "hangs". We primary
> > > notice this by seeing nfsd processes locking up and then a hung task
> > > killer going wild. We finally managed to get a trace last night - its
> > > pasted below: 
> 
> Thanks for reporting this.  I thought we had fixed this in 3.0.
> Before then, when we had a tid wrap, it would result in kjournald
> spinning forever.  I suspect this was your "spontaneous reboots" that
> you mentioned you mentioned when you were using 2.6.39 --- did you
> have a hardware or softward watchdog timer enabled by any chance?
> 
> Since we didn't have a good way of reproducing the problem at the
> time, I didn't realize that the problem had not been fully fixed;
> since while jbd2_log_start_commit() would no longer cause kjournald to
> spin forwever, a subsequent call to jbd2_log_wait_commit() with a
> stale transaction id would wait for a very long time (possibly until
> the heat death of the universe :-)
> 
> I think a patch like this should fix things; I've run a stress test
> with a hack to increment the transaction id by 1 << 24 after each
> commit, to more quickly cause an tid wrap, and the regression tests
> seem to be passing without complaint.
> 
> 						- Ted
> 
> From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Sun, 17 Mar 2013 22:24:46 -0400
> Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
>  wraparound
> 
> In the case where an inode has a very stale transaction id (tid) in
> i_datasync_tid or i_sync_tid, it's possible that after a very large
> (2**31) number of transactions, that the tid number space might wrap,
> causing tid_geq()'s calculations to fail.
> 
> Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
> by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
> attempted to fix this problem, but it only avoided kjournald spinning
> forever by fixing the logic in jbd2_log_start_commit().
> 
> Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
> that might call jbd2_log_start_commit() with a stale tid, those
> functions will subsequently call jbd2_log_wait_commit() with the same
> stale tid, and then wait for a very long time.  To fix this, we
> replace the calls to jbd2_log_start_commit() and
> jbd2_log_wait_commit() with a call to a new function,
> jbd2_complete_transaction(), which will correctly handle stale tid's.
> 
> As a bonus, jbd2_complete_transaction() will avoid locking
> j_state_lock for writing unless a commit needs to be started.  This
> should have a small (but probably not measurable) improvement for
> ext4's scalability.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Reported-by: Ben Hutchings <ben@decadent.org.uk>
> Reported-by: George Barnett <gbarnett@atlassian.com>
> Cc: stable@vger.kernel.org
  Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
inventing new function? So jbd2_log_wait_commit() would do something like:

			__func__, journal->j_commit_request, tid);
	}
#endif
+	/* Not running or committing trans => must be already committed. */
+	if (!((journal->j_running_transaction &&
+	     journal->j_running_transaction->t_tid == tid) ||
+	    (journal->j_committing_transaction &&
+	     journal->j_committing_transaction->t_tid == tid))) {
+		read_unlock(&journal->j_state_lock);
+		return 0;
+	}
	while (tid_gt(tid, journal->j_commit_sequence)) {
		jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
				  tid, journal->j_commit_sequence);
 
								Honza

> ---
>  fs/ext4/fsync.c      |  3 +--
>  fs/ext4/inode.c      |  3 +--
>  fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
>  include/linux/jbd2.h |  1 +
>  4 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 3278e64..e0ba8a4 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	if (journal->j_flags & JBD2_BARRIER &&
>  	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
>  		needs_barrier = true;
> -	jbd2_log_start_commit(journal, commit_tid);
> -	ret = jbd2_log_wait_commit(journal, commit_tid);
> +	ret = jbd2_complete_transaction(journal, commit_tid);
>  	if (needs_barrier) {
>  		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>  		if (!ret)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b6fab7c..de4b58d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode)
>  			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>  			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
>  
> -			jbd2_log_start_commit(journal, commit_tid);
> -			jbd2_log_wait_commit(journal, commit_tid);
> +			jbd2_complete_transaction(journal, commit_tid);
>  			filemap_write_and_wait(&inode->i_data);
>  		}
>  		truncate_inode_pages(&inode->i_data, 0);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index ed10991..886ec2f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>  }
>  
>  /*
> + * When this function returns the transaction corresponding to tid
> + * will be completed.  If the transaction has currently running, start
> + * committing that transaction before waiting for it to complete.  If
> + * the transaction id is stale, it is by definition already completed,
> + * so just return SUCCESS.
> + */
> +int jbd2_complete_transaction(journal_t *journal, tid_t tid)
> +{
> +	int	need_to_wait = 1;
> +
> +	read_lock(&journal->j_state_lock);
> +	if (journal->j_running_transaction &&
> +	    journal->j_running_transaction->t_tid == tid) {
> +		if (journal->j_commit_request != tid) {
> +			/* transaction not yet started, so request it */
> +			read_unlock(&journal->j_state_lock);
> +			jbd2_log_start_commit(journal, tid);
> +			goto wait_commit;
> +		}
> +	} else if (!(journal->j_committing_transaction &&
> +		     journal->j_committing_transaction->t_tid == tid))
> +		need_to_wait = 0;
> +	read_unlock(&journal->j_state_lock);
> +	if (!need_to_wait)
> +		return 0;
> +wait_commit:
> +	return jbd2_log_wait_commit(journal, tid);
> +}
> +EXPORT_SYMBOL(jbd2_complete_transaction);
> +
> +/*
>   * Log buffer allocation routines:
>   */
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 50e5a5e..f028975 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
>  int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
>  int jbd2_journal_force_commit_nested(journal_t *journal);
>  int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
> +int jbd2_complete_transaction(journal_t *journal, tid_t tid);
>  int jbd2_log_do_checkpoint(journal_t *journal);
>  int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
>  
> -- 
> 1.7.12.rc0.22.gcdd159b
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-21 20:46     ` Jan Kara
@ 2013-03-21 21:09       ` Theodore Ts'o
  2013-03-21 22:41         ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-03-21 21:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ben Hutchings, George Barnett, linux-ext4,
	Debian kernel maintainers

On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote:
>   Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
> inventing new function?

In most of the places where we call jbd2_log_start_commit(), we're
actually starting the current running transaction.  So the fact that
we pass in a tid, and we're having to validate that the tid is
actually a valid one, is a bit of a waste.  So in the long run I think
it's worth rethinking whether or not jbd2_log_{start,wait}_commit()
should exist in their current form, or whether we should reorganize
their functionality (i.e., by having a jbd2_start_running_commit(),
for example.).  Piling on fixes to jbd2_log_wait_commit() would make
it get even more complicated, and I think if we separate out the
various ways in which we use these functions, we can make the code
simpler and easier to read.

In fact, I had started making this rather large set of changes when I
decided it would be better to save that kind of wholesale refactoring
for the next merge window.  So the reason why I ended up fixing the
patch the way I did was to keep things simple.

Also as I mentioned in the commit description, by using a single
function I was also able to optimize the locking the locking somewhat.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
  2013-03-21 21:09       ` Theodore Ts'o
@ 2013-03-21 22:41         ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2013-03-21 22:41 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Ben Hutchings, George Barnett, linux-ext4,
	Debian kernel maintainers

On Thu 21-03-13 17:09:40, Ted Tso wrote:
> On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote:
> >   Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
> > inventing new function?
> 
> In most of the places where we call jbd2_log_start_commit(), we're
> actually starting the current running transaction.  So the fact that
> we pass in a tid, and we're having to validate that the tid is
> actually a valid one, is a bit of a waste.  So in the long run I think
> it's worth rethinking whether or not jbd2_log_{start,wait}_commit()
> should exist in their current form, or whether we should reorganize
> their functionality (i.e., by having a jbd2_start_running_commit(),
> for example.).  Piling on fixes to jbd2_log_wait_commit() would make
> it get even more complicated, and I think if we separate out the
> various ways in which we use these functions, we can make the code
> simpler and easier to read.
  I don't find jbd2_log_wait_commit() that complex that it couldn't bear
another if :) But given there are really two waiting operations that make
sense:
a) request commit of running transaction and wait for it
b) wait for committing transaction

then I agree there may be a better interface. OTOH I'm somewhat
curious about the new interface because the only race-free way of
identifying a transaction you want to wait for is using its tid.

> In fact, I had started making this rather large set of changes when I
> decided it would be better to save that kind of wholesale refactoring
> for the next merge window.  So the reason why I ended up fixing the
> patch the way I did was to keep things simple.
> 
> Also as I mentioned in the commit description, by using a single
> function I was also able to optimize the locking the locking somewhat.
  Yeah. I'm not as much opposed to the new function doing start commit
& wait but what I dislike is the fact that we have still exposed the
function jbd2_log_wait_commit() which can possibly lockup if tid overflows.
I agree there aren't currently any other callers where this could happen
but in a few years who knows...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-03-21 22:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <B2EC601CDDA242189A46599B31EA6AD3@atlassian.com>
2013-03-16  5:34 ` jbd2 tid wrap seen on NFS server Ben Hutchings
2013-03-18  2:54   ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o
2013-03-18  4:24     ` George Barnett
2013-03-18  4:53       ` Ben Hutchings
2013-03-18 14:31         ` Theodore Ts'o
2013-03-18 14:34           ` Theodore Ts'o
2013-03-21 20:46     ` Jan Kara
2013-03-21 21:09       ` Theodore Ts'o
2013-03-21 22:41         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).