* jbd2 tid wrap seen on NFS server [not found] <B2EC601CDDA242189A46599B31EA6AD3@atlassian.com> @ 2013-03-16 5:34 ` Ben Hutchings 2013-03-18 2:54 ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o 0 siblings, 1 reply; 9+ messages in thread From: Ben Hutchings @ 2013-03-16 5:34 UTC (permalink / raw) To: George Barnett; +Cc: linux-ext4, Debian kernel maintainers [-- Attachment #1: Type: text/plain, Size: 3615 bytes --] On Sat, 2013-03-16 at 16:12 +1100, George Barnett wrote: > Hi, > > We use debian for a number of machines in our storage infrastructure > and we have recently been seeing a number of "hangs". We primary > notice this by seeing nfsd processes locking up and then a hung task > killer going wild. We finally managed to get a trace last night - its > pasted below: > > We did not see this crash under 2.6.39 back port however this kernel > spontaneously rebooted at ~200 days uptime (we had about 3/4 of our > infra reboot in a few weeks. It was not a good time for our ops > teams). > > I would be grateful if anybody who could help me narrow this down > would jump in and help with requests for further info, or provide > further advice. There was some discussion on the ext4 development list in December about why this warning can appear and how that should be dealt. It didn't seem like this actually got resolved though. I'm cc'ing that list in the hope of either reviving that discussion or finding out what the fix is. Ben. > [11309697.466397] ------------[ cut here ]------------ > [11309697.466556] WARNING: at /build/buildd-linux_3.2.23-1~bpo60+2-amd64-oLufer/linux-3.2.23/fs/jbd2/journal.c:507 __jbd2_log_start_commit+0x7e/0x8c [jbd2]() > [11309697.466660] Hardware name: X8DT6 > [11309697.466728] JBD2: bad log_start_commit: 2205591757 2205591757 14613566 0 > [11309697.466798] Modules linked in: netconsole autofs4 8021q garp bridge stp nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding tcp_htcp ext4 jbd2 crc16 configfs loop ohci_hcd tpm_tis tpm i7core_edac i2c_i801 snd_pcm snd_timer snd soundcore edac_core i2c_core ioatdma tpm_bios snd_page_alloc coretemp crc32c_intel psmouse pcspkr joydev evdev acpi_cpufreq mperf processor serio_raw button thermal_sys ext3 jbd mbcache usbhid hid sd_mod ses enclosure crc_t10dif uhci_hcd ahci libahci libata igb ehci_hcd e1000e usbcore dca megaraid_sas usb_common scsi_mod [last unloaded: netconsole] > [11309697.470190] Pid: 62, comm: kswapd0 Not tainted 3.2.0-0.bpo.3-amd64 #1 > [11309697.470261] Call Trace: > [11309697.470329] [<ffffffff810498a8>] ? warn_slowpath_common+0x78/0x8c > [11309697.470399] [<ffffffff8104995a>] ? warn_slowpath_fmt+0x45/0x4a > [11309697.470471] [<ffffffffa01cabad>] ? __jbd2_log_start_commit+0x7e/0x8c [jbd2] > [11309697.470558] [<ffffffffa01cac83>] ? jbd2_log_start_commit+0x21/0x2f [jbd2] > [11309697.470634] [<ffffffffa02dee7a>] ? ext4_evict_inode+0x86/0x2d1 [ext4] > [11309697.470707] [<ffffffff81119626>] ? evict+0x9a/0x14e > [11309697.470775] [<ffffffff811198b4>] ? dispose_list+0x35/0x3f > [11309697.470844] [<ffffffff81119b87>] ? prune_icache_sb+0x2c9/0x2d8 > [11309697.470915] [<ffffffff811081b0>] ? prune_super+0xd6/0x147 > [11309697.470987] [<ffffffff810cb9e2>] ? shrink_slab+0x1a3/0x266 > [11309697.471056] [<ffffffff810cd937>] ? balance_pgdat+0x335/0x625 > [11309697.471126] [<ffffffff810cdf31>] ? kswapd+0x30a/0x325 > [11309697.471196] [<ffffffff81063815>] ? wake_up_bit+0x20/0x20 > [11309697.471265] [<ffffffff810cdc27>] ? balance_pgdat+0x625/0x625 > [11309697.471334] [<ffffffff810cdc27>] ? balance_pgdat+0x625/0x625 > [11309697.471403] [<ffffffff810633d9>] ? kthread+0x7a/0x82 > [11309697.471472] [<ffffffff8136d3f4>] ? kernel_thread_helper+0x4/0x10 > [11309697.471543] [<ffffffff8106335f>] ? kthread_worker_fn+0x147/0x147 > [11309697.471613] [<ffffffff8136d3f0>] ? gs_change+0x13/0x13 > [11309697.471680] ---[ end trace 56d2be5ea52d0917 ]--- -- Ben Hutchings It is easier to change the specification to fit the program than vice versa. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-16 5:34 ` jbd2 tid wrap seen on NFS server Ben Hutchings @ 2013-03-18 2:54 ` Theodore Ts'o 2013-03-18 4:24 ` George Barnett 2013-03-21 20:46 ` Jan Kara 0 siblings, 2 replies; 9+ messages in thread From: Theodore Ts'o @ 2013-03-18 2:54 UTC (permalink / raw) To: Ben Hutchings; +Cc: George Barnett, linux-ext4, Debian kernel maintainers On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote: > > We use debian for a number of machines in our storage infrastructure > > and we have recently been seeing a number of "hangs". We primary > > notice this by seeing nfsd processes locking up and then a hung task > > killer going wild. We finally managed to get a trace last night - its > > pasted below: Thanks for reporting this. I thought we had fixed this in 3.0. Before then, when we had a tid wrap, it would result in kjournald spinning forever. I suspect this was your "spontaneous reboots" that you mentioned you mentioned when you were using 2.6.39 --- did you have a hardware or softward watchdog timer enabled by any chance? Since we didn't have a good way of reproducing the problem at the time, I didn't realize that the problem had not been fully fixed; since while jbd2_log_start_commit() would no longer cause kjournald to spin forwever, a subsequent call to jbd2_log_wait_commit() with a stale transaction id would wait for a very long time (possibly until the heat death of the universe :-) I think a patch like this should fix things; I've run a stress test with a hack to increment the transaction id by 1 << 24 after each commit, to more quickly cause an tid wrap, and the regression tests seem to be passing without complaint. - Ted >From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001 From: Theodore Ts'o <tytso@mit.edu> Date: Sun, 17 Mar 2013 22:24:46 -0400 Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> Cc: stable@vger.kernel.org --- fs/ext4/fsync.c | 3 +-- fs/ext4/inode.c | 3 +-- fs/jbd2/journal.c | 31 +++++++++++++++++++++++++++++++ include/linux/jbd2.h | 1 + 4 files changed, 34 insertions(+), 4 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 3278e64..e0ba8a4 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) if (journal->j_flags & JBD2_BARRIER && !jbd2_trans_will_send_data_barrier(journal, commit_tid)) needs_barrier = true; - jbd2_log_start_commit(journal, commit_tid); - ret = jbd2_log_wait_commit(journal, commit_tid); + ret = jbd2_complete_transaction(journal, commit_tid); if (needs_barrier) { err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); if (!ret) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b6fab7c..de4b58d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode) journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; - jbd2_log_start_commit(journal, commit_tid); - jbd2_log_wait_commit(journal, commit_tid); + jbd2_complete_transaction(journal, commit_tid); filemap_write_and_wait(&inode->i_data); } truncate_inode_pages(&inode->i_data, 0); diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index ed10991..886ec2f 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid) } /* + * When this function returns the transaction corresponding to tid + * will be completed. If the transaction has currently running, start + * committing that transaction before waiting for it to complete. If + * the transaction id is stale, it is by definition already completed, + * so just return SUCCESS. + */ +int jbd2_complete_transaction(journal_t *journal, tid_t tid) +{ + int need_to_wait = 1; + + read_lock(&journal->j_state_lock); + if (journal->j_running_transaction && + journal->j_running_transaction->t_tid == tid) { + if (journal->j_commit_request != tid) { + /* transaction not yet started, so request it */ + read_unlock(&journal->j_state_lock); + jbd2_log_start_commit(journal, tid); + goto wait_commit; + } + } else if (!(journal->j_committing_transaction && + journal->j_committing_transaction->t_tid == tid)) + need_to_wait = 0; + read_unlock(&journal->j_state_lock); + if (!need_to_wait) + return 0; +wait_commit: + return jbd2_log_wait_commit(journal, tid); +} +EXPORT_SYMBOL(jbd2_complete_transaction); + +/* * Log buffer allocation routines: */ diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 50e5a5e..f028975 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid); int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); int jbd2_journal_force_commit_nested(journal_t *journal); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); +int jbd2_complete_transaction(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid); -- 1.7.12.rc0.22.gcdd159b ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-18 2:54 ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o @ 2013-03-18 4:24 ` George Barnett 2013-03-18 4:53 ` Ben Hutchings 2013-03-21 20:46 ` Jan Kara 1 sibling, 1 reply; 9+ messages in thread From: George Barnett @ 2013-03-18 4:24 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Ben Hutchings, linux-ext4, Debian kernel maintainers [-- Attachment #1: Type: text/plain, Size: 2107 bytes --] On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote: > Thanks for reporting this. I thought we had fixed this in 3.0. > Before then, when we had a tid wrap, it would result in kjournald > spinning forever. I suspect this was your "spontaneous reboots" that > you mentioned you mentioned when you were using 2.6.39 --- did you > have a hardware or softward watchdog timer enabled by any chance? > > Thank you for your prompt attention on this. It's greatly appreciated! We believe our previous spontaneous reboots were caused by https://bugzilla.kernel.org/show_bug.cgi?id=16991 which was resolved by our move to a 3.2 kernel (we were on a 2.6.38-bpo kernel ^1). We do not presently use any watchdogs. > Since we didn't have a good way of reproducing the problem at the > time, I didn't realize that the problem had not been fully fixed; > since while jbd2_log_start_commit() would no longer cause kjournald to > spin forwever, a subsequent call to jbd2_log_wait_commit() with a > stale transaction id would wait for a very long time (possibly until > the heat death of the universe :-) > > This would mirror what we've seen, although our ops guys haven't been waiting around for any universes to die :) > I think a patch like this should fix things; I've run a stress test > with a hack to increment the transaction id by 1 << 24 after each > commit, to more quickly cause an tid wrap, and the regression tests > seem to be passing without complaint. > > Excellent news. Again, thank you for your help in this regard. @Ben - could you let me know what your preferred course of action would be here? As I'm sure you can understand, I do not wish to maintain a forked kernel from Debian upstream. Is this something you would be prepared to integrate into the 3.2 BPO kernels? Best regards, George 1. We moved to 2.6.38 in order to get access to the packet steering patches which were put into the kernel in ~.33 or .34 from memory. This gave us quite a nice performance bump to our storage speed and hence we didn't want to lose it by going back to .32 to get the 200d uptime bug fix. [-- Attachment #2: Type: text/html, Size: 2981 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-18 4:24 ` George Barnett @ 2013-03-18 4:53 ` Ben Hutchings 2013-03-18 14:31 ` Theodore Ts'o 0 siblings, 1 reply; 9+ messages in thread From: Ben Hutchings @ 2013-03-18 4:53 UTC (permalink / raw) To: George Barnett; +Cc: Theodore Ts'o, linux-ext4, Debian kernel maintainers [-- Attachment #1.1: Type: text/plain, Size: 1346 bytes --] On Mon, 2013-03-18 at 15:24 +1100, George Barnett wrote: > On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote: [...] > > I think a patch like this should fix things; I've run a stress test > > with a hack to increment the transaction id by 1 << 24 after each > > commit, to more quickly cause an tid wrap, and the regression tests > > seem to be passing without complaint. > Excellent news. Again, thank you for your help in this regard. > > > @Ben - could you let me know what your preferred course of action > would be here? As I'm sure you can understand, I do not wish to > maintain a forked kernel from Debian upstream. Is this something you > would be prepared to integrate into the 3.2 BPO kernels? [...] We need you to verify that this fix works first. If it does, it should get included in the various 3.x.y stable branches and in Debian kernel packages. Ted might prefer you to test this against current mainline (3.9-rc3), though you may find it easier to test against the version you already have. In the latter case you'll need a slightly modified version of the patch (attached) and instructions at <http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>. Ben. -- Ben Hutchings Never attribute to conspiracy what can adequately be explained by stupidity. [-- Attachment #1.2: ext4-jbd2-don-t-wait-forever-for-stale-tid-caused-by-wraparound.patch --] [-- Type: text/x-patch, Size: 4527 bytes --] From: Theodore Ts'o <tytso@mit.edu> Date: Sun, 17 Mar 2013 22:24:46 -0400 Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> --- fs/ext4/fsync.c | 3 +-- fs/ext4/inode.c | 3 +-- fs/jbd2/journal.c | 31 +++++++++++++++++++++++++++++++ include/linux/jbd2.h | 1 + 4 files changed, 34 insertions(+), 4 deletions(-) --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -260,8 +260,7 @@ int ext4_sync_file(struct file *file, lo if (journal->j_flags & JBD2_BARRIER && !jbd2_trans_will_send_data_barrier(journal, commit_tid)) needs_barrier = true; - jbd2_log_start_commit(journal, commit_tid); - ret = jbd2_log_wait_commit(journal, commit_tid); + ret = jbd2_complete_transaction(journal, commit_tid); if (needs_barrier) blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); out: --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -146,8 +146,7 @@ void ext4_evict_inode(struct inode *inod journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; - jbd2_log_start_commit(journal, commit_tid); - jbd2_log_wait_commit(journal, commit_tid); + jbd2_complete_transaction(journal, commit_tid); filemap_write_and_wait(&inode->i_data); } truncate_inode_pages(&inode->i_data, 0); --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -663,6 +663,37 @@ int jbd2_log_wait_commit(journal_t *jour } /* + * When this function returns the transaction corresponding to tid + * will be completed. If the transaction has currently running, start + * committing that transaction before waiting for it to complete. If + * the transaction id is stale, it is by definition already completed, + * so just return SUCCESS. + */ +int jbd2_complete_transaction(journal_t *journal, tid_t tid) +{ + int need_to_wait = 1; + + read_lock(&journal->j_state_lock); + if (journal->j_running_transaction && + journal->j_running_transaction->t_tid == tid) { + if (journal->j_commit_request != tid) { + /* transaction not yet started, so request it */ + read_unlock(&journal->j_state_lock); + jbd2_log_start_commit(journal, tid); + goto wait_commit; + } + } else if (!(journal->j_committing_transaction && + journal->j_committing_transaction->t_tid == tid)) + need_to_wait = 0; + read_unlock(&journal->j_state_lock); + if (!need_to_wait) + return 0; +wait_commit: + return jbd2_log_wait_commit(journal, tid); +} +EXPORT_SYMBOL(jbd2_complete_transaction); + +/* * Log buffer allocation routines: */ --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -1165,6 +1165,7 @@ int __jbd2_log_start_commit(journal_t *j int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); int jbd2_journal_force_commit_nested(journal_t *journal); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); +int jbd2_complete_transaction(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid); [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-18 4:53 ` Ben Hutchings @ 2013-03-18 14:31 ` Theodore Ts'o 2013-03-18 14:34 ` Theodore Ts'o 0 siblings, 1 reply; 9+ messages in thread From: Theodore Ts'o @ 2013-03-18 14:31 UTC (permalink / raw) To: Ben Hutchings; +Cc: George Barnett, linux-ext4, Debian kernel maintainers On Mon, Mar 18, 2013 at 04:53:15AM +0000, Ben Hutchings wrote: > > We need you to verify that this fix works first. If it does, it should > get included in the various 3.x.y stable branches and in Debian kernel > packages. I suspect it will be hard for George to verify this, since it requires a tid wrap, which by definition takes a long time. Also, this will probably not hit the stable kernels until after the next merge window, since it's already post -rc3 and I really want to make sure this gets a lot of testing and review. I'll also note that I managed to trigger a BUG when incrementing by a factor of (1 << 24), but we don't see a BUG_ON when incrementing by ((1 << 24) + 1). (Neither of these testing changes were in the patch that I sent out; so the patch is "safe" in that I very much doubt it will make things worse --- those changes were to stress test the patch so that I wouldn't have to wait several months until the tid wrapped to test whether we had finally fixed all of the potential problems.) So there is something we probably do want to look at a bit more closely before we formally push this fix into mainline. As far as the Debian servers are concerned, I'm pretty sure the patch should be safe in that it won't make things worse than they were before --- however, if you are looking for the lowest risk approach, it's probably better to simply schedule downtime every few months and force a reboot at a time that is minimizes developer inconvenience. You can use "dumpe2fs -h /dev/XXX" to get the current sequence number of the journal, if you measure the sequence number separated by 24 or 48 hours, you should be able to calculate when the the sequence number will have incremented by 2**31, and thus calculate the frequency of scheduled reboots for your workload. Regards, - Ted ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-18 14:31 ` Theodore Ts'o @ 2013-03-18 14:34 ` Theodore Ts'o 0 siblings, 0 replies; 9+ messages in thread From: Theodore Ts'o @ 2013-03-18 14:34 UTC (permalink / raw) To: Ben Hutchings; +Cc: George Barnett, linux-ext4, Debian kernel maintainers One other observeration; problems relating to tid wrap have been around for a very long time. It's an issue for both ext3 and ext4, so Red Hat and SuSE have been shipping QA'ed enterprise kernels with this issue for over a decade. This is one of the reasons why I don't think it's a good idea to rush a fix into mainline, the stable 3.2.x kernels, or the Debian kernel, until we really make sure we have any proposed bug fixes fully tested and verified. - Ted ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-18 2:54 ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o 2013-03-18 4:24 ` George Barnett @ 2013-03-21 20:46 ` Jan Kara 2013-03-21 21:09 ` Theodore Ts'o 1 sibling, 1 reply; 9+ messages in thread From: Jan Kara @ 2013-03-21 20:46 UTC (permalink / raw) To: Theodore Ts'o Cc: Ben Hutchings, George Barnett, linux-ext4, Debian kernel maintainers On Sun 17-03-13 22:54:01, Ted Tso wrote: > On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote: > > > We use debian for a number of machines in our storage infrastructure > > > and we have recently been seeing a number of "hangs". We primary > > > notice this by seeing nfsd processes locking up and then a hung task > > > killer going wild. We finally managed to get a trace last night - its > > > pasted below: > > Thanks for reporting this. I thought we had fixed this in 3.0. > Before then, when we had a tid wrap, it would result in kjournald > spinning forever. I suspect this was your "spontaneous reboots" that > you mentioned you mentioned when you were using 2.6.39 --- did you > have a hardware or softward watchdog timer enabled by any chance? > > Since we didn't have a good way of reproducing the problem at the > time, I didn't realize that the problem had not been fully fixed; > since while jbd2_log_start_commit() would no longer cause kjournald to > spin forwever, a subsequent call to jbd2_log_wait_commit() with a > stale transaction id would wait for a very long time (possibly until > the heat death of the universe :-) > > I think a patch like this should fix things; I've run a stress test > with a hack to increment the transaction id by 1 << 24 after each > commit, to more quickly cause an tid wrap, and the regression tests > seem to be passing without complaint. > > - Ted > > From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001 > From: Theodore Ts'o <tytso@mit.edu> > Date: Sun, 17 Mar 2013 22:24:46 -0400 > Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by > wraparound > > In the case where an inode has a very stale transaction id (tid) in > i_datasync_tid or i_sync_tid, it's possible that after a very large > (2**31) number of transactions, that the tid number space might wrap, > causing tid_geq()'s calculations to fail. > > Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified > by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", > attempted to fix this problem, but it only avoided kjournald spinning > forever by fixing the logic in jbd2_log_start_commit(). > > Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c > that might call jbd2_log_start_commit() with a stale tid, those > functions will subsequently call jbd2_log_wait_commit() with the same > stale tid, and then wait for a very long time. To fix this, we > replace the calls to jbd2_log_start_commit() and > jbd2_log_wait_commit() with a call to a new function, > jbd2_complete_transaction(), which will correctly handle stale tid's. > > As a bonus, jbd2_complete_transaction() will avoid locking > j_state_lock for writing unless a commit needs to be started. This > should have a small (but probably not measurable) improvement for > ext4's scalability. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Reported-by: Ben Hutchings <ben@decadent.org.uk> > Reported-by: George Barnett <gbarnett@atlassian.com> > Cc: stable@vger.kernel.org Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of inventing new function? So jbd2_log_wait_commit() would do something like: __func__, journal->j_commit_request, tid); } #endif + /* Not running or committing trans => must be already committed. */ + if (!((journal->j_running_transaction && + journal->j_running_transaction->t_tid == tid) || + (journal->j_committing_transaction && + journal->j_committing_transaction->t_tid == tid))) { + read_unlock(&journal->j_state_lock); + return 0; + } while (tid_gt(tid, journal->j_commit_sequence)) { jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n", tid, journal->j_commit_sequence); Honza > --- > fs/ext4/fsync.c | 3 +-- > fs/ext4/inode.c | 3 +-- > fs/jbd2/journal.c | 31 +++++++++++++++++++++++++++++++ > include/linux/jbd2.h | 1 + > 4 files changed, 34 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index 3278e64..e0ba8a4 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > if (journal->j_flags & JBD2_BARRIER && > !jbd2_trans_will_send_data_barrier(journal, commit_tid)) > needs_barrier = true; > - jbd2_log_start_commit(journal, commit_tid); > - ret = jbd2_log_wait_commit(journal, commit_tid); > + ret = jbd2_complete_transaction(journal, commit_tid); > if (needs_barrier) { > err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); > if (!ret) > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index b6fab7c..de4b58d 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode) > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; > > - jbd2_log_start_commit(journal, commit_tid); > - jbd2_log_wait_commit(journal, commit_tid); > + jbd2_complete_transaction(journal, commit_tid); > filemap_write_and_wait(&inode->i_data); > } > truncate_inode_pages(&inode->i_data, 0); > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c > index ed10991..886ec2f 100644 > --- a/fs/jbd2/journal.c > +++ b/fs/jbd2/journal.c > @@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid) > } > > /* > + * When this function returns the transaction corresponding to tid > + * will be completed. If the transaction has currently running, start > + * committing that transaction before waiting for it to complete. If > + * the transaction id is stale, it is by definition already completed, > + * so just return SUCCESS. > + */ > +int jbd2_complete_transaction(journal_t *journal, tid_t tid) > +{ > + int need_to_wait = 1; > + > + read_lock(&journal->j_state_lock); > + if (journal->j_running_transaction && > + journal->j_running_transaction->t_tid == tid) { > + if (journal->j_commit_request != tid) { > + /* transaction not yet started, so request it */ > + read_unlock(&journal->j_state_lock); > + jbd2_log_start_commit(journal, tid); > + goto wait_commit; > + } > + } else if (!(journal->j_committing_transaction && > + journal->j_committing_transaction->t_tid == tid)) > + need_to_wait = 0; > + read_unlock(&journal->j_state_lock); > + if (!need_to_wait) > + return 0; > +wait_commit: > + return jbd2_log_wait_commit(journal, tid); > +} > +EXPORT_SYMBOL(jbd2_complete_transaction); > + > +/* > * Log buffer allocation routines: > */ > > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h > index 50e5a5e..f028975 100644 > --- a/include/linux/jbd2.h > +++ b/include/linux/jbd2.h > @@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid); > int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); > int jbd2_journal_force_commit_nested(journal_t *journal); > int jbd2_log_wait_commit(journal_t *journal, tid_t tid); > +int jbd2_complete_transaction(journal_t *journal, tid_t tid); > int jbd2_log_do_checkpoint(journal_t *journal); > int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid); > > -- > 1.7.12.rc0.22.gcdd159b > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-21 20:46 ` Jan Kara @ 2013-03-21 21:09 ` Theodore Ts'o 2013-03-21 22:41 ` Jan Kara 0 siblings, 1 reply; 9+ messages in thread From: Theodore Ts'o @ 2013-03-21 21:09 UTC (permalink / raw) To: Jan Kara Cc: Ben Hutchings, George Barnett, linux-ext4, Debian kernel maintainers On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote: > Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of > inventing new function? In most of the places where we call jbd2_log_start_commit(), we're actually starting the current running transaction. So the fact that we pass in a tid, and we're having to validate that the tid is actually a valid one, is a bit of a waste. So in the long run I think it's worth rethinking whether or not jbd2_log_{start,wait}_commit() should exist in their current form, or whether we should reorganize their functionality (i.e., by having a jbd2_start_running_commit(), for example.). Piling on fixes to jbd2_log_wait_commit() would make it get even more complicated, and I think if we separate out the various ways in which we use these functions, we can make the code simpler and easier to read. In fact, I had started making this rather large set of changes when I decided it would be better to save that kind of wholesale refactoring for the next merge window. So the reason why I ended up fixing the patch the way I did was to keep things simple. Also as I mentioned in the commit description, by using a single function I was also able to optimize the locking the locking somewhat. - Ted ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound 2013-03-21 21:09 ` Theodore Ts'o @ 2013-03-21 22:41 ` Jan Kara 0 siblings, 0 replies; 9+ messages in thread From: Jan Kara @ 2013-03-21 22:41 UTC (permalink / raw) To: Theodore Ts'o Cc: Jan Kara, Ben Hutchings, George Barnett, linux-ext4, Debian kernel maintainers On Thu 21-03-13 17:09:40, Ted Tso wrote: > On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote: > > Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of > > inventing new function? > > In most of the places where we call jbd2_log_start_commit(), we're > actually starting the current running transaction. So the fact that > we pass in a tid, and we're having to validate that the tid is > actually a valid one, is a bit of a waste. So in the long run I think > it's worth rethinking whether or not jbd2_log_{start,wait}_commit() > should exist in their current form, or whether we should reorganize > their functionality (i.e., by having a jbd2_start_running_commit(), > for example.). Piling on fixes to jbd2_log_wait_commit() would make > it get even more complicated, and I think if we separate out the > various ways in which we use these functions, we can make the code > simpler and easier to read. I don't find jbd2_log_wait_commit() that complex that it couldn't bear another if :) But given there are really two waiting operations that make sense: a) request commit of running transaction and wait for it b) wait for committing transaction then I agree there may be a better interface. OTOH I'm somewhat curious about the new interface because the only race-free way of identifying a transaction you want to wait for is using its tid. > In fact, I had started making this rather large set of changes when I > decided it would be better to save that kind of wholesale refactoring > for the next merge window. So the reason why I ended up fixing the > patch the way I did was to keep things simple. > > Also as I mentioned in the commit description, by using a single > function I was also able to optimize the locking the locking somewhat. Yeah. I'm not as much opposed to the new function doing start commit & wait but what I dislike is the fact that we have still exposed the function jbd2_log_wait_commit() which can possibly lockup if tid overflows. I agree there aren't currently any other callers where this could happen but in a few years who knows... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2013-03-21 22:41 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <B2EC601CDDA242189A46599B31EA6AD3@atlassian.com> 2013-03-16 5:34 ` jbd2 tid wrap seen on NFS server Ben Hutchings 2013-03-18 2:54 ` [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound Theodore Ts'o 2013-03-18 4:24 ` George Barnett 2013-03-18 4:53 ` Ben Hutchings 2013-03-18 14:31 ` Theodore Ts'o 2013-03-18 14:34 ` Theodore Ts'o 2013-03-21 20:46 ` Jan Kara 2013-03-21 21:09 ` Theodore Ts'o 2013-03-21 22:41 ` Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).