* [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations @ 2015-06-10 18:13 Eric Whitney 2015-06-10 19:28 ` Theodore Ts'o 0 siblings, 1 reply; 14+ messages in thread From: Eric Whitney @ 2015-06-10 18:13 UTC (permalink / raw) To: linux-ext4; +Cc: tytso Online defrag is not currently implemented for journaled data and bigalloc ext4 file systems. The xfstests that exercise online defrag do not verify that the test file system supports that feature before executing, resulting in spurious test failures. (In the case of ext4/307, the test reports success but actually fails because the e4compact test component is ignoring failures.) For the time being, simply exclude these tests to eliminate the test failure noise. Signed-off-by: Eric Whitney <enwlinux@gmail.com> --- kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude | 6 ++++++ kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude | 6 ++++++ kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude | 6 ++++++ 3 files changed, 18 insertions(+) create mode 100644 kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude create mode 100644 kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude diff --git a/kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude b/kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude new file mode 100644 index 0000000..b925016 --- /dev/null +++ b/kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude @@ -0,0 +1,6 @@ +ext4/301 +ext4/302 +ext4/303 +ext4/304 +ext4/307 +ext4/308 diff --git a/kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude b/kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude new file mode 100644 index 0000000..b925016 --- /dev/null +++ b/kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude @@ -0,0 +1,6 @@ +ext4/301 +ext4/302 +ext4/303 +ext4/304 +ext4/307 +ext4/308 diff --git a/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude b/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude index e2bcac3..5d3dcae 100644 --- a/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude +++ b/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude @@ -1 +1,7 @@ +ext4/301 +ext4/302 +ext4/303 +ext4/304 +ext4/307 +ext4/308 generic/068 -- 2.1.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations 2015-06-10 18:13 [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations Eric Whitney @ 2015-06-10 19:28 ` Theodore Ts'o 2015-06-11 20:00 ` Eric Whitney 0 siblings, 1 reply; 14+ messages in thread From: Theodore Ts'o @ 2015-06-10 19:28 UTC (permalink / raw) To: Eric Whitney; +Cc: linux-ext4 On Wed, Jun 10, 2015 at 02:13:19PM -0400, Eric Whitney wrote: > Online defrag is not currently implemented for journaled data and > bigalloc ext4 file systems. The xfstests that exercise online defrag > do not verify that the test file system supports that feature before > executing, resulting in spurious test failures. (In the case of > ext4/307, the test reports success but actually fails because the > e4compact test component is ignoring failures.) > > For the time being, simply exclude these tests to eliminate the test > failure noise. > > Signed-off-by: Eric Whitney <enwlinux@gmail.com> Thanks, applied. BTW, I had since removed data_journal.exclude because generic/068 should be passing with data journalling enabled. - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations 2015-06-10 19:28 ` Theodore Ts'o @ 2015-06-11 20:00 ` Eric Whitney 2015-06-12 0:24 ` Eric Whitney 0 siblings, 1 reply; 14+ messages in thread From: Eric Whitney @ 2015-06-11 20:00 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Eric Whitney, linux-ext4 I haven't tried to run generic/068 in quite a while, but do see it's passing now. I'll close the related bugzillas. Do you know which patch fixed this? Thanks, Eric * Theodore Ts'o <tytso@mit.edu>: > On Wed, Jun 10, 2015 at 02:13:19PM -0400, Eric Whitney wrote: > > Online defrag is not currently implemented for journaled data and > > bigalloc ext4 file systems. The xfstests that exercise online defrag > > do not verify that the test file system supports that feature before > > executing, resulting in spurious test failures. (In the case of > > ext4/307, the test reports success but actually fails because the > > e4compact test component is ignoring failures.) > > > > For the time being, simply exclude these tests to eliminate the test > > failure noise. > > > > Signed-off-by: Eric Whitney <enwlinux@gmail.com> > > Thanks, applied. BTW, I had since removed data_journal.exclude > because generic/068 should be passing with data journalling enabled. > > - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations 2015-06-11 20:00 ` Eric Whitney @ 2015-06-12 0:24 ` Eric Whitney 2015-06-15 1:14 ` Theodore Ts'o 0 siblings, 1 reply; 14+ messages in thread From: Eric Whitney @ 2015-06-12 0:24 UTC (permalink / raw) To: Eric Whitney; +Cc: Theodore Ts'o, linux-ext4 Spoke too soon - hit what looks like the old problem on ARM running 4.1-rc7 on the ninth trial run. Various components of the test are left in the 'D' state after a familiar oops - kernel BUG at fs/jbd2/transaction.c:2150 while truncating. I'll see if I can still get the problem to appear on x86_64. Eric * Eric Whitney <enwlinux@gmail.com>: > I haven't tried to run generic/068 in quite a while, but do see it's passing > now. I'll close the related bugzillas. Do you know which patch fixed this? > > Thanks, > Eric > > > * Theodore Ts'o <tytso@mit.edu>: > > On Wed, Jun 10, 2015 at 02:13:19PM -0400, Eric Whitney wrote: > > > Online defrag is not currently implemented for journaled data and > > > bigalloc ext4 file systems. The xfstests that exercise online defrag > > > do not verify that the test file system supports that feature before > > > executing, resulting in spurious test failures. (In the case of > > > ext4/307, the test reports success but actually fails because the > > > e4compact test component is ignoring failures.) > > > > > > For the time being, simply exclude these tests to eliminate the test > > > failure noise. > > > > > > Signed-off-by: Eric Whitney <enwlinux@gmail.com> > > > > Thanks, applied. BTW, I had since removed data_journal.exclude > > because generic/068 should be passing with data journalling enabled. > > > > - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations 2015-06-12 0:24 ` Eric Whitney @ 2015-06-15 1:14 ` Theodore Ts'o 2015-06-15 1:23 ` [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() Theodore Ts'o 0 siblings, 1 reply; 14+ messages in thread From: Theodore Ts'o @ 2015-06-15 1:14 UTC (permalink / raw) To: Eric Whitney; +Cc: linux-ext4 On Thu, Jun 11, 2015 at 08:24:21PM -0400, Eric Whitney wrote: > Spoke too soon - hit what looks like the old problem on ARM running 4.1-rc7 > on the ninth trial run. Various components of the test are left in the 'D' > state after a familiar oops - kernel BUG at fs/jbd2/transaction.c:2150 while > truncating. > > I'll see if I can still get the problem to appear on x86_64. I was able to reproduce this on x86_64, after running a kernel build on the host as an antogonist. I'm guessing that the race window is pretty narrow, so that's why it wasn't triggering for me easily until I had some additional I/O traffic to slow things down further. After some investigation, it looks like this bug has been around since 2008(!). I'll send out a fix. - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 1:14 ` Theodore Ts'o @ 2015-06-15 1:23 ` Theodore Ts'o 2015-06-15 12:33 ` Jan Kara 2015-06-15 20:39 ` Eric Whitney 0 siblings, 2 replies; 14+ messages in thread From: Theodore Ts'o @ 2015-06-15 1:23 UTC (permalink / raw) To: Ext4 Developers List; +Cc: enwlinux, jack, Theodore Ts'o, stable The commit cf108bca465d: "ext4: Invert the locking order of page_lock and transaction start" caused __ext4_journalled_writepage() to drop the page lock before the page was written back, as part of changing the locking order to jbd2_journal_start -> page_lock. However, this introduced a potential race if there was a truncate racing with the data=journalled writeback mode. Fix this by grabbing the page lock after starting the journal handle, and then checking to see if page had gotten truncated out from under us. This fixes a number of different crashes or BUG_ON's when running xfstests generic/086 in data=journalled mode, including: jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434 ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 - and - kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! ... Call Trace: [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c027d883>] ? lock_buffer+0x36/0x36 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 [<c0229139>] do_invalidatepage+0x22/0x26 [<c0229198>] truncate_inode_page+0x5b/0x85 [<c022934b>] truncate_inode_pages_range+0x156/0x38c [<c0229592>] truncate_inode_pages+0x11/0x15 [<c022962d>] truncate_pagecache+0x55/0x71 [<c02b913b>] ext4_setattr+0x4a9/0x560 [<c01ca542>] ? current_kernel_time+0x10/0x44 [<c026c4d8>] notify_change+0x1c7/0x2be [<c0256a00>] do_truncate+0x65/0x85 [<c0226f31>] ? file_ra_state_init+0x12/0x29 - and - WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 irty_metadata+0x14a/0x1ae() ... Call Trace: [<c01b879f>] ? console_unlock+0x3a1/0x3ce [<c082cbb4>] dump_stack+0x48/0x60 [<c0178b65>] warn_slowpath_common+0x89/0xa0 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae [<c0178bef>] warn_slowpath_null+0x14/0x18 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d [<c02b2f44>] write_end_fn+0x40/0x53 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a [<c02b59e7>] ext4_writepage+0x354/0x3b8 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b5a5b>] __writepage+0x10/0x2e [<c0225956>] write_cache_pages+0x22d/0x32c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b6ee8>] ext4_writepages+0x102/0x607 [<c019adfe>] ? sched_clock_local+0x10/0x10e [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 [<c01a8ad5>] ? lock_is_held+0x43/0x51 [<c0226dff>] do_writepages+0x1c/0x29 [<c0276bed>] __writeback_single_inode+0xc3/0x545 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d ... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org --- fs/ext4/inode.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0554b0b..263a46c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page, ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL, bget_one); } - /* As soon as we unlock the page, it can go away, but we have - * references to buffers so we are safe */ + /* + * We need to release the page lock before we start the + * journal, so grab a reference so the page won't disappear + * out from under us. + */ + get_page(page); unlock_page(page); handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, ext4_writepage_trans_blocks(inode)); if (IS_ERR(handle)) { ret = PTR_ERR(handle); - goto out; + put_page(page); + goto out_no_pagelock; } - BUG_ON(!ext4_handle_valid(handle)); + lock_page(page); + put_page(page); + if (page->mapping != mapping) { + /* The page got truncated from under us */ + ext4_journal_stop(handle); + ret = 0; + goto out; + } + if (inline_data) { BUFFER_TRACE(inode_bh, "get write access"); ret = ext4_journal_get_write_access(handle, inode_bh); @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page, NULL, bput_one); ext4_set_inode_state(inode, EXT4_STATE_JDATA); out: + unlock_page(page); +out_no_pagelock: brelse(inode_bh); return ret; } -- 2.3.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 1:23 ` [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() Theodore Ts'o @ 2015-06-15 12:33 ` Jan Kara 2015-06-15 13:06 ` Theodore Ts'o 2015-06-15 20:39 ` Eric Whitney 1 sibling, 1 reply; 14+ messages in thread From: Jan Kara @ 2015-06-15 12:33 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Ext4 Developers List, enwlinux, jack, stable On Sun 14-06-15 21:23:50, Ted Tso wrote: > The commit cf108bca465d: "ext4: Invert the locking order of page_lock > and transaction start" caused __ext4_journalled_writepage() to drop > the page lock before the page was written back, as part of changing > the locking order to jbd2_journal_start -> page_lock. However, this > introduced a potential race if there was a truncate racing with the > data=journalled writeback mode. > > Fix this by grabbing the page lock after starting the journal handle, > and then checking to see if page had gotten truncated out from under > us. > > This fixes a number of different crashes or BUG_ON's when running > xfstests generic/086 in data=journalled mode, including: > > jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434 > ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 > > - and - > > kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! Yeah, that's nasty. Thanks for debugging this! However I think your fix reintroduces the original deadlock issues. do_journal_get_write_access() can end up blocking waiting for jbd2 thread to finish a commit while jbd2 thread may be blocked waiting for the page to be unlocked. After some thought I don't think the deadlock is real since do_journal_get_write_access() will currently only block if a buffer is under writeout to the journal and at that point we don't wait for page locks anymore. Also ext4_write_begin() does the same in data=journal mode and we haven't observed deadlocks so far. But still things look really fragile here. A clean fix for these problems would be to implement ext4_journalled_writepages() which will start a transaction and then writeback a bunch of pages. Similarly for write_begin() case we could start the transaction in ext4_write() (and loop there since a single write may need to be split among several transactions). However this is relatively extensive work given how rarely the code is used... So for now, feel free to add: Acked-by: Jan Kara <jack@suse.cz> to the patch. Honza > --- > fs/ext4/inode.c | 23 +++++++++++++++++++---- > 1 file changed, 19 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 0554b0b..263a46c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page, > ext4_walk_page_buffers(handle, page_bufs, 0, len, > NULL, bget_one); > } > - /* As soon as we unlock the page, it can go away, but we have > - * references to buffers so we are safe */ > + /* > + * We need to release the page lock before we start the > + * journal, so grab a reference so the page won't disappear > + * out from under us. > + */ > + get_page(page); > unlock_page(page); > > handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, > ext4_writepage_trans_blocks(inode)); > if (IS_ERR(handle)) { > ret = PTR_ERR(handle); > - goto out; > + put_page(page); > + goto out_no_pagelock; > } > - > BUG_ON(!ext4_handle_valid(handle)); > > + lock_page(page); > + put_page(page); > + if (page->mapping != mapping) { > + /* The page got truncated from under us */ > + ext4_journal_stop(handle); > + ret = 0; > + goto out; > + } > + > if (inline_data) { > BUFFER_TRACE(inode_bh, "get write access"); > ret = ext4_journal_get_write_access(handle, inode_bh); > @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page, > NULL, bput_one); > ext4_set_inode_state(inode, EXT4_STATE_JDATA); > out: > + unlock_page(page); > +out_no_pagelock: > brelse(inode_bh); > return ret; > } > -- > 2.3.0 > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 12:33 ` Jan Kara @ 2015-06-15 13:06 ` Theodore Ts'o 2015-06-15 17:03 ` Jan Kara 2015-06-15 17:59 ` Andreas Dilger 0 siblings, 2 replies; 14+ messages in thread From: Theodore Ts'o @ 2015-06-15 13:06 UTC (permalink / raw) To: Jan Kara; +Cc: Ext4 Developers List, enwlinux, stable On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote: > Yeah, that's nasty. Thanks for debugging this! However I think your fix > reintroduces the original deadlock issues. do_journal_get_write_access() > can end up blocking waiting for jbd2 thread to finish a commit while jbd2 > thread may be blocked waiting for the page to be unlocked. > > After some thought I don't think the deadlock is real since > do_journal_get_write_access() will currently only block if a buffer is > under writeout to the journal and at that point we don't wait for page > locks anymore. Also ext4_write_begin() does the same in data=journal mode > and we haven't observed deadlocks so far. But still things look really > fragile here. The reason why there are no deadlocks is the writeback in the commit thread happens when the inode gets written back --- but that only happens for data=ordered inodes, not data=journalled mode. I was a little worried about what might happen when after the 'j' chattr attribute gets set on an inode, and the inode was still on the ordered flush list. Hmm... I think we could also maybe fix this by having ext4_change_inode_journal_flag() force a journal commit before setting the JOURNAL_DATA flag. If we did that, we could just avoid dropping the page_lock in __ext4_journalled_writepage() altogether. What do you think? - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 13:06 ` Theodore Ts'o @ 2015-06-15 17:03 ` Jan Kara 2015-06-15 19:37 ` Theodore Ts'o 2015-06-15 17:59 ` Andreas Dilger 1 sibling, 1 reply; 14+ messages in thread From: Jan Kara @ 2015-06-15 17:03 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Jan Kara, Ext4 Developers List, enwlinux, stable On Mon 15-06-15 09:06:11, Ted Tso wrote: > On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote: > > Yeah, that's nasty. Thanks for debugging this! However I think your fix > > reintroduces the original deadlock issues. do_journal_get_write_access() > > can end up blocking waiting for jbd2 thread to finish a commit while jbd2 > > thread may be blocked waiting for the page to be unlocked. > > > > After some thought I don't think the deadlock is real since > > do_journal_get_write_access() will currently only block if a buffer is > > under writeout to the journal and at that point we don't wait for page > > locks anymore. Also ext4_write_begin() does the same in data=journal mode > > and we haven't observed deadlocks so far. But still things look really > > fragile here. > > The reason why there are no deadlocks is the writeback in the commit > thread happens when the inode gets written back --- but that only > happens for data=ordered inodes, not data=journalled mode. I was a > little worried about what might happen when after the 'j' chattr > attribute gets set on an inode, and the inode was still on the ordered > flush list. > > Hmm... I think we could also maybe fix this by having > ext4_change_inode_journal_flag() force a journal commit before setting > the JOURNAL_DATA flag. If we did that, we could just avoid dropping > the page_lock in __ext4_journalled_writepage() altogether. > > What do you think? I think that fully switching lock ordering for data=journal mode back to page lock -> transaction start (which is what you effectively do when you never drop page lock in ->writepage) is rather error prone. We'd have to be careful to avoid lock inversion also for places like ->write_begin, ->releasepage, ->invalidatepage etc. For example ext4_write_begin() will currently call lock_page() with transaction started which could deadlock against journalled writepage you suggest. So effectively we'd have to completely separate aops for data=journal mode. Doable but I'm not sure it's worth it. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 17:03 ` Jan Kara @ 2015-06-15 19:37 ` Theodore Ts'o 0 siblings, 0 replies; 14+ messages in thread From: Theodore Ts'o @ 2015-06-15 19:37 UTC (permalink / raw) To: Jan Kara; +Cc: Ext4 Developers List, enwlinux, stable On Mon, Jun 15, 2015 at 07:03:25PM +0200, Jan Kara wrote: > I think that fully switching lock ordering for data=journal mode back to > page lock -> transaction start (which is what you effectively do when you > never drop page lock in ->writepage) is rather error prone. We'd have to be > careful to avoid lock inversion also for places like ->write_begin, > ->releasepage, ->invalidatepage etc. For example ext4_write_begin() will > currently call lock_page() with transaction started which could deadlock > against journalled writepage you suggest. So effectively we'd have to > completely separate aops for data=journal mode. Doable but I'm not sure > it's worth it. Good point. OTOH, things are kind of fragile with respect to ext4_write_begin() in data=journal mode, as you've pointed out. But it's not causing any problems as far as we know, so so I'll use the patch which you ack'ed and maybe later on, we can figure out a better way to clean this up. Thanks, - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 13:06 ` Theodore Ts'o 2015-06-15 17:03 ` Jan Kara @ 2015-06-15 17:59 ` Andreas Dilger 2015-06-16 12:57 ` Jan Kara 1 sibling, 1 reply; 14+ messages in thread From: Andreas Dilger @ 2015-06-15 17:59 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Jan Kara, Ext4 Developers List, enwlinux, stable On Jun 15, 2015, at 7:06 AM, Theodore Ts'o <tytso@mit.edu> wrote: > > On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote: >> Yeah, that's nasty. Thanks for debugging this! However I think your fix >> reintroduces the original deadlock issues. do_journal_get_write_access() >> can end up blocking waiting for jbd2 thread to finish a commit while jbd2 >> thread may be blocked waiting for the page to be unlocked. >> >> After some thought I don't think the deadlock is real since >> do_journal_get_write_access() will currently only block if a buffer is >> under writeout to the journal and at that point we don't wait for page >> locks anymore. Also ext4_write_begin() does the same in data=journal mode >> and we haven't observed deadlocks so far. But still things look really >> fragile here. > > The reason why there are no deadlocks is the writeback in the commit > thread happens when the inode gets written back --- but that only > happens for data=ordered inodes, not data=journalled mode. I was a > little worried about what might happen when after the 'j' chattr > attribute gets set on an inode, and the inode was still on the ordered > flush list. > > Hmm... I think we could also maybe fix this by having > ext4_change_inode_journal_flag() force a journal commit before setting > the JOURNAL_DATA flag. If we did that, we could just avoid dropping > the page_lock in __ext4_journalled_writepage() altogether. That is already being done: int ext4_change_inode_journal_flag(struct inode *inode, int val) { : : jbd2_journal_lock_updates(journal); : : /** * void jbd2_journal_lock_updates () - establish a transaction barrier. * @journal: Journal to establish a barrier on. * * This locks out any further updates from being started, and blocks * until all existing updates have completed, returning only once the * journal is in a quiescent state with no updates running. * * The journal lock should not be held on entry. */ Cheers, Andreas ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 17:59 ` Andreas Dilger @ 2015-06-16 12:57 ` Jan Kara 0 siblings, 0 replies; 14+ messages in thread From: Jan Kara @ 2015-06-16 12:57 UTC (permalink / raw) To: Andreas Dilger Cc: Theodore Ts'o, Jan Kara, Ext4 Developers List, enwlinux, stable On Mon 15-06-15 11:59:59, Andreas Dilger wrote: > On Jun 15, 2015, at 7:06 AM, Theodore Ts'o <tytso@mit.edu> wrote: > > > > On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote: > >> Yeah, that's nasty. Thanks for debugging this! However I think your fix > >> reintroduces the original deadlock issues. do_journal_get_write_access() > >> can end up blocking waiting for jbd2 thread to finish a commit while jbd2 > >> thread may be blocked waiting for the page to be unlocked. > >> > >> After some thought I don't think the deadlock is real since > >> do_journal_get_write_access() will currently only block if a buffer is > >> under writeout to the journal and at that point we don't wait for page > >> locks anymore. Also ext4_write_begin() does the same in data=journal mode > >> and we haven't observed deadlocks so far. But still things look really > >> fragile here. > > > > The reason why there are no deadlocks is the writeback in the commit > > thread happens when the inode gets written back --- but that only > > happens for data=ordered inodes, not data=journalled mode. I was a > > little worried about what might happen when after the 'j' chattr > > attribute gets set on an inode, and the inode was still on the ordered > > flush list. > > > > Hmm... I think we could also maybe fix this by having > > ext4_change_inode_journal_flag() force a journal commit before setting > > the JOURNAL_DATA flag. If we did that, we could just avoid dropping > > the page_lock in __ext4_journalled_writepage() altogether. > > That is already being done: > > int ext4_change_inode_journal_flag(struct inode *inode, int val) > { > : > : > > jbd2_journal_lock_updates(journal); > > : > : > > > /** > * void jbd2_journal_lock_updates () - establish a transaction barrier. > * @journal: Journal to establish a barrier on. > * > * This locks out any further updates from being started, and blocks > * until all existing updates have completed, returning only once the > * journal is in a quiescent state with no updates running. > * > * The journal lock should not be held on entry. > */ Well, jbd2_journal_lock_updates() makes sure there are no handles for the running transaction but it doesn't ensure current transaction is committed (which is what Ted wanted because he wanted to avoid inode being both ordered and journaled in the same transaction). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 1:23 ` [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() Theodore Ts'o 2015-06-15 12:33 ` Jan Kara @ 2015-06-15 20:39 ` Eric Whitney 2015-06-16 19:29 ` Eric Whitney 1 sibling, 1 reply; 14+ messages in thread From: Eric Whitney @ 2015-06-15 20:39 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Ext4 Developers List, enwlinux, jack, stable * Theodore Ts'o <tytso@mit.edu>: > The commit cf108bca465d: "ext4: Invert the locking order of page_lock > and transaction start" caused __ext4_journalled_writepage() to drop > the page lock before the page was written back, as part of changing > the locking order to jbd2_journal_start -> page_lock. However, this > introduced a potential race if there was a truncate racing with the > data=journalled writeback mode. > > Fix this by grabbing the page lock after starting the journal handle, > and then checking to see if page had gotten truncated out from under > us. > > This fixes a number of different crashes or BUG_ON's when running > xfstests generic/086 in data=journalled mode, including: > > jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434 > ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 > > - and - > > kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! > ... > Call Trace: > [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 > [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 > [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 > [<c027d883>] ? lock_buffer+0x36/0x36 > [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 > [<c0229139>] do_invalidatepage+0x22/0x26 > [<c0229198>] truncate_inode_page+0x5b/0x85 > [<c022934b>] truncate_inode_pages_range+0x156/0x38c > [<c0229592>] truncate_inode_pages+0x11/0x15 > [<c022962d>] truncate_pagecache+0x55/0x71 > [<c02b913b>] ext4_setattr+0x4a9/0x560 > [<c01ca542>] ? current_kernel_time+0x10/0x44 > [<c026c4d8>] notify_change+0x1c7/0x2be > [<c0256a00>] do_truncate+0x65/0x85 > [<c0226f31>] ? file_ra_state_init+0x12/0x29 > > - and - > > WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 > irty_metadata+0x14a/0x1ae() > ... > Call Trace: > [<c01b879f>] ? console_unlock+0x3a1/0x3ce > [<c082cbb4>] dump_stack+0x48/0x60 > [<c0178b65>] warn_slowpath_common+0x89/0xa0 > [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae > [<c0178bef>] warn_slowpath_null+0x14/0x18 > [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae > [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d > [<c02b2f44>] write_end_fn+0x40/0x53 > [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a > [<c02b59e7>] ext4_writepage+0x354/0x3b8 > [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 > [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c > [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 > [<c02b5a5b>] __writepage+0x10/0x2e > [<c0225956>] write_cache_pages+0x22d/0x32c > [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 > [<c02b6ee8>] ext4_writepages+0x102/0x607 > [<c019adfe>] ? sched_clock_local+0x10/0x10e > [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 > [<c01a8ad5>] ? lock_is_held+0x43/0x51 > [<c0226dff>] do_writepages+0x1c/0x29 > [<c0276bed>] __writeback_single_inode+0xc3/0x545 > [<c0277c07>] writeback_sb_inodes+0x21f/0x36d > ... > > Signed-off-by: Theodore Ts'o <tytso@mit.edu> > Cc: stable@vger.kernel.org > --- > fs/ext4/inode.c | 23 +++++++++++++++++++---- > 1 file changed, 19 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 0554b0b..263a46c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page, > ext4_walk_page_buffers(handle, page_bufs, 0, len, > NULL, bget_one); > } > - /* As soon as we unlock the page, it can go away, but we have > - * references to buffers so we are safe */ > + /* > + * We need to release the page lock before we start the > + * journal, so grab a reference so the page won't disappear > + * out from under us. > + */ > + get_page(page); > unlock_page(page); > > handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, > ext4_writepage_trans_blocks(inode)); > if (IS_ERR(handle)) { > ret = PTR_ERR(handle); > - goto out; > + put_page(page); > + goto out_no_pagelock; > } > - > BUG_ON(!ext4_handle_valid(handle)); > > + lock_page(page); > + put_page(page); > + if (page->mapping != mapping) { > + /* The page got truncated from under us */ > + ext4_journal_stop(handle); > + ret = 0; > + goto out; > + } > + > if (inline_data) { > BUFFER_TRACE(inode_bh, "get write access"); > ret = ext4_journal_get_write_access(handle, inode_bh); > @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page, > NULL, bput_one); > ext4_set_inode_state(inode, EXT4_STATE_JDATA); > out: > + unlock_page(page); > +out_no_pagelock: > brelse(inode_bh); > return ret; > } > -- > 2.3.0 > This patch looks promising. I'm running a 1000 trial stress test on a Pandaboard where I've generally been able to force a couple of manifestations of this bug to appear within 5 to 10 runs. Applied to 4.1-rc7, it's passed 135 trials cleanly. The full series will complete sometime tomorrow. I was also able to reproduce the problem on x86_64 pretty consistently in four runs or less on 4.1-rc7; I'm planning a stress test there as well once -rc8 regression is complete. Thanks! Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() 2015-06-15 20:39 ` Eric Whitney @ 2015-06-16 19:29 ` Eric Whitney 0 siblings, 0 replies; 14+ messages in thread From: Eric Whitney @ 2015-06-16 19:29 UTC (permalink / raw) To: Eric Whitney; +Cc: Theodore Ts'o, Ext4 Developers List, jack, stable * Eric Whitney <enwlinux@gmail.com>: > * Theodore Ts'o <tytso@mit.edu>: > > The commit cf108bca465d: "ext4: Invert the locking order of page_lock > > and transaction start" caused __ext4_journalled_writepage() to drop > > the page lock before the page was written back, as part of changing > > the locking order to jbd2_journal_start -> page_lock. However, this > > introduced a potential race if there was a truncate racing with the > > data=journalled writeback mode. > > > > Fix this by grabbing the page lock after starting the journal handle, > > and then checking to see if page had gotten truncated out from under > > us. > > > > This fixes a number of different crashes or BUG_ON's when running > > xfstests generic/086 in data=journalled mode, including: > > > > jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434 > > ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 > > > > - and - > > > > kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! > > ... > > Call Trace: > > [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 > > [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 > > [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 > > [<c027d883>] ? lock_buffer+0x36/0x36 > > [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 > > [<c0229139>] do_invalidatepage+0x22/0x26 > > [<c0229198>] truncate_inode_page+0x5b/0x85 > > [<c022934b>] truncate_inode_pages_range+0x156/0x38c > > [<c0229592>] truncate_inode_pages+0x11/0x15 > > [<c022962d>] truncate_pagecache+0x55/0x71 > > [<c02b913b>] ext4_setattr+0x4a9/0x560 > > [<c01ca542>] ? current_kernel_time+0x10/0x44 > > [<c026c4d8>] notify_change+0x1c7/0x2be > > [<c0256a00>] do_truncate+0x65/0x85 > > [<c0226f31>] ? file_ra_state_init+0x12/0x29 > > > > - and - > > > > WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 > > irty_metadata+0x14a/0x1ae() > > ... > > Call Trace: > > [<c01b879f>] ? console_unlock+0x3a1/0x3ce > > [<c082cbb4>] dump_stack+0x48/0x60 > > [<c0178b65>] warn_slowpath_common+0x89/0xa0 > > [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae > > [<c0178bef>] warn_slowpath_null+0x14/0x18 > > [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae > > [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d > > [<c02b2f44>] write_end_fn+0x40/0x53 > > [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a > > [<c02b59e7>] ext4_writepage+0x354/0x3b8 > > [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 > > [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c > > [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 > > [<c02b5a5b>] __writepage+0x10/0x2e > > [<c0225956>] write_cache_pages+0x22d/0x32c > > [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 > > [<c02b6ee8>] ext4_writepages+0x102/0x607 > > [<c019adfe>] ? sched_clock_local+0x10/0x10e > > [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 > > [<c01a8ad5>] ? lock_is_held+0x43/0x51 > > [<c0226dff>] do_writepages+0x1c/0x29 > > [<c0276bed>] __writeback_single_inode+0xc3/0x545 > > [<c0277c07>] writeback_sb_inodes+0x21f/0x36d > > ... > > > > Signed-off-by: Theodore Ts'o <tytso@mit.edu> > > Cc: stable@vger.kernel.org > > --- > > fs/ext4/inode.c | 23 +++++++++++++++++++---- > > 1 file changed, 19 insertions(+), 4 deletions(-) > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 0554b0b..263a46c 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page, > > ext4_walk_page_buffers(handle, page_bufs, 0, len, > > NULL, bget_one); > > } > > - /* As soon as we unlock the page, it can go away, but we have > > - * references to buffers so we are safe */ > > + /* > > + * We need to release the page lock before we start the > > + * journal, so grab a reference so the page won't disappear > > + * out from under us. > > + */ > > + get_page(page); > > unlock_page(page); > > > > handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, > > ext4_writepage_trans_blocks(inode)); > > if (IS_ERR(handle)) { > > ret = PTR_ERR(handle); > > - goto out; > > + put_page(page); > > + goto out_no_pagelock; > > } > > - > > BUG_ON(!ext4_handle_valid(handle)); > > > > + lock_page(page); > > + put_page(page); > > + if (page->mapping != mapping) { > > + /* The page got truncated from under us */ > > + ext4_journal_stop(handle); > > + ret = 0; > > + goto out; > > + } > > + > > if (inline_data) { > > BUFFER_TRACE(inode_bh, "get write access"); > > ret = ext4_journal_get_write_access(handle, inode_bh); > > @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page, > > NULL, bput_one); > > ext4_set_inode_state(inode, EXT4_STATE_JDATA); > > out: > > + unlock_page(page); > > +out_no_pagelock: > > brelse(inode_bh); > > return ret; > > } > > -- > > 2.3.0 > > > > This patch looks promising. I'm running a 1000 trial stress test on a > Pandaboard where I've generally been able to force a couple of manifestations > of this bug to appear within 5 to 10 runs. Applied to 4.1-rc7, it's passed > 135 trials cleanly. The full series will complete sometime tomorrow. > > I was also able to reproduce the problem on x86_64 pretty consistently in > four runs or less on 4.1-rc7; I'm planning a stress test there as well once > -rc8 regression is complete. > 1000 consecutive runs of generic/068 on the data_journal test case have completed successfully on an x86_64 guest and a Pandaboard running a 4.1-rc7 kernel with this patch. That's a couple of orders of magnitude more runs than have previously completed without hangs on these test system configurations, so Tested-by: Eric Whitney <enwlinux@gmail.com> if useful at this point. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2015-06-16 19:29 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-06-10 18:13 [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations Eric Whitney 2015-06-10 19:28 ` Theodore Ts'o 2015-06-11 20:00 ` Eric Whitney 2015-06-12 0:24 ` Eric Whitney 2015-06-15 1:14 ` Theodore Ts'o 2015-06-15 1:23 ` [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage() Theodore Ts'o 2015-06-15 12:33 ` Jan Kara 2015-06-15 13:06 ` Theodore Ts'o 2015-06-15 17:03 ` Jan Kara 2015-06-15 19:37 ` Theodore Ts'o 2015-06-15 17:59 ` Andreas Dilger 2015-06-16 12:57 ` Jan Kara 2015-06-15 20:39 ` Eric Whitney 2015-06-16 19:29 ` Eric Whitney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).