Delayed allocation and journal locking order inversion.

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* Delayed allocation and journal locking order inversion.
@ 2008-05-28  9:16 Aneesh Kumar K.V
  2008-05-28 10:08 ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread
From: Aneesh Kumar K.V @ 2008-05-28  9:16 UTC (permalink / raw)
  To: Jan Kara, Mingming Cao; +Cc: ext4 development

Hi Jan and Mingming,

I am observing hangs with the delalloc with locking order inversion
patches. I guess we can't start journal and call write_cache_pages.
The process get stuck as below

fsstress      D 00000008     0  2520      1
       c69c9d70 00000046 c69c9d28 00000008 c6a300a0 c69c50e0 c69c5244 c1210d80 
       00000000 c7a102a0 c69c50e0 c043c960 c69c9da8 c69c9d6c c0246fe8 00000000 
       00000000 00000000 c69c9da8 c1210d80 c69c9da8 c11c0998 c69c9d7c c043a8cb 
Call Trace:
 [<c043c960>] ? _spin_unlock_irqrestore+0x36/0x58
 [<c0246fe8>] ? blk_unplug+0x63/0x6b
 [<c043a8cb>] io_schedule+0x1e/0x28
 [<c014aac1>] sync_page+0x36/0x3a
 [<c043aa17>] __wait_on_bit_lock+0x30/0x59
 [<c014aa8b>] ? sync_page+0x0/0x3a
 [<c014aa77>] __lock_page+0x4e/0x56
 [<c01325a4>] ? wake_bit_function+0x0/0x43
 [<c014ffca>] write_cache_pages+0x120/0x296
 [<c018c516>] ? __mpage_da_writepage+0x0/0x105
 [<c043c89d>] ? _spin_unlock+0x27/0x3c
 [<c018bde8>] mpage_da_writepages+0x5c/0x7e
 [<c01faa8f>] ? jbd2_journal_start+0xce/0xf0
 [<c01faaa4>] ? jbd2_journal_start+0xe3/0xf0
 [<c01d893b>] ? ext4_da_get_block_write+0x0/0x151
 [<c01d8cc6>] ext4_da_writepages+0xbe/0x116
 [<c01d8c08>] ? ext4_da_writepages+0x0/0x116
 [<c015018a>] do_writepages+0x23/0x34
 [<c0180ffa>] __writeback_single_inode+0x12a/0x260
 [<c0181480>] sync_sb_inodes+0x18d/0x25b
 [<c01815d0>] sync_inodes_sb+0x82/0x94
 [<c0181638>] __sync_inodes+0x56/0x9c
 [<c0181692>] sync_inodes+0x14/0x2c
 [<c0183bc1>] do_sync+0x14/0x50
 [<c0183c0a>] sys_sync+0xd/0x13
 [<c0103931>] sysenter_past_esp+0x6a/0xb1

And kjournald is stuck because it is expecting journal_stop to be
called.

kjournald2    D 00007e87     0  2511      2
       c69b1f00 00000046 c69b1ec0 00007e87 c6a04140 c7ae9060 c7ae91c4 c1210d80 
       00000000 c7a10500 c7ae9060 c69b5c24 c69b1ef4 00000246 c01fbf82 00000246 
       c69b5c14 c79fe41c c69b5c14 c69b5c14 c79fe41c c69b5d54 c69b1f9c c01fbf87 
Call Trace:
 [<c01fbf82>] ? jbd2_journal_commit_transaction+0x139/0x1167
 [<c01fbf87>] jbd2_journal_commit_transaction+0x13e/0x1167
 [<c010811a>] ? native_sched_clock+0x92/0xa6
 [<c013a261>] ? lock_release_holdtime+0x50/0x56
 [<c013bff7>] ? trace_hardirqs_on+0xe9/0x111
 [<c0132571>] ? autoremove_wake_function+0x0/0x33
 [<c0129d5f>] ? try_to_del_timer_sync+0x48/0x4f
 [<c0200149>] kjournald2+0xac/0x1e3
 [<c0132571>] ? autoremove_wake_function+0x0/0x33
 [<c020009d>] ? kjournald2+0x0/0x1e3
 [<c01324af>] kthread+0x3b/0x64
 [<c0132474>] ? kthread+0x0/0x64
 [<c0104633>] kernel_thread_helper+0x7/0x10
 =======================


Most of the other process are now stuck on i_mutex. because one of the
journal_start is now stuck at

fsstress      D 00006e2e     0  2517      1
       c6985d64 00000046 c6985d24 00006e2e c697d0a0 c7a20080 c7a201e4 c1210d80 
       00000000 c7a52c40 c7a20080 c69b5c24 c6985d58 00000246 c01fa915 00000246 
       c69b5c14 c79fe3e0 c69b5c14 c69b5c8c c79fe3e0 c6985d84 c6985da4 c01fa91a 
Call Trace:
 [<c01fa915>] ? start_this_handle+0x2b1/0x35d
 [<c01fa91a>] start_this_handle+0x2b6/0x35d
 [<c0132571>] ? autoremove_wake_function+0x0/0x33
 [<c01faa68>] jbd2_journal_start+0xa7/0xf0
 [<c01dde10>] ext4_journal_start_sb+0x40/0x42
 [<c01d9120>] ext4_dirty_inode+0x24/0x8a
 [<c0181788>] __mark_inode_dirty+0x26/0x146
 [<c0179857>] file_update_time+0x8b/0xb4
 [<c014be0c>] __generic_file_aio_write_nolock+0x2fc/0x445
 [<c013be64>] ? mark_held_locks+0x41/0x5c
 [<c013bff7>] ? trace_hardirqs_on+0xe9/0x111
 [<c014bfab>] generic_file_aio_write+0x56/0xa9
 [<c01d3e22>] ext4_file_write+0x9c/0x110
 [<c0168129>] do_sync_write+0xab/0xe9
 [<c014aaea>] ? unlock_page+0x25/0x28
 [<c0132571>] ? autoremove_wake_function+0x0/0x33
 [<c010811a>] ? native_sched_clock+0x92/0xa6
 [<c013a261>] ? lock_release_holdtime+0x50/0x56
 [<c016807e>] ? do_sync_write+0x0/0xe9
 [<c0168805>] vfs_write+0x8c/0x108
 [<c0168d9e>] sys_write+0x3b/0x60
 [<c0103931>] sysenter_past_esp+0x6a/0xb1


The full dmesg log is at 
http://www.radian.org/~kvaneesh/ext4/delalloc-lockinversion/dmesg-1.log

Also starting journal in writepages make unmount throw lockdep errors.

unlink does journal_start and lock_super.
umount does lock_super and later it need to sync_inodes does writepages
which does a journal_start.

I guess we will have to rework the delalloc related changes.

-aneesh

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Delayed allocation and journal locking order inversion.
  2008-05-28  9:16 Delayed allocation and journal locking order inversion Aneesh Kumar K.V
@ 2008-05-28 10:08 ` Jan Kara
  2008-05-29  7:50   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2008-05-28 10:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Mingming Cao, ext4 development

[-- Attachment #1: Type: text/plain, Size: 3156 bytes --]

  Hi Aneesh,

  Thanks for testing!

On Wed 28-05-08 14:46:48, Aneesh Kumar K.V wrote:
> I am observing hangs with the delalloc with locking order inversion
> patches. I guess we can't start journal and call write_cache_pages.
  This should be fine after the lock inversion...

> The process get stuck as below
> 
> fsstress      D 00000008     0  2520      1
>        c69c9d70 00000046 c69c9d28 00000008 c6a300a0 c69c50e0 c69c5244 c1210d80 
>        00000000 c7a102a0 c69c50e0 c043c960 c69c9da8 c69c9d6c c0246fe8 00000000 
>        00000000 00000000 c69c9da8 c1210d80 c69c9da8 c11c0998 c69c9d7c c043a8cb 
> Call Trace:
>  [<c043c960>] ? _spin_unlock_irqrestore+0x36/0x58
>  [<c0246fe8>] ? blk_unplug+0x63/0x6b
>  [<c043a8cb>] io_schedule+0x1e/0x28
>  [<c014aac1>] sync_page+0x36/0x3a
>  [<c043aa17>] __wait_on_bit_lock+0x30/0x59
>  [<c014aa8b>] ? sync_page+0x0/0x3a
>  [<c014aa77>] __lock_page+0x4e/0x56
>  [<c01325a4>] ? wake_bit_function+0x0/0x43
>  [<c014ffca>] write_cache_pages+0x120/0x296
>  [<c018c516>] ? __mpage_da_writepage+0x0/0x105
>  [<c043c89d>] ? _spin_unlock+0x27/0x3c
>  [<c018bde8>] mpage_da_writepages+0x5c/0x7e
>  [<c01faa8f>] ? jbd2_journal_start+0xce/0xf0
>  [<c01faaa4>] ? jbd2_journal_start+0xe3/0xf0
>  [<c01d893b>] ? ext4_da_get_block_write+0x0/0x151
>  [<c01d8cc6>] ext4_da_writepages+0xbe/0x116
>  [<c01d8c08>] ? ext4_da_writepages+0x0/0x116
>  [<c015018a>] do_writepages+0x23/0x34
>  [<c0180ffa>] __writeback_single_inode+0x12a/0x260
>  [<c0181480>] sync_sb_inodes+0x18d/0x25b
>  [<c01815d0>] sync_inodes_sb+0x82/0x94
>  [<c0181638>] __sync_inodes+0x56/0x9c
>  [<c0181692>] sync_inodes+0x14/0x2c
>  [<c0183bc1>] do_sync+0x14/0x50
>  [<c0183c0a>] sys_sync+0xd/0x13
>  [<c0103931>] sysenter_past_esp+0x6a/0xb1
  The question here is, who is holding the lock from the page we wait
for here. The two processes you show below don't seem to hold it. I'll
check the full log ... searching ... I see!
  The problem is in generic_write_end()! It calls mark_inode_dirty() under
page lock. That can possibly start a new transaction (which happened in
your case) and that violates lock ordering (mark_inode_dirty() got stuck
waiting for journal commit which is stuck waiting for other user to do
journal_stop which waits for the page lock). Actually, there is no real
need to call mark_inode_dirty() from under page lock - we just need to
update i_size there. Something like the patch attached (untested)?

<snip>
> The full dmesg log is at 
> http://www.radian.org/~kvaneesh/ext4/delalloc-lockinversion/dmesg-1.log
> 
> Also starting journal in writepages make unmount throw lockdep errors.
> 
> unlink does journal_start and lock_super.
> umount does lock_super and later it need to sync_inodes does writepages
> which does a journal_start.
  Well, but isn't there this problem even without the lock inversion patch?
This is inversion between lock_super and journal_start. It hasn't been
changed by the lock inversion patch as far as I can tell. If you send me
lockdep traces I can have a look what we could do...

> I guess we will have to rework the delalloc related changes.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: vfs-2.6.25-generic_write_end.diff --]
[-- Type: text/x-patch, Size: 1520 bytes --]

commit 0553a5f120aeab4365c541d053482eb39e8c2d1a
Author: Jan Kara <jack@suse.cz>
Date:   Wed May 28 11:13:41 2008 +0200

    vfs: Move mark_inode_dirty() from under page lock in generic_write_end()
    
    There's no need to call mark_inode_dirty() under page lock in
    generic_write_end(). It unnecessarily makes hold time of page lock longer
    and more importantly it forces locking order of page lock and transaction
    start for journaling filesystems.
    
    Signed-off-by: Jan Kara <jack@suse.cz>

diff --git a/fs/buffer.c b/fs/buffer.c
index 177f2ac..2f86ca5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2062,6 +2062,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -2074,12 +2075,21 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	 */
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
-		mark_inode_dirty(inode);
+		i_size_changed = 1;
 	}
 
 	unlock_page(page);
 	page_cache_release(page);
 
+	/*
+	 * We don't mark inode dirty under page lock. First, it unnecessarily
+	 * makes the holding time of page lock longer. Second, it forces lock
+	 * ordering of page lock and transaction start for journaling
+	 * filesystems.
+	 */
+	if (i_size_changed)
+		mark_inode_dirty(inode);
+
 	return copied;
 }
 EXPORT_SYMBOL(generic_write_end);

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Delayed allocation and journal locking order inversion.
  2008-05-28 10:08 ` Jan Kara
@ 2008-05-29  7:50   ` Aneesh Kumar K.V
  2008-05-29 12:27     ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread
From: Aneesh Kumar K.V @ 2008-05-29  7:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: Mingming Cao, ext4 development

On Wed, May 28, 2008 at 12:08:33PM +0200, Jan Kara wrote:
>   Hi Aneesh,
>   The question here is, who is holding the lock from the page we wait
> for here. The two processes you show below don't seem to hold it. I'll
> check the full log ... searching ... I see!
>   The problem is in generic_write_end()! It calls mark_inode_dirty() under
> page lock. That can possibly start a new transaction (which happened in
> your case) and that violates lock ordering (mark_inode_dirty() got stuck
> waiting for journal commit which is stuck waiting for other user to do
> journal_stop which waits for the page lock). Actually, there is no real
> need to call mark_inode_dirty() from under page lock - we just need to
> update i_size there. Something like the patch attached (untested)?
> 

The patch works.   Peter Zijlstra have patches to add lockdep annotation
to lock_page. I guess we will have to test the lock inversion patch with
the lockdep annotation to catch the deadlock scenarios like above.

http://programming.kicks-ass.net/kernel-patches/lockdep-page_lock/

Regarding delalloc I still have issues. The writepage can get called
with buffer_head marked delay and dirty as show below. This will result
in block allocation under lock_page.

RIP: 0010:[<ffffffff8030cc64>]  [<ffffffff8030cc64>] ext4_da_writepage+0x26/0xad

Call Trace:
 [<ffffffff8026999f>] shrink_page_list+0x31e/0x588
 [<ffffffff80269d35>] shrink_inactive_list+0x12c/0x40d
 [<ffffffff805ce6a2>] ? _spin_unlock_irqrestore+0x3f/0x68
 [<ffffffff8024e83a>] ? trace_hardirqs_on+0xf1/0x115
 [<ffffffff805ce6af>] ? _spin_unlock_irqrestore+0x4c/0x68
 [<ffffffff803975b7>] ? __up_read+0x8c/0x94
 [<ffffffff8026a0f3>] shrink_zone+0xdd/0x103
 [<ffffffff8026ad69>] kswapd+0x34b/0x53e
 [<ffffffff80268e4d>] ? isolate_pages_global+0x0/0x34
 [<ffffffff80243ff4>] ? autoremove_wake_function+0x0/0x36
 [<ffffffff805ce6af>] ? _spin_unlock_irqrestore+0x4c/0x68
 [<ffffffff8026aa1e>] ? kswapd+0x0/0x53e
 [<ffffffff80243d1b>] kthread+0x44/0x6b
 [<ffffffff8020c1f8>] child_rip+0xa/0x12
 [<ffffffff8020b8e3>] ? restore_args+0x0/0x30
 [<ffffffff80243fcf>] ? kthreadd+0x16b/0x190
 [<ffffffff80243cd7>] ? kthread+0x0/0x6b
 [<ffffffff8020c1ee>] ? child_rip+0x0/0x12


-aneesh

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Delayed allocation and journal locking order inversion.
  2008-05-29  7:50   ` Aneesh Kumar K.V
@ 2008-05-29 12:27     ` Jan Kara
  0 siblings, 0 replies; 4+ messages in thread
From: Jan Kara @ 2008-05-29 12:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Mingming Cao, ext4 development

On Thu 29-05-08 13:20:56, Aneesh Kumar K.V wrote:
> On Wed, May 28, 2008 at 12:08:33PM +0200, Jan Kara wrote:
> >   Hi Aneesh,
> >   The question here is, who is holding the lock from the page we wait
> > for here. The two processes you show below don't seem to hold it. I'll
> > check the full log ... searching ... I see!
> >   The problem is in generic_write_end()! It calls mark_inode_dirty() under
> > page lock. That can possibly start a new transaction (which happened in
> > your case) and that violates lock ordering (mark_inode_dirty() got stuck
> > waiting for journal commit which is stuck waiting for other user to do
> > journal_stop which waits for the page lock). Actually, there is no real
> > need to call mark_inode_dirty() from under page lock - we just need to
> > update i_size there. Something like the patch attached (untested)?
> > 
> 
> The patch works.   Peter Zijlstra have patches to add lockdep annotation
> to lock_page. I guess we will have to test the lock inversion patch with
> the lockdep annotation to catch the deadlock scenarios like above.
>
> http://programming.kicks-ass.net/kernel-patches/lockdep-page_lock/
  Yes, that would be useful. Thanks for the pointer.

> Regarding delalloc I still have issues. The writepage can get called
> with buffer_head marked delay and dirty as show below. This will result
> in block allocation under lock_page.
> 
> RIP: 0010:[<ffffffff8030cc64>]  [<ffffffff8030cc64>] ext4_da_writepage+0x26/0xad
> 
> Call Trace:
>  [<ffffffff8026999f>] shrink_page_list+0x31e/0x588
>  [<ffffffff80269d35>] shrink_inactive_list+0x12c/0x40d
>  [<ffffffff805ce6a2>] ? _spin_unlock_irqrestore+0x3f/0x68
>  [<ffffffff8024e83a>] ? trace_hardirqs_on+0xf1/0x115
>  [<ffffffff805ce6af>] ? _spin_unlock_irqrestore+0x4c/0x68
>  [<ffffffff803975b7>] ? __up_read+0x8c/0x94
>  [<ffffffff8026a0f3>] shrink_zone+0xdd/0x103
>  [<ffffffff8026ad69>] kswapd+0x34b/0x53e
>  [<ffffffff80268e4d>] ? isolate_pages_global+0x0/0x34
>  [<ffffffff80243ff4>] ? autoremove_wake_function+0x0/0x36
>  [<ffffffff805ce6af>] ? _spin_unlock_irqrestore+0x4c/0x68
>  [<ffffffff8026aa1e>] ? kswapd+0x0/0x53e
>  [<ffffffff80243d1b>] kthread+0x44/0x6b
>  [<ffffffff8020c1f8>] child_rip+0xa/0x12
>  [<ffffffff8020b8e3>] ? restore_args+0x0/0x30
>  [<ffffffff80243fcf>] ? kthreadd+0x16b/0x190
>  [<ffffffff80243cd7>] ? kthread+0x0/0x6b
>  [<ffffffff8020c1ee>] ? child_rip+0x0/0x12
  I see two options here:
1) Just refuse to write the page if we see we have to do block allocation,
   there's no transaction running and for_reclaim is set (or we could even
   refuse the write if getting a new handle would mean blocking but that
   starts to get ugly). The page will be eventually written out via
   writepages call as far as I know.
2) Do similar tricks as in ext4_journaled_writepage() if we see we need to
   start a transaction - i.e., unlock the page, start the transaction, lock
   the page again, check that it's still the page we are interested in,
   proceed...

Option 2) has the disadvantage that when we are looking for a page to evict
(which usually means we are under memory pressure), we do complicated
locking which may be slow. 1) has the disadvantage that there can be
substantial portion of memory we will refuse to write out... I'm not sure
what is better.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2008-05-29 12:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-28  9:16 Delayed allocation and journal locking order inversion Aneesh Kumar K.V
2008-05-28 10:08 ` Jan Kara
2008-05-29  7:50   ` Aneesh Kumar K.V
2008-05-29 12:27     ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox