[PATCH] ext4: fix potential deadlock in ext4_nonda

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
@ 2012-09-20  3:23 Theodore Ts'o
  2012-09-21 22:12 ` Eric Sandeen
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2012-09-20  3:23 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o, stable

In ext4_nonda_switch(), if the file system is getting full we used to
call writeback_inodes_sb_if_idle().  The problem is that we can be
holding i_mutex already, and this causes a potential deadlock when
writeback_inodes_sb_if_idle() when it tries to take s_umount.  (See
lockdep output below).

As it turns out we don't need need to hold s_umount; the fact that we
are in the middle of the write(2) system call will keep the superblock
pinned.  So we can just call writeback_inodes_sb() directly without
taking s_umount first.

   [ INFO: possible circular locking dependency detected ]
   3.6.0-rc1-00042-gce894ca #367 Not tainted
   -------------------------------------------------------
   dd/8298 is trying to acquire lock:
    (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46

   but task is already holding lock:
    (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3

   which lock already depends on the new lock.

   2 locks held by dd/8298:
    #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
    #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3

   stack backtrace:
   Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
   Call Trace:
    [<c015b79c>] ? console_unlock+0x345/0x372
    [<c06d62a1>] print_circular_bug+0x190/0x19d
    [<c019906c>] __lock_acquire+0x86d/0xb6c
    [<c01999db>] ? mark_held_locks+0x5c/0x7b
    [<c0199724>] lock_acquire+0x66/0xb9
    [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
    [<c06db935>] down_read+0x28/0x58
    [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
    [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
    [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
    [<c0271ece>] ext4_da_write_begin+0x27/0x193
    [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
    [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
    [<c01ddce7>] generic_file_aio_write+0x78/0xd3
    [<c026d336>] ext4_file_write+0x480/0x4a6
    [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
    [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
    [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
    [<c018099f>] ? local_clock+0x37/0x4e
    [<c0209f2c>] do_sync_write+0x67/0x9d
    [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
    [<c020a7b9>] vfs_write+0x7b/0xe6
    [<c020a9a6>] sys_write+0x3b/0x64
    [<c06dd4bd>] syscall_call+0x7/0xb

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
---
 fs/ext4/inode.c   | 4 ++--
 fs/fs-writeback.c | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0a9a89e..d4c60f2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2449,8 +2449,8 @@ static int ext4_nonda_switch(struct super_block *sb)
 	 * Even if we don't switch but are nearing capacity,
 	 * start pushing delalloc when 1/2 of free blocks are dirty.
 	 */
-	if (free_blocks < 2 * dirty_blocks)
-		writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
+	if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
+		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
 
 	return 0;
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be3efc4..5602d73 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -63,6 +63,7 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 {
 	return test_bit(BDI_writeback_running, &bdi->state);
 }
+EXPORT_SYMBOL(writeback_in_progress);
 
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
-- 
1.7.12.rc0.22.gcdd159b


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
  2012-09-20  3:23 [PATCH] ext4: fix potential deadlock in ext4_nonda_switch() Theodore Ts'o
@ 2012-09-21 22:12 ` Eric Sandeen
  2012-09-21 23:59   ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2012-09-21 22:12 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, stable

On 9/19/12 10:23 PM, Theodore Ts'o wrote:
> In ext4_nonda_switch(), if the file system is getting full we used to
> call writeback_inodes_sb_if_idle().  The problem is that we can be
> holding i_mutex already, and this causes a potential deadlock when
> writeback_inodes_sb_if_idle() when it tries to take s_umount.  (See
> lockdep output below).
> 
> As it turns out we don't need need to hold s_umount; the fact that we
> are in the middle of the write(2) system call will keep the superblock
> pinned.  So we can just call writeback_inodes_sb() directly without
> taking s_umount first.
> 
>    [ INFO: possible circular locking dependency detected ]
>    3.6.0-rc1-00042-gce894ca #367 Not tainted
>    -------------------------------------------------------
>    dd/8298 is trying to acquire lock:
>     (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
> 
>    but task is already holding lock:
>     (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
> 
>    which lock already depends on the new lock.
> 
>    2 locks held by dd/8298:
>     #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
>     #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
> 
>    stack backtrace:
>    Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
>    Call Trace:
>     [<c015b79c>] ? console_unlock+0x345/0x372
>     [<c06d62a1>] print_circular_bug+0x190/0x19d
>     [<c019906c>] __lock_acquire+0x86d/0xb6c
>     [<c01999db>] ? mark_held_locks+0x5c/0x7b
>     [<c0199724>] lock_acquire+0x66/0xb9
>     [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
>     [<c06db935>] down_read+0x28/0x58
>     [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
>     [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
>     [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
>     [<c0271ece>] ext4_da_write_begin+0x27/0x193
>     [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
>     [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
>     [<c01ddce7>] generic_file_aio_write+0x78/0xd3
>     [<c026d336>] ext4_file_write+0x480/0x4a6
>     [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
>     [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
>     [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
>     [<c018099f>] ? local_clock+0x37/0x4e
>     [<c0209f2c>] do_sync_write+0x67/0x9d
>     [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
>     [<c020a7b9>] vfs_write+0x7b/0xe6
>     [<c020a9a6>] sys_write+0x3b/0x64
>     [<c06dd4bd>] syscall_call+0x7/0xb
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: stable@vger.kernel.org
> ---
>  fs/ext4/inode.c   | 4 ++--
>  fs/fs-writeback.c | 1 +
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0a9a89e..d4c60f2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2449,8 +2449,8 @@ static int ext4_nonda_switch(struct super_block *sb)
>  	 * Even if we don't switch but are nearing capacity,
>  	 * start pushing delalloc when 1/2 of free blocks are dirty.
>  	 */
> -	if (free_blocks < 2 * dirty_blocks)
> -		writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> +	if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
> +		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

Looks to me like this inverts the logic.

We used to write back if idle, now we fire it off if it's already underway.

Shouldn't it be:

+	if ((free_blocks < 2 * dirty_blocks) && !writeback_in_progress(sb->s_bdi))
+		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

-Eric

>  
>  	return 0;
>  }
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index be3efc4..5602d73 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -63,6 +63,7 @@ int writeback_in_progress(struct backing_dev_info *bdi)
>  {
>  	return test_bit(BDI_writeback_running, &bdi->state);
>  }
> +EXPORT_SYMBOL(writeback_in_progress);
>  
>  static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
>  {
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
  2012-09-21 22:12 ` Eric Sandeen
@ 2012-09-21 23:59   ` Theodore Ts'o
  2012-09-27 17:18     ` Dmitry Monakhov
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2012-09-21 23:59 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ext4 Developers List, stable

On Fri, Sep 21, 2012 at 05:12:05PM -0500, Eric Sandeen wrote:
> > -	if (free_blocks < 2 * dirty_blocks)
> > -		writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> > +	if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
> > +		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
> 
> Looks to me like this inverts the logic.
> 
> We used to write back if idle, now we fire it off if it's already underway.
> 
> Shouldn't it be:
> 
> +	if ((free_blocks < 2 * dirty_blocks) && !writeback_in_progress(sb->s_bdi))
> +		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

Oops, nice catch.  Thanks for the review!!

I've added the missing '!' to the patch.

     	   					- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
  2012-09-21 23:59   ` Theodore Ts'o
@ 2012-09-27 17:18     ` Dmitry Monakhov
  2012-09-27 18:09       ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Monakhov @ 2012-09-27 17:18 UTC (permalink / raw)
  To: Theodore Ts'o, Eric Sandeen; +Cc: Ext4 Developers List, stable

On Fri, 21 Sep 2012 19:59:12 -0400, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Sep 21, 2012 at 05:12:05PM -0500, Eric Sandeen wrote:
> > > -	if (free_blocks < 2 * dirty_blocks)
> > > -		writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> > > +	if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
> > > +		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
> > 
> > Looks to me like this inverts the logic.
> > 
> > We used to write back if idle, now we fire it off if it's already underway.
> > 
> > Shouldn't it be:
> > 
> > +	if ((free_blocks < 2 * dirty_blocks) && !writeback_in_progress(sb->s_bdi))
> > +		writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
> 
> Oops, nice catch.  Thanks for the review!!
> 
> I've added the missing '!' to the patch.
Hmm... even this '!' patch is still not good. I've got that complain
WARNING: at fs/fs-writeback.c:1316 writeback_inodes_sb_nr+0x1a3/0x200()
Hardware name:         
Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode
sg xhci_hcd button ext3 jbd mbcache sd_mod crc_t10dif aesni_intel
ablk_helper cryptd aes_x86_64 aes_generic ahci libahci pata_acpi
ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 1896, comm: fio Not tainted 3.6.0-rc1+ #77
Call Trace:
 [<ffffffff81069433>] warn_slowpath_common+0xc3/0xf0
 [<ffffffff8106947a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81253a13>] writeback_inodes_sb_nr+0x1a3/0x200
 [<ffffffff81253f4e>] writeback_inodes_sb+0x2e/0x40
 [<ffffffffa03d73d9>] ext4_nonda_switch+0xe9/0x120 [ext4]
 [<ffffffffa03dd2ea>] ext4_da_write_begin+0x4a/0x330 [ext4]
 [<ffffffff811874ca>] generic_perform_write+0x11a/0x360
 [<ffffffff810dc9f7>] ? current_kernel_time+0x97/0xb0
 [<ffffffff81187771>] generic_file_buffered_write+0x61/0xd0
 [<ffffffff8118b98a>] __generic_file_aio_write+0x6ca/0x820
 [<ffffffff8118bb93>] generic_file_aio_write+0xb3/0x150
 [<ffffffffa03d2785>] ext4_file_write+0x155/0x190 [ext4]
 [<ffffffffa03d2630>] ? ext4_file_dio_write+0x550/0x550 [ext4]
 [<ffffffff81281f6e>] aio_rw_vect_retry+0xce/0x200
 [<ffffffff81281ea0>] ? aio_advance_iovec+0x130/0x130
 [<ffffffff81281ea0>] ? aio_advance_iovec+0x130/0x130
 [<ffffffff812832b6>] aio_run_iocb+0xd6/0x2e0
 [<ffffffff817397c8>] io_submit_one+0x38a/0x3ff
 [<ffffffff81284c8e>] do_io_submit+0x2be/0x3d0
 [<ffffffff81284db0>] sys_io_submit+0x10/0x20
 [<ffffffff8175c8e9>] system_call_fastpath+0x16/0x1b
---[ end trace 586bc928292ed61e ]---

> 
>      	   					- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
  2012-09-27 17:18     ` Dmitry Monakhov
@ 2012-09-27 18:09       ` Theodore Ts'o
  2012-09-27 18:16         ` Dmitry Monakhov
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2012-09-27 18:09 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Eric Sandeen, Ext4 Developers List, stable

On Thu, Sep 27, 2012 at 09:18:30PM +0400, Dmitry Monakhov wrote:
> Hmm... even this '!' patch is still not good. I've got that complain
> WARNING: at fs/fs-writeback.c:1316 writeback_inodes_sb_nr+0x1a3/0x200()

This WARN_ON checking to make sure the s_umount mutex is grabbed.

What kernel are you applying this patch against?  Commit 14da92001:
"fs: Protect write paths by sb_start_write - sb_end_write" adds a call
to sb_start_write() in generic_file_aio_write() which means by the
time we call ext4_nonda_switch(), s_umount is grabbed, which probably
explains why I'm not seeing this --- 14da92001 was added to mainline
as of 3.6-rc1, and I'm guessing you're using an older kernel?

This is something we need to consider, though, since "ext4: fix
potential deadlock in ext4_nonda_switch()" is currently marked for
stable, but commit 14da92001 is not marked for stable.  So if this
commit gets backported w/o 14da92001, stable kernel users will see
this warning.

						- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
  2012-09-27 18:09       ` Theodore Ts'o
@ 2012-09-27 18:16         ` Dmitry Monakhov
  2012-09-27 20:14           ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Monakhov @ 2012-09-27 18:16 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Eric Sandeen, Ext4 Developers List, stable

On Thu, 27 Sep 2012 14:09:49 -0400, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Sep 27, 2012 at 09:18:30PM +0400, Dmitry Monakhov wrote:
> > Hmm... even this '!' patch is still not good. I've got that complain
> > WARNING: at fs/fs-writeback.c:1316 writeback_inodes_sb_nr+0x1a3/0x200()
> 
> This WARN_ON checking to make sure the s_umount mutex is grabbed.
> 
> What kernel are you applying this patch against?  Commit 14da92001:
I use b1725167c38ab current ext4.git/dev head.
> "fs: Protect write paths by sb_start_write - sb_end_write" adds a call
> to sb_start_write() in generic_file_aio_write() which means by the
> time we call ext4_nonda_switch(), s_umount is grabbed, which probably
> explains why I'm not seeing this --- 14da92001 was added to mainline
> as of 3.6-rc1, and I'm guessing you're using an older kernel?
> 

> This is something we need to consider, though, since "ext4: fix
> potential deadlock in ext4_nonda_switch()" is currently marked for
> stable, but commit 14da92001 is not marked for stable.  So if this
> commit gets backported w/o 14da92001, stable kernel users will see
> this warning.
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()
  2012-09-27 18:16         ` Dmitry Monakhov
@ 2012-09-27 20:14           ` Theodore Ts'o
  0 siblings, 0 replies; 7+ messages in thread
From: Theodore Ts'o @ 2012-09-27 20:14 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Eric Sandeen, Ext4 Developers List, stable

Ah, you're right.  I *can* easily see the problem; somehow I just
didn't notice it, but when I look back at my test logs, it's
definitely there.  I thought sb_start_write() took s_umount, but it
doesn't, because it was trying to solve the exact same lock ordering
problem.  Drat....

Unfortunately, there doesn't seem to be a good solution here.  The two
options I see is either (a) create another version of
writeback_inodes_sb() which doesn't have the WARN_ON check in
fs/fs-writeback.c (the warning is a false positive, since there are
other mechanisms which protect the superblock from being unmounted
while the write system call is in progress), or (b) call
writeback_inodes_sb() out of a workqueue.

		      	       - Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-09-27 20:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-20  3:23 [PATCH] ext4: fix potential deadlock in ext4_nonda_switch() Theodore Ts'o
2012-09-21 22:12 ` Eric Sandeen
2012-09-21 23:59   ` Theodore Ts'o
2012-09-27 17:18     ` Dmitry Monakhov
2012-09-27 18:09       ` Theodore Ts'o
2012-09-27 18:16         ` Dmitry Monakhov
2012-09-27 20:14           ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).