All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ben Greear <greearb@candelatech.com>
To: Stephen Hemminger <shemminger@vyatta.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: 2.6.34.1+hacks:  kernel deadlock related to file-IO?
Date: Wed, 28 Jul 2010 11:14:57 -0700	[thread overview]
Message-ID: <4C5073A1.8090006@candelatech.com> (raw)
In-Reply-To: <20100727231427.1f3eb97b@nehalam>

On 07/27/2010 11:14 PM, Stephen Hemminger wrote:
> On Tue, 27 Jul 2010 17:14:53 -0700
> Ben Greear<greearb@candelatech.com>  wrote:
>
>> On 07/27/2010 04:51 PM, Stephen Hemminger wrote:
>>> On Tue, 27 Jul 2010 16:11:47 -0700
>>> Ben Greear<greearb@candelatech.com>   wrote:
>>>
>>>> NOTE:  This is from a hacked 2.6.34.1 kernel, with
>>>> a module of my own devising loaded and in active use.
>>>
>>> Ben please don't ask LKML for help with problems when
>>> you have a binary only module loaded...
>>
>> Yeah, I know..but I spent all day hacking .34 enough to
>> run lockdep and get those backtraces..and they *might* be
>> valid.  There's a small chance someone else has seen similar
>> things but not reported it for whatever reason, so maybe
>> they will chime in.
>>
>> And finally, I will attempt to reproduce with some official
>> kernels, and will report what I find one way or another.
>>
>> The .34 kernel has been a real bitch to get stable  (You
>> know things must be bad when my name's on -stable patches),
>> so I wouldn't be surprised if I'm hitting another real bug.
>
> Understand your pain. The hardest problem is often finding a reproducible case.
> If the problem is not reproducible it is hard to add trace info
> or use bisection.

Well, I'm having no luck reproducing w/out my module, but I've
also seen runs of > 1 hour with my module and no problems, so maybe
just not getting lucky yet.

With it loaded, it's reproducible but not too often, and can run
clean for hours before I can see the hung tasks.  They are always hanging in
the same places.  Also, if I do something like 'sysrq t',
then it seems to pause the system long enough for the blocked
processes to recover, and if I manage to stop my traffic
generator (which takes virtually all load off the system),
then it will also recover.

So, it's not a full deadlock..just a very long blockage.  When
the system is 'locked', some console access works just fine,
including ssh logins.  But, if I rerun one of the exact commands
that is hung (say, ls foo/ > /tmp/foo_listing.txt), then that process
will also hang in D state, with stack-trace showing the hang in
do_get_write_access.

Without my module, the total traffic load is less, and there is
one less process using 100% of a CPU.  And of course it could
be doing any number of buggy things...just that I'm having a real
hard time figuring out what it might possibly be doing to
cause the hangs I see.


Stack traces for 3 of the processes locked > 2 minutes and lockdep lock
printouts below if anyone is interested.


I'm going to keep trying to find a way to reproduce on standard-ish
kernels.


Thanks,
Ben



Jul 28 10:24:45 localhost kernel: SysRq : Show Locks Held
Jul 28 10:24:45 localhost kernel:
Jul 28 10:24:45 localhost kernel: Showing all locks held in the system:
Jul 28 10:24:45 localhost kernel: 2 locks held by flush-253:0/243:
Jul 28 10:24:45 localhost kernel: #0:  (&type->s_umount_key#19){++++..}, at: [<ffffffff810fa51a>] writeback_inodes_wb+0x275/0x553
Jul 28 10:24:45 localhost kernel: #1:  (jbd2_handle){+.+...}, at: [<ffffffff81178a20>] start_this_handle+0x4d7/0x539
Jul 28 10:24:45 localhost kernel: 1 lock held by bash/2723:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by bash/2753:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by mingetty/2990:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by mingetty/2991:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by mingetty/2992:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by mingetty/2994:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by mingetty/2996:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by mingetty/2997:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by bash/3451:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 1 lock held by bash/14171:
Jul 28 10:24:45 localhost kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: [<ffffffff8124d968>] n_tty_read+0x27c/0x785
Jul 28 10:24:45 localhost kernel: 3 locks held by ip/21154:
Jul 28 10:24:45 localhost kernel: #0:  (jbd2_handle){+.+...}, at: [<ffffffff81178a20>] start_this_handle+0x4d7/0x539
Jul 28 10:24:45 localhost kernel: #1:  (&ei->i_data_sem){++++..}, at: [<ffffffff81151b3d>] ext4_get_blocks+0xe4/0x1bb
Jul 28 10:24:45 localhost kernel: #2:  (&lg->lg_mutex){+.+...}, at: [<ffffffff81167ab8>] ext4_mb_initialize_context+0x177/0x186
Jul 28 10:24:45 localhost kernel: 4 locks held by sh/21158:
Jul 28 10:24:45 localhost kernel: #0:  (&sb->s_type->i_mutex_key#10){+.+.+.}, at: [<ffffffff810ddd14>] do_truncate+0x5b/0x84
Jul 28 10:24:45 localhost kernel: #1:  (&sb->s_type->i_alloc_sem_key#5){+.+...}, at: [<ffffffff810f2771>] notify_change+0x16c/0x2a3
Jul 28 10:24:45 localhost kernel: #2:  (jbd2_handle){+.+...}, at: [<ffffffff81178a20>] start_this_handle+0x4d7/0x539
Jul 28 10:24:45 localhost kernel: #3:  (&ei->i_data_sem){++++..}, at: [<ffffffff81164aa8>] ext4_ext_truncate+0x93/0x78f
Jul 28 10:24:45 localhost kernel: 4 locks held by sh/21160:
Jul 28 10:24:45 localhost kernel: #0:  (&sb->s_type->i_mutex_key#10){+.+.+.}, at: [<ffffffff810ddd14>] do_truncate+0x5b/0x84
Jul 28 10:24:45 localhost kernel: #1:  (&sb->s_type->i_alloc_sem_key#5){+.+...}, at: [<ffffffff810f2771>] notify_change+0x16c/0x2a3
Jul 28 10:24:45 localhost kernel: #2:  (jbd2_handle){+.+...}, at: [<ffffffff81178a20>] start_this_handle+0x4d7/0x539
Jul 28 10:24:45 localhost kernel: #3:  (&ei->i_data_sem){++++..}, at: [<ffffffff81164aa8>] ext4_ext_truncate+0x93/0x78f
Jul 28 10:24:45 localhost kernel:
Jul 28 10:24:45 localhost kernel: =============================================
Jul 28 10:24:45 localhost kernel:


[root@simech2 ~]# cat /proc/21160/stack
[<ffffffff81177dea>] do_get_write_access+0x22d/0x422
[<ffffffff811780de>] jbd2_journal_get_write_access+0x26/0x3c
[<ffffffff811654bd>] __ext4_journal_get_write_access+0x50/0x5a
[<ffffffff81169981>] ext4_free_blocks+0x39c/0x712
[<ffffffff81164d8a>] ext4_ext_truncate+0x375/0x78f
[<ffffffff8115360f>] ext4_truncate+0x95/0x54e
[<ffffffff810a959b>] vmtruncate+0x47/0x52
[<ffffffff810f250b>] inode_setattr+0x2b/0x125
[<ffffffff81153d42>] ext4_setattr+0x27a/0x2fa
[<ffffffff810f278e>] notify_change+0x189/0x2a3
[<ffffffff810ddd20>] do_truncate+0x67/0x84
[<ffffffff810e8e40>] do_last+0x5b4/0x5b9
[<ffffffff810e9282>] do_filp_open+0x1f3/0x601
[<ffffffff810dd091>] do_sys_open+0x5e/0x10a
[<ffffffff810dd166>] sys_open+0x1b/0x1d
[<ffffffff81009a82>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


[root@simech2 ~]# cat /proc/21154/stack
[<ffffffff81177dea>] do_get_write_access+0x22d/0x422
[<ffffffff811780de>] jbd2_journal_get_write_access+0x26/0x3c
[<ffffffff811654bd>] __ext4_journal_get_write_access+0x50/0x5a
[<ffffffff811676f5>] ext4_mb_mark_diskspace_used+0x65/0x2b1
[<ffffffff8116b94a>] ext4_mb_new_blocks+0x19e/0x308
[<ffffffff811646de>] ext4_ext_get_blocks+0x133f/0x1676
[<ffffffff81151b76>] ext4_get_blocks+0x11d/0x1bb
[<ffffffff81152050>] mpage_da_map_blocks+0xc2/0x334
[<ffffffff811525c0>] ext4_da_writepages+0x2fe/0x499
[<ffffffff810a6bed>] do_writepages+0x1f/0x28
[<ffffffff810a08d8>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810a0ddd>] filemap_flush+0x17/0x19
[<ffffffff81150ec5>] ext4_alloc_da_blocks+0x2b/0x2d
[<ffffffff8114b81c>] ext4_release_file+0x28/0xa6
[<ffffffff810dfddf>] __fput+0x120/0x1cd
[<ffffffff810dfea1>] fput+0x15/0x17
[<ffffffff810dcf1f>] filp_close+0x63/0x6d
[<ffffffff81043b1f>] put_files_struct+0xb1/0x15f
[<ffffffff81043c03>] exit_files+0x36/0x3b
[<ffffffff810454e5>] do_exit+0x21c/0x6f4
[<ffffffff81045a30>] do_group_exit+0x73/0x9c
[<ffffffff81045a6b>] sys_exit_group+0x12/0x16
[<ffffffff81009a82>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff



[root@simech2 ~]# cat /proc/243/stack
[<ffffffff8109fcfd>] sync_page+0x44/0x48
[<ffffffff8109fc6f>] __lock_page+0x61/0x68
[<ffffffff810a6a17>] write_cache_pages+0x1a9/0x33c
[<ffffffff811525a4>] ext4_da_writepages+0x2e2/0x499
[<ffffffff810a6bed>] do_writepages+0x1f/0x28
[<ffffffff810f9b71>] writeback_single_inode+0xe5/0x2e0
[<ffffffff810fa6c9>] writeback_inodes_wb+0x424/0x553
[<ffffffff810fa92c>] wb_writeback+0x134/0x1bb
[<ffffffff810fabad>] wb_do_writeback+0x14c/0x162
[<ffffffff810fabfd>] bdi_writeback_task+0x3a/0xbb
[<ffffffff810b1dd0>] bdi_start_fn+0x63/0xc7
[<ffffffff81056737>] kthread+0x7a/0x82
[<ffffffff8100a924>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff






-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


      reply	other threads:[~2010-07-28 18:15 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-27 23:11 2.6.34.1+hacks: kernel deadlock related to file-IO? Ben Greear
2010-07-27 23:51 ` Stephen Hemminger
2010-07-28  0:14   ` Ben Greear
2010-07-28  6:14     ` Stephen Hemminger
2010-07-28 18:14       ` Ben Greear [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4C5073A1.8090006@candelatech.com \
    --to=greearb@candelatech.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=shemminger@vyatta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.