All of lore.kernel.org
 help / color / mirror / Atom feed
* Disk output lockup amd64 nforce4 device-mapper 2.6.12_rc2
@ 2005-04-20 21:51 Mikael Andersson
  2005-04-21  9:06 ` Disk output lockup 2.6.12_rc2 2.6.11.7 Mikael Andersson
  0 siblings, 1 reply; 4+ messages in thread
From: Mikael Andersson @ 2005-04-20 21:51 UTC (permalink / raw)
  To: linux-kernel

During heavy io-load a lockup occurs that appears to prevent any disk
output from taking place. fs is reiserfs on two device-mapper mirrored
200G maxtor disks. After the lockup occurs you can to things like 'ls',
but echo > test.txt will hang.

A typical workload producing the error is doing:
rsync of large (1GB) over 100Mbit ethernet
simultaneous compilation / gunzip

I've disabled preemption, and tried with and without acpi enabled, with
and without smp support (it was smp by default so i switched it off).
Also tried with another nic (rtl8139) since i got an nv_stop_tx:
TransmitterStatus remained busy<6> in the logs. I also tried 2.6.11.7
with the same result.

This produces the lockup after 10-40 minutes. I've got a complete log
from alt-sysreq-T if anyone wants to look at it. The following is a dump
for the bash and kmirrord process when trying echo > test.txt.
The output for kmirrord is included because it's exactly as specified
below both before and after echo > test.txt is invoked, this looked
peculiar to me.

Later runs showed what was to my eyes similar results, some callpaths
weren't exactly the same, but kmirrord was somewhere in mempool_alloc
and the bash process was in do_journal_end, whatever that means :)


bash D ffffc2000038e000 0 6529 6381 (NOTLB)
ffff81003dfc1bd8 0000000000000086
0000000000000000 ffff810001b6ba60
6db6db6db6db6db7 ffffffff801acc46
ffff810001b6ba98 ffff81003d8c27e0
00000000000027cd ffffffff805a61c0

Call Trace:
<ffffffff801acc46>{__d_lookup+358} <ffffffff801ff133>{queue_log_writer+115}
<ffffffff8012e6a0>{default_wake_function+0}
<ffffffff801ff9fc>{do_journal_end+604}
<ffffffff801e0162>{search_by_entry_key+50}
<ffffffff802004ed>{do_journal_begin_r+77}
<ffffffff802006df>{do_journal_begin_r+575}
<ffffffff8020096c>{journal_begin+204}
<ffffffff801e10d7>{reiserfs_create+183}
<ffffffff801a0dcc>{vfs_create+188}
<ffffffff801a11e6>{open_namei+454}
<ffffffff8018a727>{filp_open+39}
<ffffffff8018a82f>{get_unused_fd+223}
<ffffffff8019e39a>{getname+138}
<ffffffff8018abfc>{sys_open+76}
<ffffffff8010dfe6>{system_call+126}


kmirrord/0 D ffff81003f1bccd8 0 978 9 1731 977 (L-TLB)
ffff81003f931ab8 0000000000000046
0000000000000200 ffffffff8016a2d6
ffff810036d18e80 0000000000000001
000000733f931c30 ffff81003f8d4030
000000000000952e ffff810036cf67e0

Call Trace:
<ffffffff8016a2d6>{cache_alloc_refill+1222}
<ffffffff804a2f9f>{io_schedule+15}
<ffffffff80164e23>{mempool_alloc+707} <ffffffff80413250>{bvec_get_page+0}
<ffffffff801543c0>{autoremove_wake_function+0}
<ffffffff801543c0>{autoremove_wake_function+0}
<ffffffff8041cffc>{__rh_alloc+44}
<ffffffff8041d483>{rh_inc_pending+67}
<ffffffff8041e39b>{do_work+2955}
<ffffffff8041e3c8>{do_work+3000}
<ffffffff804a16b4>{thread_return+0}
<ffffffff8041d810>{do_work+0}
<ffffffff8014ce3b>{worker_thread+715}
<ffffffff8012e6a0>{default_wake_function+0}
<ffffffff8012e6a0>{default_wake_function+0}
<ffffffff8014cb70>{worker_thread+0}
<ffffffff801537ed>{kthread+205} <ffffffff80153830>{keventd_create_kthread+0}
<ffffffff8010eb1f>{child_rip+8} <ffffffff80153830>{keventd_create_kthread+0}
<ffffffff80153720>{kthread+0}
<ffffffff8010eb17>{child_rip+0}

/Mikael Andersson

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk output lockup 2.6.12_rc2 2.6.11.7
  2005-04-20 21:51 Disk output lockup amd64 nforce4 device-mapper 2.6.12_rc2 Mikael Andersson
@ 2005-04-21  9:06 ` Mikael Andersson
  2005-05-26  6:38   ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Mikael Andersson @ 2005-04-21  9:06 UTC (permalink / raw)
  To: linux-kernel

Mikael Andersson wrote:
> During heavy io-load a lockup occurs that appears to prevent any disk
> output from taking place. fs is reiserfs on two device-mapper mirrored
> 200G maxtor disks. After the lockup occurs you can to things like 'ls',
> but echo > test.txt will hang.

fs is now ext3

> 
> A typical workload producing the error is doing:
> rsync of large (1GB) over 100Mbit ethernet
> simultaneous compilation / gunzip

Or almost anything that writes something to the disk.

> 
> I've disabled preemption, and tried with and without acpi enabled, with
> and without smp support (it was smp by default so i switched it off).
> Also tried with another nic (rtl8139) since i got an nv_stop_tx:
> TransmitterStatus remained busy<6> in the logs. I also tried 2.6.11.7
> with the same result.

 Tried converting to ext3, some problem, albeit the lockups are less
severe. More of the locked processes can be killed and echo > test.txt
works. So _some_ io gets through.
 The output from sysrq-T is somewhat less confusing though, it appears
then hung processes is somehow being hung in __generic_unplug_device,  i
had a look at the assembler, but couldn't make heads or tails of it. the
code at __generic_unplug_device+19 was test %eax,%eax immediately
preceded by a callq to the test instruction. Obviously something magic
(by my eyes) is going on here.

 Also tried 2.6.12_rc3-mm3

 I'd really like to find a solution to this since it kinda borks the
nice an shiny machine if it can't handle large files without getting
into trouble.

 I've been working on this for two days, have been trying to find
similar bug reports, trying a lot of different kernels and kernel
options to no avail.
 I'm a little out of options right now, any ideas for something to try,
patches to test, or some help in understanding what's happening ?


kmirrord/0 D ffff81003f1bccd8 0 978 9 1731 977 (L-TLB)
Call Trace:
<ffffffff8016a2d6>{cache_alloc_refill+1222}
<ffffffff804a2f9f>{io_schedule+15}
--
kjournald D ffff81003e94bcd8 0 1748 1 2060 953 (L-TLB)
Call Trace:
<ffffffff802e9c13>{__generic_unplug_device+19}
<ffffffff802e9cfd>{generic_unplug_device+189}
--
rsync D 000000701553dccf 0 6903 6901 (NOTLB)
Call Trace:
<ffffffff802e9c13>{__generic_unplug_device+19}
<ffffffff802e9cfd>{generic_unplug_device+189}
--
x86_64-pc-lin D 0000006dc7d23e49 0 13785 13742 (NOTLB)
Call Trace:
<ffffffff802e9cfd>{generic_unplug_device+189}
<ffffffff8040e3ad>{dm_unplug_all+29}

/Mikael Andersson

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk output lockup 2.6.12_rc2 2.6.11.7
  2005-04-21  9:06 ` Disk output lockup 2.6.12_rc2 2.6.11.7 Mikael Andersson
@ 2005-05-26  6:38   ` Andrew Morton
  2005-05-30 10:30     ` Mikael Andersson
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2005-05-26  6:38 UTC (permalink / raw)
  To: Mikael Andersson; +Cc: dm-devel


Mikeal, this smells like a devicemapper lockup.  Could you please test
2.6.12-rc5 and provide us with a status update?

Thanks.


Mikael Andersson <mikael@karett.se> wrote:
>
> Mikael Andersson wrote:
> > During heavy io-load a lockup occurs that appears to prevent any disk
> > output from taking place. fs is reiserfs on two device-mapper mirrored
> > 200G maxtor disks. After the lockup occurs you can to things like 'ls',
> > but echo > test.txt will hang.
> 
> fs is now ext3
> 
> > 
> > A typical workload producing the error is doing:
> > rsync of large (1GB) over 100Mbit ethernet
> > simultaneous compilation / gunzip
> 
> Or almost anything that writes something to the disk.
> 
> > 
> > I've disabled preemption, and tried with and without acpi enabled, with
> > and without smp support (it was smp by default so i switched it off).
> > Also tried with another nic (rtl8139) since i got an nv_stop_tx:
> > TransmitterStatus remained busy<6> in the logs. I also tried 2.6.11.7
> > with the same result.
> 
>  Tried converting to ext3, some problem, albeit the lockups are less
> severe. More of the locked processes can be killed and echo > test.txt
> works. So _some_ io gets through.
>  The output from sysrq-T is somewhat less confusing though, it appears
> then hung processes is somehow being hung in __generic_unplug_device,  i
> had a look at the assembler, but couldn't make heads or tails of it. the
> code at __generic_unplug_device+19 was test %eax,%eax immediately
> preceded by a callq to the test instruction. Obviously something magic
> (by my eyes) is going on here.
> 
>  Also tried 2.6.12_rc3-mm3
> 
>  I'd really like to find a solution to this since it kinda borks the
> nice an shiny machine if it can't handle large files without getting
> into trouble.
> 
>  I've been working on this for two days, have been trying to find
> similar bug reports, trying a lot of different kernels and kernel
> options to no avail.
>  I'm a little out of options right now, any ideas for something to try,
> patches to test, or some help in understanding what's happening ?
> 
> 
> kmirrord/0 D ffff81003f1bccd8 0 978 9 1731 977 (L-TLB)
> Call Trace:
> <ffffffff8016a2d6>{cache_alloc_refill+1222}
> <ffffffff804a2f9f>{io_schedule+15}
> --
> kjournald D ffff81003e94bcd8 0 1748 1 2060 953 (L-TLB)
> Call Trace:
> <ffffffff802e9c13>{__generic_unplug_device+19}
> <ffffffff802e9cfd>{generic_unplug_device+189}
> --
> rsync D 000000701553dccf 0 6903 6901 (NOTLB)
> Call Trace:
> <ffffffff802e9c13>{__generic_unplug_device+19}
> <ffffffff802e9cfd>{generic_unplug_device+189}
> --
> x86_64-pc-lin D 0000006dc7d23e49 0 13785 13742 (NOTLB)
> Call Trace:
> <ffffffff802e9cfd>{generic_unplug_device+189}
> <ffffffff8040e3ad>{dm_unplug_all+29}
> 
> /Mikael Andersson
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk output lockup 2.6.12_rc2 2.6.11.7
  2005-05-26  6:38   ` Andrew Morton
@ 2005-05-30 10:30     ` Mikael Andersson
  0 siblings, 0 replies; 4+ messages in thread
From: Mikael Andersson @ 2005-05-30 10:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dm-devel

Andrew Morton wrote:

>Mikeal, this smells like a devicemapper lockup.  Could you please test
>2.6.12-rc5 and provide us with a status update?
>  
>

I haven't got any unused disks to try this on atm, but i might be able
to use dmsetup
to create a dmraid inside my 2G swap partition and craft a test which
works with
the limited space available. I'll send a report to the list as soon as
i've got any results,
but it will probably take some time in any case.

>Thanks.
>  
>
/Mikael

>Mikael Andersson <mikael@karett.se> wrote:
>  
>
>>Mikael Andersson wrote:
>>    
>>
>>>During heavy io-load a lockup occurs that appears to prevent any disk
>>>output from taking place. fs is reiserfs on two device-mapper mirrored
>>>200G maxtor disks. After the lockup occurs you can to things like 'ls',
>>>but echo > test.txt will hang.
>>>      
>>>
>>fs is now ext3
>>
>>    
>>
>>>A typical workload producing the error is doing:
>>>rsync of large (1GB) over 100Mbit ethernet
>>>simultaneous compilation / gunzip
>>>      
>>>
>>Or almost anything that writes something to the disk.
>>
>>    
>>
>>>I've disabled preemption, and tried with and without acpi enabled, with
>>>and without smp support (it was smp by default so i switched it off).
>>>Also tried with another nic (rtl8139) since i got an nv_stop_tx:
>>>TransmitterStatus remained busy<6> in the logs. I also tried 2.6.11.7
>>>with the same result.
>>>      
>>>
>> Tried converting to ext3, some problem, albeit the lockups are less
>>severe. More of the locked processes can be killed and echo > test.txt
>>works. So _some_ io gets through.
>> The output from sysrq-T is somewhat less confusing though, it appears
>>then hung processes is somehow being hung in __generic_unplug_device,  i
>>had a look at the assembler, but couldn't make heads or tails of it. the
>>code at __generic_unplug_device+19 was test %eax,%eax immediately
>>preceded by a callq to the test instruction. Obviously something magic
>>(by my eyes) is going on here.
>>
>> Also tried 2.6.12_rc3-mm3
>>
>> I'd really like to find a solution to this since it kinda borks the
>>nice an shiny machine if it can't handle large files without getting
>>into trouble.
>>
>> I've been working on this for two days, have been trying to find
>>similar bug reports, trying a lot of different kernels and kernel
>>options to no avail.
>> I'm a little out of options right now, any ideas for something to try,
>>patches to test, or some help in understanding what's happening ?
>>
>>
>>kmirrord/0 D ffff81003f1bccd8 0 978 9 1731 977 (L-TLB)
>>Call Trace:
>><ffffffff8016a2d6>{cache_alloc_refill+1222}
>><ffffffff804a2f9f>{io_schedule+15}
>>--
>>kjournald D ffff81003e94bcd8 0 1748 1 2060 953 (L-TLB)
>>Call Trace:
>><ffffffff802e9c13>{__generic_unplug_device+19}
>><ffffffff802e9cfd>{generic_unplug_device+189}
>>--
>>rsync D 000000701553dccf 0 6903 6901 (NOTLB)
>>Call Trace:
>><ffffffff802e9c13>{__generic_unplug_device+19}
>><ffffffff802e9cfd>{generic_unplug_device+189}
>>--
>>x86_64-pc-lin D 0000006dc7d23e49 0 13785 13742 (NOTLB)
>>Call Trace:
>><ffffffff802e9cfd>{generic_unplug_device+189}
>><ffffffff8040e3ad>{dm_unplug_all+29}
>>
>>/Mikael Andersson
>>-
>>    
>>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2005-05-30 10:30 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-20 21:51 Disk output lockup amd64 nforce4 device-mapper 2.6.12_rc2 Mikael Andersson
2005-04-21  9:06 ` Disk output lockup 2.6.12_rc2 2.6.11.7 Mikael Andersson
2005-05-26  6:38   ` Andrew Morton
2005-05-30 10:30     ` Mikael Andersson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.