linux-raid.vger.kernel.org archive mirror
* raid6 resync blocks the entire system
@ 2007-11-18 20:06 Bernd Schubert
  2007-11-18 20:49 ` pg_mh, Peter Grandi
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2007-11-18 20:06 UTC (permalink / raw)
  To: linux-raid

Hi,

on RAID initialization, or later on a resync, our systems become 
unresponsive. Ping still works, but ssh logins won't succeed until the resync 
has finished. On a serial or local connection one can still type, but, as with 
ssh, whatever you request from the system won't be done until the RAID sync is 
done.

This is with 2.6.22, but as far as I remember we also observed this with 
2.6.23. Also, the higher the stripe cache size, the higher the 
probability the system will go into this state.

The system is booted diskless over NFS, so there is absolutely no I/O to the disks.

[ 3017.702688] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
[ 3017.742667] SysRq : Show Blocked State
[ 3017.746617]
[ 3017.746618]                                  free                        sibling
[ 3017.755846]   task                 PC        stack   pid father child younger older
[ 3017.763830] md0_resync    D 000002bea0dece63     0  8909      2 (L-TLB)
[ 3017.770737]  ffff810123905ba0 0000000000000046 0000000000000000 0000000000000000
[ 3017.778424]  0000000300000000 ffff81012467bc10 000000010009bbd1 ffff810129e25050
[ 3017.786078]  00000000000001dc ffff81012b59f570 ffff810129e24ea0 0000000000000000
[ 3017.793523] Call Trace:
[ 3017.796270]  [<ffffffff881ed509>] :raid456:get_active_stripe+0x459/0x540
[ 3017.803190]  [<ffffffff881f2f71>] :raid456:sync_request+0x831/0x850
[ 3017.809607]  [<ffffffff8817ba19>] :md_mod:md_do_sync+0x539/0x930
[ 3017.815745]  [<ffffffff88177fc9>] :md_mod:md_thread+0x49/0x140
[ 3017.821705]  [<ffffffff80249adc>] kthread+0x6c/0xa0
[ 3017.826712]  [<ffffffff8020a888>] child_rip+0xa/0x12
[ 3017.831793]
[ 3017.833331] md1_resync    D 000002be9f6f1c7d     0  8917      2 (L-TLB)
[ 3017.840276]  ffff810123cffba0 0000000000000046 0000000000000000 0000000000000000
[ 3017.847955]  0000000300000000 ffff81012946c490 000000010009bbc8 ffff810129dfdaa0
[ 3017.855721]  000000000000073b ffff81012b59e100 ffff810129dfd8f0 0000000000000000
[ 3017.863225] Call Trace:
[ 3017.865915]  [<ffffffff881ed50e>] :raid456:get_active_stripe+0x45e/0x540
[ 3017.872946]  [<ffffffff881f2f71>] :raid456:sync_request+0x831/0x850
[ 3017.879510]  [<ffffffff8817ba19>] :md_mod:md_do_sync+0x539/0x930
[ 3017.885775]  [<ffffffff88177fc9>] :md_mod:md_thread+0x49/0x140
[ 3017.891865]  [<ffffffff80249adc>] kthread+0x6c/0xa0
[ 3017.896957]  [<ffffffff8020a888>] child_rip+0xa/0x12
[ 3017.902135]
[ 3017.903685] md2_resync    D 000002be9e4bded5     0  8925      2 (L-TLB)
[ 3017.910662]  ffff81012279dba0 0000000000000046 0000000000000000 0000000000000000
[ 3017.918227]  0000000000000000 0000000000000000 000000010009bbc2 ffff810129dfd3d0
[ 3017.925785]  000000000000024c ffff81012b510750 ffff810129dfd220 0000000000000000
[ 3017.933137] Call Trace:
[ 3017.935825]  [<ffffffff881ed50e>] :raid456:get_active_stripe+0x45e/0x540
[ 3017.942613]  [<ffffffff881f2f71>] :raid456:sync_request+0x831/0x850
[ 3017.948972]  [<ffffffff8817ba19>] :md_mod:md_do_sync+0x539/0x930
[ 3017.955071]  [<ffffffff88177fc9>] :md_mod:md_thread+0x49/0x140
[ 3017.960960]  [<ffffffff80249adc>] kthread+0x6c/0xa0
[ 3017.965883]  [<ffffffff8020a888>] child_rip+0xa/0x12
[ 3017.970894]
[ 3017.972417] mcelog        D 000002bae6ba88a2     0  9005   9003 (NOTLB)
[ 3017.979169]  ffff810115b09dd8 0000000000000082 0000000000000000 0000000000000000
[ 3017.986753]  ffff81012fd7b9e0 ffffffff80265bc5 000000010009ac27 ffff81012a84a3f0
[ 3017.994312]  0000000000001438 ffff81012b5f8810 ffff81012a84a240 0000000000000000
[ 3018.001671] Call Trace:
[ 3018.004341]  [<ffffffff804ed69e>] wait_for_completion+0x9e/0xf0
[ 3018.010347]  [<ffffffff8024783c>] synchronize_rcu+0x3c/0x50
[ 3018.015985]  [<ffffffff80213fb8>] mce_read+0x118/0x240
[ 3018.021189]  [<ffffffff8028e265>] vfs_read+0xb5/0x170
[ 3018.026287]  [<ffffffff8028e623>] sys_read+0x53/0x90
[ 3018.031325]  [<ffffffff80209a6e>] system_call+0x7e/0x83
[ 3018.036619]  [<00002b32d97b9cd0>]
[ 3018.039963]

Any ideas?

Thanks in advance,
Bernd




* Re: raid6 resync blocks the entire system
  2007-11-18 20:06 raid6 resync blocks the entire system Bernd Schubert
@ 2007-11-18 20:49 ` pg_mh, Peter Grandi
  2007-11-18 22:18   ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: pg_mh, Peter Grandi @ 2007-11-18 20:49 UTC (permalink / raw)
  To: Linux RAID

>>> On Sun, 18 Nov 2007 21:06:42 +0100, Bernd Schubert
>>> <bernd-schubert@gmx.de> said:

bernd-schubert> Hi, on raid-initialization or later on a re-sync
bernd-schubert> our systems become unresponsive. [ ... ] Also,
bernd-schubert> the higher the stripe cache size, the higher the
bernd-schubert> probability the system will go into this state.

Is either point surprising?

[ ... ]

bernd-schubert> Any ideas?

Well, the default Linux storage subsystem parameters are usually set
for high throughput, not low latency. Anyhow, there is another
interesting point: how many disks are in that RAID6 array? And what is
the system bus and host adapter to the disk bus?


* Re: raid6 resync blocks the entire system
  2007-11-18 20:49 ` pg_mh, Peter Grandi
@ 2007-11-18 22:18   ` Bernd Schubert
  2007-11-20  5:55     ` Mark Hahn
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2007-11-18 22:18 UTC (permalink / raw)
  To: pg_mh, Peter Grandi; +Cc: Linux RAID

On Sun, Nov 18, 2007 at 08:49:44PM +0000, pg_mh@mh.to.sabi.co.UK wrote:
> >>> On Sun, 18 Nov 2007 21:06:42 +0100, Bernd Schubert
> >>> <bernd-schubert@gmx.de> said:
> 
> bernd-schubert> Hi, on raid-initialization or later on a re-sync
> bernd-schubert> our systems become unresponsive. [ ... ] Also,
> bernd-schubert> the higher the stripe cache size, the higher the
> bernd-schubert> probability the system will go into this state.
> 
> Is either point surprising?

On a system booted over the network, so with nothing accessing the RAID 
devices, IMHO yes.

> 
> [ ... ]
> 
> bernd-schubert> Any ideas?
> 
> Well, the default Linux storage subsystem parameters are set
> usually for high throughput, not low latency. Anyhow, there is

I'm not talking about low latency, but a latency of at least 24 h is a bit high,
isn't it? Linux is a multi-tasking system; even under high load it has to 
respond to keyboard input, especially when there are 8 CPUs with 4 of them
idle (and almost 4 used for the sync processes).

> another interesting point, which is how many disks in that RAID6
> array? What is the system bus and host adapter to the disk bus?

6 x Infortrend A16U-G2430 (hardware RAID6) on 3 dual-channel LSI22320 HBAs 
in U320 mode, so effectively we have RAID66. Since Linux software RAID is presently
not multi-threaded, the CPU would limit RAID6 write performance to 
200 MB/s. Since we do have 8 CPUs, we created 3 partitions on each A16U-G2430,
so in the end we have 3 x software RAID6.

I know this is a high-end configuration, but no latency-critical component
is at any limit: 4 CPUs are idling and the PCI-X buses are 
far from being saturated. 


Cheers,
Bernd




* Re: raid6 resync blocks the entire system
  2007-11-18 22:18   ` Bernd Schubert
@ 2007-11-20  5:55     ` Mark Hahn
  2007-11-20 15:33       ` BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system) Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Hahn @ 2007-11-20  5:55 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Linux RAID

> I know this is a high end configuration, but no latency critical component
> is at any limit, 4 CPUs are idling, PCI-X busses are
> far away from being saturated.

yes, but what about memory?  I speculate that this is an Intel-based 
system that is relatively memory-starved.


* Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)
  2007-11-20  5:55     ` Mark Hahn
@ 2007-11-20 15:33       ` Bernd Schubert
  2007-11-20 17:16         ` Mark Hahn
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2007-11-20 15:33 UTC (permalink / raw)
  To: Mark Hahn, Neil Brown; +Cc: Linux RAID

On Tuesday 20 November 2007 06:55:52 Mark Hahn wrote:
> > I know this is a high end configuration, but no latency critical
> > component is at any limit, 4 CPUs are idling, PCI-X busses are
> > far away from being saturated.
>
> yes, but what about memory?  I speculate that this is an Intel-based
> system that is relatively memory-starved.

Yes, it's an Intel system, since there are still problems getting AMD quad-cores 
delivered. Anyway, I don't believe the system's memory bandwidth is only 
6 x 280 MB/s = 1680 MB/s (280 MB/s is the maximum I measured per SCSI 
channel). Actually, the measured bandwidth of this system is 4 GB/s.

With 2.6.23 and debugging enabled we now nicely get soft lockups.


[  187.913000] Call Trace:
[  187.917128]  [<ffffffff8020d3c1>] show_trace+0x41/0x70
[  187.922401]  [<ffffffff8020d400>] dump_stack+0x10/0x20
[  187.927667]  [<ffffffff80269949>] softlockup_tick+0x129/0x180
[  187.933529]  [<ffffffff80240c9d>] update_process_times+0x7d/0xa0
[  187.939676]  [<ffffffff8021c634>] smp_local_timer_interrupt+0x34/0x60
[  187.946275]  [<ffffffff8021c71a>] smp_apic_timer_interrupt+0x4a/0x70
[  187.952731]  [<ffffffff8020c7db>] apic_timer_interrupt+0x6b/0x70
[  187.958848]  [<ffffffff881ca5e3>] :raid456:handle_stripe+0xe23/0xf50
[  187.965360]  [<ffffffff881cb365>] :raid456:raid5d+0x2d5/0x3e0
[  187.971267]  [<ffffffff881a0ff9>] :md_mod:md_thread+0x39/0x130
[  187.977246]  [<ffffffff8024c73c>] kthread+0x6c/0xa0
[  187.982225]  [<ffffffff8020c9b8>] child_rip+0xa/0x12
[  187.988910]
[  207.232587] BUG: soft lockup detected on CPU#1!
[  207.237405]
[  207.237405] Call Trace:
[  207.241672]  [<ffffffff8020d3c1>] show_trace+0x41/0x70
[  207.247049]  [<ffffffff8020d400>] dump_stack+0x10/0x20
[  207.252470]  [<ffffffff80269949>] softlockup_tick+0x129/0x180
[  207.258513]  [<ffffffff80240c9d>] update_process_times+0x7d/0xa0
[  207.264651]  [<ffffffff8021c634>] smp_local_timer_interrupt+0x34/0x60
[  207.271222]  [<ffffffff8021c71a>] smp_apic_timer_interrupt+0x4a/0x70
[  207.277698]  [<ffffffff8020c7db>] apic_timer_interrupt+0x6b/0x70
[  207.283819]  [<ffffffff8023c6c9>] __do_softirq+0x59/0x100
[  207.289332]  [<ffffffff8020cd2c>] call_softirq+0x1c/0x30
[  207.294743]  [<ffffffff8020f165>] do_softirq+0x75/0xe0
[  207.299998]  [<ffffffff8020ee72>] do_IRQ+0xc2/0x1b0
[  207.305010]  [<ffffffff8020c076>] ret_from_intr+0x0/0xf
[  207.310346]  [<ffffffff80515606>] _spin_unlock_irq+0x26/0x30
[  207.316111]  [<ffff8101234d97a8>]
[  207.319509] DWARF2 unwinder stuck at 0xffff8101234d97a8
[  207.324905] Leftover inexact backtrace:
[  207.328843]  [<ffffffff80515565>] _spin_unlock_irqrestore+0x55/0x70
[  207.335236]  [<ffffffff8025517d>] trace_hardirqs_on+0xbd/0x160
[  207.341184]  [<ffffffff881a0ff9>] :md_mod:md_thread+0x39/0x130
[  207.347123]  [<ffffffff8024cdf0>] autoremove_wake_function+0x0/0x30
[  207.353504]  [<ffffffff8024cdf0>] autoremove_wake_function+0x0/0x30
[  207.359898]  [<ffffffff881a0fc0>] :md_mod:md_thread+0x0/0x130
[  207.365761]  [<ffffffff8024c73c>] kthread+0x6c/0xa0
[  207.370765]  [<ffffffff8020c9b8>] child_rip+0xa/0x12
[  207.375855]  [<ffffffff8020c0cc>] restore_args+0x0/0x30
[  207.381172]  [<ffffffff8024c6d0>] kthread+0x0/0xa0
[  207.386089]  [<ffffffff8020c9ae>] child_rip+0x0/0x12
[  207.391169]
[  219.044383] BUG: soft lockup detected on CPU#1!

(gdb) l *(handle_stripe+0xe23)
0x5613 is in handle_stripe (drivers/md/raid5.c:1853).
1848
1849    static void end_reshape(raid5_conf_t *conf);
1850
1851    static int page_is_zero(struct page *p)
1852    {
1853            char *a = page_address(p);
1854            return ((*(u32*)a) == 0 &&
1855                    memcmp(a, a+4, STRIPE_SIZE-4)==0);
1856    }
1857


Any ideas how to debug this?


Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


* Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)
  2007-11-20 15:33       ` BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system) Bernd Schubert
@ 2007-11-20 17:16         ` Mark Hahn
  2007-11-20 18:32           ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Hahn @ 2007-11-20 17:16 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Linux RAID

>> yes, but what about memory?  I speculate that this is an Intel-based
>> system that is relatively memory-starved.
>
> Yes, its an intel system, since still has problems to deliver AMD quadcores.
> Anyway, I don't believe the systems memory bandwidth is only
> 6 x 280 MB/s = 1680 MB/s (280 MB/s is the maximum I measured per scsi
> channel). Actually, the measured bandwith of this system is 4 GB/s.

4 GB/s is terrible, especially for 8 cores, but perhaps you know this.

> With 2.6.23 and enabled debugging we now nicely get softlockups.
>
> [  187.913000] Call Trace:
> [  187.917128]  [<ffffffff8020d3c1>] show_trace+0x41/0x70
> [  187.922401]  [<ffffffff8020d400>] dump_stack+0x10/0x20
> [  187.927667]  [<ffffffff80269949>] softlockup_tick+0x129/0x180
> [  187.933529]  [<ffffffff80240c9d>] update_process_times+0x7d/0xa0
> [  187.939676]  [<ffffffff8021c634>] smp_local_timer_interrupt+0x34/0x60
> [  187.946275]  [<ffffffff8021c71a>] smp_apic_timer_interrupt+0x4a/0x70
> [  187.952731]  [<ffffffff8020c7db>] apic_timer_interrupt+0x6b/0x70
> [  187.958848]  [<ffffffff881ca5e3>] :raid456:handle_stripe+0xe23/0xf50

so handle_stripe is taking too long; is this not consistent with the memory theory?

> (gdb) l *(handle_stripe+0xe23)
> 0x5613 is in handle_stripe (drivers/md/raid5.c:1853).
> 1848
> 1849    static void end_reshape(raid5_conf_t *conf);
> 1850
> 1851    static int page_is_zero(struct page *p)
> 1852    {
> 1853            char *a = page_address(p);
> 1854            return ((*(u32*)a) == 0 &&
> 1855                    memcmp(a, a+4, STRIPE_SIZE-4)==0);
> 1856    }

that's a weird piece of code.
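
If I read it right, though, it is just a compact way of testing whether a
whole stripe page is zero: the first u32 must be zero, and the overlapping
memcmp(a, a+4, ...) checks that every byte equals the byte 4 positions
before it, which together implies the whole page is zero. A small user-space
sketch of the same idiom (4096 here merely stands in for the kernel's
STRIPE_SIZE):

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STRIPE_SIZE 4096	/* stand-in for the kernel constant */

/* Same idiom as page_is_zero(): the buffer is all zero iff its first
 * u32 is zero and each byte equals the byte 4 positions earlier, which
 * the overlapping memcmp(a, a + 4, ...) verifies in a single pass. */
static int buffer_is_zero(const char *a)
{
	uint32_t first;

	memcpy(&first, a, sizeof(first));	/* avoid alignment issues in user space */
	return first == 0 &&
	       memcmp(a, a + 4, STRIPE_SIZE - 4) == 0;
}

int main(void)
{
	static char page[STRIPE_SIZE];		/* zero-initialized */

	assert(buffer_is_zero(page));
	page[STRIPE_SIZE - 1] = 1;		/* dirty the last byte */
	assert(!buffer_is_zero(page));
	return 0;
}

It is cheap per call, but since the soft-lockup address above resolves right
into it, handle_stripe() apparently spends a fair amount of its time there
during the resync.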


* Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)
  2007-11-20 17:16         ` Mark Hahn
@ 2007-11-20 18:32           ` Bernd Schubert
  2007-11-22  5:11             ` Neil Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2007-11-20 18:32 UTC (permalink / raw)
  To: Mark Hahn; +Cc: Linux RAID

On Tuesday 20 November 2007 18:16:43 Mark Hahn wrote:
> >> yes, but what about memory?  I speculate that this is an Intel-based
> >> system that is relatively memory-starved.
> >
> > Yes, its an intel system, since still has problems to deliver AMD
> > quadcores. Anyway, I don't believe the systems memory bandwidth is only
> > 6 x 280 MB/s = 1680 MB/s (280 MB/s is the maximum I measured per scsi
> > channel). Actually, the measured bandwith of this system is 4 GB/s.
>
> 4 GB/s is terrible, especially for 8 cores, but perhaps you know this.

Yes, but these are Lustre storage nodes and the bottleneck is I/O, not 
memory. We get 420 MB/s writes and 900 MB/s reads per OSS (one storage node). 
For that we need between 3 and 4 CPUs; the Lustre threads can take the other 
ones. We also only have problems during RAID resync; during normal I/O operation 
the system is perfectly responsive.

>
> > With 2.6.23 and enabled debugging we now nicely get softlockups.
> >
> > [  187.913000] Call Trace:
> > [  187.917128]  [<ffffffff8020d3c1>] show_trace+0x41/0x70
> > [  187.922401]  [<ffffffff8020d400>] dump_stack+0x10/0x20
> > [  187.927667]  [<ffffffff80269949>] softlockup_tick+0x129/0x180
> > [  187.933529]  [<ffffffff80240c9d>] update_process_times+0x7d/0xa0
> > [  187.939676]  [<ffffffff8021c634>] smp_local_timer_interrupt+0x34/0x60
> > [  187.946275]  [<ffffffff8021c71a>] smp_apic_timer_interrupt+0x4a/0x70
> > [  187.952731]  [<ffffffff8020c7db>] apic_timer_interrupt+0x6b/0x70
> > [  187.958848]  [<ffffffff881ca5e3>] :raid456:handle_stripe+0xe23/0xf50
>
> so handle_stripe is taking too long; is this not consistent with the memory
> theory?

Here is an argument against the memory theory. Per hardware RAID we have 3 partitions 
to introduce some kind of 'manual' CPU threading. Usually the md driver 
detects that it is doing RAID over the same devices and, I guess in order to 
prevent disk thrashing, it delays the resync of the other two md devices. The 
resync is then limited by the CPU to about 80 MB/s (the md process takes 100% 
of the CPU time of one single CPU).
Since we simply do not cause any disk thrashing at 80 MB/s, I disabled this 
detection code and also limited the maximum sync speed to 40 MB/s. Now there 
are *three* resyncs running, each at 40 MB/s, so 120 MB/s altogether. At 
this speed the system *sometimes* reacts slowly, but at least I can log in and 
still do something on the system. 
I think you will agree that more memory bandwidth is used for 3 x 40 MB/s than for 
a 1 x 80 MB/s resync.

My personal (wild) guess for this problem is that there is a global 
lock somewhere, preventing all other CPUs from doing anything. At 100% CPU usage 
(at 80 MB/s) there is probably no time window left to wake up the other CPUs, or it is 
small enough that only high-priority kernel threads get to run.
When I limit the sync to 40 MB/s, each resync CPU has to wait long enough 
to allow the other CPUs to wake up.


Cheers,
Bernd




-- 
Bernd Schubert
Q-Leap Networks GmbH


* Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)
  2007-11-20 18:32           ` Bernd Schubert
@ 2007-11-22  5:11             ` Neil Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Neil Brown @ 2007-11-22  5:11 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Mark Hahn, Linux RAID

On Tuesday November 20, bs@q-leap.de wrote:
> 
> My personal (wild) guess for this problem is, that there is somewhere a global 
> lock, preventing all other CPUs to do something. At 100%s (at 80 MB/s) 
> there's probably not left any time frame to wake up the other CPUs or its 
> sufficiently small to only allow high priority kernel threads to do 
> something.
> When I limit the sync to 40MB/s each resync-CPU has to wait sufficiently long 
> to allow the other CPUs to wake up.
> 
> 

md doesn't hold any locks that would prevent other parts of the
kernel from working.

I cannot imagine what would be causing your problems.  The resync
thread makes a point of calling cond_resched() periodically so that it
will let other processes run even if it constantly has work to do.
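
Schematically, and with almost everything trimmed away, that loop looks
something like the following (a rough sketch only, not the literal
md_do_sync() code):

	while (j < max_sectors) {
		int skipped = 0;

		/* ask the personality (raid5.c etc.) to resync the next chunk */
		sectors = mddev->pers->sync_request(mddev, j, &skipped, 0);
		j += sectors;

		/* throttle if we are running faster than sync_speed_max
		 * (or faster than sync_speed_min while normal I/O waits) */
		if (currspeed > speed_max(mddev))
			msleep(500);

		/* voluntarily give the CPU back so other runnable tasks
		 * are not starved, even though resync work remains */
		cond_resched();
	}

So by itself it should not be able to monopolise a CPU for long stretches.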

If you have nothing that could write to the RAID6 arrays, then I
cannot see how the resync could affect the rest of the system except
to reduce the amount of available CPU time.  And as CPU is normally
much faster than drives, you wouldn't expect that effect to be very
great.

Very strange.

Can you do 'alt-sysrq-T' when it is frozen and get the process traces
from the kernel logs?

Can you send me the output of "cat /proc/mdstat" after the resync has started, but
before the system has locked up?

I'm sorry that I cannot suggest anything more useful.

NeilBrown


