RAID1 over aoe devices freezes cp-procs on failure of one aoe device

linux-raid.vger.kernel.org archive mirror

* RAID1 over aoe devices freezes cp-procs on failure of one aoe device
From: Lars Täuber @ 2008-06-05 14:25 UTC
  To: linux-raid, sah, ecashin

Hi there.

Please let me explain a problem I am struggling with on a self-built SAN.

If one of the two AoE devices in a RAID1 fails, any process copying to the mounted RAID freezes.

This happens on a test system, so I can run further tests if you need more information, but they are time-consuming.

The AoE targets use qaoed as the server.
To simulate a failure, I shut down the network interface on which qaoed serves requests on one of the AoE targets.

  Linux        Linux
+-------+    +-------+
| qaoed |    | qaoed |
+--+----+    +---+---+
    \           / <- network device shut down
     \         / 
   +--+-------+--+
   |  |  aoe  |  |
   | e2.0  e11.1 | Linux 2.6.22.17-0.1-default
   |    \   /    | SuSE-10.3
   |    RAID1    | sekundus
   |     md9     |
   +-------------+

sekundus:~ # cat /proc/partitions 
major minor  #blocks  name
[...]
 152  2832 1074790400 etherd/e11.1
 152   512 1074790400 etherd/e2.0
   9     9 1074790336 md9

sekundus:~ # cat /proc/mdstat 
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md9 : active raid1 etherd/e2.0[0] etherd/e11.1[1]
      1074790336 blocks [2/2] [UU]

The lost aoe-device is correctly marked as faulty, but the RAID is no longer usable by any copying process, although the remaining device should be enough for a RAID1. Removing the faulty device from md9 made no difference.

Is it possible that one faulty aoe-device somehow blocks the aoe module so that all other aoe devices aren't accessible anymore? Or is the RAID subsystem responsible for this?

How can I debug this? The only log entries related to this are:

/var/log/messages:
Jun  5 11:16:01 sekundus kernel: raid1: etherd/e11.1: rescheduling sector 293594096
Jun  5 11:16:01 sekundus kernel: raid1: etherd/e11.1: rescheduling sector 293594224
Jun  5 11:16:01 sekundus kernel: raid1: etherd/e11.1: rescheduling sector 293594472
Jun  5 11:16:01 sekundus kernel: md: super_written gets error=-5, uptodate=0
Jun  5 11:16:01 sekundus kernel: raid1: Disk failure on etherd/e11.1, disabling device. 
Jun  5 11:16:01 sekundus kernel:        Operation continuing on 1 devices
Jun  5 11:16:01 sekundus kernel: RAID1 conf printout:
Jun  5 11:16:01 sekundus kernel:  --- wd:1 rd:2
Jun  5 11:16:01 sekundus kernel:  disk 0, wo:0, o:1, dev:etherd/e2.0
Jun  5 11:16:01 sekundus kernel:  disk 1, wo:1, o:0, dev:etherd/e11.1
Jun  5 11:16:01 sekundus kernel: RAID1 conf printout:
Jun  5 11:16:01 sekundus kernel:  --- wd:1 rd:2
Jun  5 11:16:01 sekundus kernel:  disk 0, wo:0, o:1, dev:etherd/e2.0
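
As a starting point for debugging, the array state can be checked from a script rather than by eye. The sketch below is not from the thread. It assumes the usual /proc/mdstat layout, in which the member status field (such as [UU]) shows an underscore for each missing member. The function name degraded_arrays is made up for illustration.

```shell
# List md arrays whose member status field (e.g. "[U_]") shows a missing member.
# Reads mdstat-formatted text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        # Remember the array name when its header line is seen.
        /^md[0-9]+ :/ { name = $1 }
        # The first following line with a [U_]-style field carries the member status.
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument reads the live /proc/mdstat and prints one array name per line. Empty output means no array reports a missing member.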

After this, the whole system does not respond to a shutdown. I could still log in over the network for several minutes, until I hard-rebooted via SysRq.

Thanks for any help.
Lars

PS: I gave up using RAID on top of _multipath_ on an LSI SCSI controller (non-RAID SAS) connected to an external storage enclosure (with expander), because I couldn't find out which subsystem (SCSI, driver, controller firmware, expander firmware, multipathing, RAID) to blame for the reproducible RAID sync failures. Whom should one contact about problems with such a complex system? Or is there a step-by-step debugging guide for what to test, how, and in what order?
That attempt was a real waste of time. I have simply left out multipathd now, and it seems to work.

-- 
                            Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstrasse 22-23                     10117 Berlin
Tel.: +49 30 20370-352           http://www.bbaw.de
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RAID1 over aoe devices freezes cp-procs on failure of one aoe device
From: Gabor Gombas @ 2008-06-06  7:34 UTC
  To: Lars Täuber; +Cc: linux-raid, sah, ecashin

On Thu, Jun 05, 2008 at 04:25:02PM +0200, Lars Täuber wrote:

> The lost aoe-device is correctly marked as faulty, but the RAID is no
> longer usable by any copying process, although the remaining device
> should be enough for a RAID1. Removing the faulty device from md9
> made no difference.
> 
> Is it possible that one faulty aoe-device somehow blocks the aoe
> module so that all other aoe devices aren't accessible anymore? Or is
> the RAID subsystem responsible for this?

It should be easy to test: when the RAID hangs, try to read directly
from the remaining device ("dd if=/dev/etherd/e2.0 ..."). If that also
hangs, then it is an AoE issue.
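
The direct-read test suggested above can be wrapped so that it gives up instead of blocking the shell. This is a minimal sketch, not part of the original reply, assuming GNU coreutils (timeout and dd). probe_read is a hypothetical helper name, and the device path is supplied by the caller.

```shell
# Try to read 16 MiB directly from a device, giving up after a timeout.
# $1 = device path, $2 = timeout in seconds (default: 30).
# Returns 0 if the read completed, 124 if timeout had to stop dd,
# and dd's own non-zero status on any other failure.
probe_read() {
    timeout "${2:-30}" dd if="$1" of=/dev/null bs=1M count=16 2>/dev/null
}
```

According to the kernel log in this thread the surviving member is etherd/e2.0, so a test run would look like "probe_read /dev/etherd/e2.0 30; echo $?". A buffered read can be satisfied from the page cache, so adding iflag=direct to the dd call may give a more honest result on a real block device. If dd is stuck in uninterruptible sleep it may not react to the signal, in which case the call itself can still block.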

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

* Re: RAID1 over aoe devices freezes cp-procs on failure of one aoe device
From: Lars Täuber @ 2008-06-06 10:03 UTC
  To: Gabor Gombas; +Cc: linux-raid, sah, ecashin

Hello Gabor,

Gabor Gombas <gombasg@sztaki.hu> wrote:
> On Thu, Jun 05, 2008 at 04:25:02PM +0200, Lars Täuber wrote:
> 
> > The lost aoe-device is correctly marked as faulty, but the RAID is no
> > longer usable by any copying process, although the remaining device
> > should be enough for a RAID1. Removing the faulty device from md9
> > made no difference.
> > 
> > Is it possible that one faulty aoe-device somehow blocks the aoe
> > module so that all other aoe devices aren't accessible anymore? Or is
> > the RAID subsystem responsible for this?
> 
> It should be easy to test: when the RAID hangs, try to read directly
> from the remaining device ("dd if=/dev/etherd/e2.0 ..."). If that also
> hangs, then it is an AoE issue.

The easiest tests are the last ones that come to mind. :o)
There is another problem here:
The described problem is not reproducible. I'm not sure whether I did something wrong or the situation is no longer the same after the shutdown.

I'm checking this right now.

Thanks
Lars

* Re: RAID1 over aoe devices freezes cp-procs on failure of one aoe device
From: Ed L. Cashin @ 2008-06-09 17:32 UTC
  To: Lars Täuber; +Cc: linux-raid, sah

On Thu, Jun 05, 2008 at 04:25:02PM +0200, Lars Täuber wrote:
> Hi there.
> 
> Please let me explain a problem I am struggling with on a self-built SAN.
> 
> If one of the two AoE devices in a RAID1 fails, any process copying to the mounted RAID freezes.

The md driver might just be waiting for the I/O to fail or succeed
while the AoE command times out.  You can adjust aoe_deadsecs if you
like, so that I/O to an unavailable component is failed more quickly.
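
For readers who want to try this, the timeout can be inspected and changed from a shell. The sketch below is an illustration, not part of the original reply. It assumes that aoe_deadsecs is exposed as a module parameter at /sys/module/aoe/parameters/aoe_deadsecs and is writable by root at run time, which may depend on the driver version in use. The helper names and the AOE_DEADSECS_FILE override are hypothetical.

```shell
# Assumed sysfs location of the parameter; set AOE_DEADSECS_FILE to point elsewhere.
deadsecs_file() {
    printf '%s\n' "${AOE_DEADSECS_FILE:-/sys/module/aoe/parameters/aoe_deadsecs}"
}

# Print the current timeout in seconds.
get_deadsecs() {
    cat "$(deadsecs_file)"
}

# Set a new timeout; accepts only a non-negative integer number of seconds.
set_deadsecs() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_deadsecs SECONDS" >&2; return 1 ;;
    esac
    printf '%s\n' "$1" > "$(deadsecs_file)"
}
```

For example, "get_deadsecs" shows the current value and "set_deadsecs 30" (as root) asks the driver to fail I/O to an unreachable target after roughly 30 seconds. A shorter value makes md see the failure sooner, but also makes brief network interruptions more likely to drop a member from the array.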

There is an aoetools-discuss mailing list at the sourceforge
"aoetools" project.  There you can talk to other people doing Linux
Software RAID 1 over AoE targets.

-- 
  Ed L Cashin <ecashin@coraid.com>
