From: Lars Täuber
Subject: RAID1 over AoE devices freezes cp processes on failure of one AoE device
Date: Thu, 5 Jun 2008 16:25:02 +0200
To: linux-raid@vger.kernel.org, sah@coraid.com, ecashin@coraid.com

Hi there,

please let me explain a problem I am struggling with on a self-built SAN:
if one of the two AoE devices of a RAID1 fails, any process copying to the
mounted RAID freezes.

This happens on a testing system, so I could run more tests if you need
more information, but it is time consuming. The AoE targets use qaoed as
the server. To simulate a failure, I shut down the network interface that
qaoed serves requests from on one AoE target (the exact commands are
sketched further below):

      Linux               Linux
    +-------+           +-------+
    | qaoed |           | qaoed |
    +---+---+           +---+---+
         \                 /    <- network device shut down
          \               /
        +--+-----------+--+
        |       aoe       |
        |  e2.0     e11.1 |  Linux 2.6.22.17-0.1-default
        |    \       /    |  SuSE-10.3
        |     RAID1       |  sekundus
        |      md9        |
        +-----------------+

sekundus:~ # cat /proc/partitions
major minor  #blocks  name
[...]
 152     2832 1074790400 etherd/e11.1
 152      512 1074790400 etherd/e2.0
   9        9 1074790336 md9

sekundus:~ # cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md9 : active raid1 etherd/e2.0[0] etherd/e11.1[1]
      1074790336 blocks [2/2] [UU]

The lost AoE device is correctly marked as faulty, but the RAID is not
usable for copying processes any more, although the remaining device
should be enough for a RAID1. There was no change after removing the
faulty device from md9 (the mdadm calls are sketched below as well).

Is it possible that one faulty AoE device somehow blocks the aoe module,
so that all other AoE devices aren't accessible any more? Or is the RAID
subsystem responsible for this?

How can I debug this? There are no entries in the logs regarding this
besides:

/var/log/messages:
Jun  5 11:16:01 sekundus kernel: raid1: etherd/e11.1: rescheduling sector 293594096
Jun  5 11:16:01 sekundus kernel: raid1: etherd/e11.1: rescheduling sector 293594224
Jun  5 11:16:01 sekundus kernel: raid1: etherd/e11.1: rescheduling sector 293594472
Jun  5 11:16:01 sekundus kernel: md: super_written gets error=-5, uptodate=0
Jun  5 11:16:01 sekundus kernel: raid1: Disk failure on etherd/e11.1, disabling device.
Jun  5 11:16:01 sekundus kernel:        Operation continuing on 1 devices
Jun  5 11:16:01 sekundus kernel: RAID1 conf printout:
Jun  5 11:16:01 sekundus kernel:  --- wd:1 rd:2
Jun  5 11:16:01 sekundus kernel:  disk 0, wo:0, o:1, dev:etherd/e2.0
Jun  5 11:16:01 sekundus kernel:  disk 1, wo:1, o:0, dev:etherd/e11.1
Jun  5 11:16:01 sekundus kernel: RAID1 conf printout:
Jun  5 11:16:01 sekundus kernel:  --- wd:1 rd:2
Jun  5 11:16:01 sekundus kernel:  disk 0, wo:0, o:1, dev:etherd/e2.0

The whole system doesn't react to a shutdown after this. I could still
log in over the network for minutes, until I hard-rebooted via SysRq.
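For completeness: the failure was simulated roughly like this on the
qaoed host serving e11.1 (eth1 is only an example name for the interface
qaoed listens on):

  # on the qaoed target host; interface name is an example
  ip link set eth1 down        # or: ifconfig eth1 down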
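Removing the faulty device was done with the usual mdadm calls, roughly
(device names as in /proc/mdstat above):

  mdadm /dev/md9 --fail /dev/etherd/e11.1      # already marked faulty by the kernel
  mdadm /dev/md9 --remove /dev/etherd/e11.1    # no change, cp stays frozen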
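Next time it happens I could also dump the blocked tasks to see where the
cp process hangs; I would try something like this (assuming SysRq on this
kernel supports the 'w' trigger for blocked tasks):

  echo 1 > /proc/sys/kernel/sysrq                  # make sure SysRq is enabled
  echo w > /proc/sysrq-trigger                     # dump blocked (D-state) tasks to the kernel log
  dmesg | tail -n 100
  ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'   # processes in uninterruptible sleep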
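And in case the aoe module itself is the culprit, the AoE side could be
checked with the aoetools (assuming they are installed here), e.g.:

  aoe-stat                          # size and up/down state of each etherd device
  aoe-revalidate /dev/etherd/e11.1  # ask the driver to re-probe the failed target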
Thanks for any help.

Lars

PS: I gave up using RAID on _multipath_ on LSI SCSI (non-RAID SAS)
connected to an external storage (with expander), because I couldn't find
out which subsystem (SCSI, driver, controller firmware, expander firmware,
multipathing, RAID) to blame for the reproducible RAID sync failures.
Whom does one contact for problems with such a complex system? Or is
there a step-by-step debugging guide for what to test, how, and in what
order? That attempt was a real waste of time. I just skip multipathd now;
it seems to work.

-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstrasse 22-23
10117 Berlin
Tel.: +49 30 20370-352
http://www.bbaw.de