From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philip Molter
Subject: RAID1 disk failure causes hung mount
Date: Mon, 01 Sep 2008 13:32:10 -0500
Message-ID: <48BC352A.5010701@datafoundry.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Sender: linux-raid-owner@vger.kernel.org
To: Linux RAID Mailing List
List-Id: linux-raid.ids

Hello,

We're running a modified version of the FC4 2.6.17 kernel (2.6.17.4). I realize this is an old kernel. For internal reasons, we cannot update to a newer version of the kernel at this time.

We have a 3ware 9550SXU card with 12 drives in JBOD mode. These 12 drives are mirrored in 6 RAID1 pairs, then striped together in one big RAID0 stripe. When we have a disk error with one of the drives in a RAID1 pair, the entire RAID0 mount locks up. We can still cd to the mount and read from it, but if we try to write anything to the mount, the process hangs in an unkillable state.

This recently happened. Here are the log messages from the disk failure:

sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
    Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 10964975
raid1: sdj1: rescheduling sector 10964912
3w-9xxx: scsi0: ERROR: (0x03:0x0202): Drive ECC error:port=9.
sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
    Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 10964975
raid1: sdd1: redirecting sector 10964912 to another mirror

When this happened, /dev/sdj1 did not fail out of its RAID. It also did not lock the system. Later:

sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
    Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 16744439
raid1: sdj1: rescheduling sector 16744376

When this happened, /dev/sdj1 did not fail out of its RAID, but it did lock writes to the big RAID0 stripe.
I manually failed /dev/sdj1 out of the RAID, and /proc/mdstat did report it as failed at that point. That did not cause writes to resume. I then tried to manually remove /dev/sdj1 from the RAID, and mdadm reported that the device was busy. A hard power-cycle was required to restore functionality to the system. This behavior is consistent whenever we hit these kinds of errors.

I have looked through the patches related to RAID1 and lockups. The patches from January/March 2008 related to RAID1 deadlocks have not seemed to help (I didn't really expect them to, as 2.6.17.4 predates bitmap code, no?).

I'd like to be able to get more debug information during cases like this, but I'm not sure what to gather or how to gather it. If anyone has any suggestions, I'd appreciate hearing them. If any of the md devs have any suggestions for patches to look at that specifically address this behavior, I'd be very grateful for that advice. As it is, I've been combing over git commits with very little success.

Thanks in advance for any assistance, knowledge, or suggestions anyone has.

Philip