Frozen drives when using SiI3726

linux-ide.vger.kernel.org archive mirror

* Frozen drives when using SiI3726
@ 2008-12-23 10:02 Tim Nufire
  2008-12-29  9:13 ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Tim Nufire @ 2008-12-23 10:02 UTC (permalink / raw)
  To: linux-ide

Hello,

I'm building a server using 9 SiI3726 based port multiplier backplanes  
connected to cards using SiI3132 (PCI Express) and SiI3124 (PCI). The  
drives are configured into 5 RAID6 groups of 9 drives each such that  
each array has 1 drive from each backplane. During the initial RAID  
synchronization one of the backplanes failed and restarted (see dmesg  
output below). While this did not disrupt the RAID groups this time,  
the reset took about 25 seconds and could easily have caused one or  
more drives to fail.

Is there anything I can do to prevent failures like this?

I'm running Debian Etchnhalf with the backport kernel
2.6.26-bpo.1-amd6. Here's my dmesg output...

[115135.002342] ata11.00: failed to read SCR 1 (Emask=0x40)
[115135.002348] ata11.01: failed to read SCR 1 (Emask=0x40)
[115135.002350] ata11.02: failed to read SCR 1 (Emask=0x40)
[115135.002353] ata11.03: failed to read SCR 1 (Emask=0x40)
[115135.002355] ata11.04: failed to read SCR 1 (Emask=0x40)
[115135.002362] ata11.05: failed to read SCR 1 (Emask=0x40)
[115135.002366] ata11.15: exception Emask 0x4 SAct 0x0 SErr 0x0 action 0x6 frozen
[115135.002424] ata11.00: exception Emask 0x100 SAct 0x5f SErr 0x0 action 0x6 frozen
[115135.002478] ata11.00: cmd 60/28:00:3f:dc:2a/00:00:48:00:00/40 tag 0 ncq 20480 in
[115135.002478]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[115135.002530] ata11.00: status: { DRDY }
[115135.002559] ata11.00: cmd 60/00:08:67:da:2a/01:00:48:00:00/40 tag 1 ncq 131072 in
[115135.002560]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[115135.002612] ata11.00: status: { DRDY }
[115135.002638] ata11.00: cmd 60/d8:10:67:db:2a/00:00:48:00:00/40 tag 2 ncq 110592 in
[115135.002639]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[115135.002691] ata11.00: status: { DRDY }
[115135.002718] ata11.00: cmd 60/00:18:67:dc:2a/01:00:48:00:00/40 tag 3 ncq 131072 in
[115135.002718]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[115135.003726] ata11.00: status: { DRDY }
[115135.003748] ata11.00: cmd 60/00:20:67:dd:2a/01:00:48:00:00/40 tag 4 ncq 131072 in
[115135.003749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[115135.003795] ata11.00: status: { DRDY }
[115135.003906] ata11.00: cmd 60/80:30:67:de:2a/00:00:48:00:00/40 tag 6 ncq 65536 in
[115135.003906]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[115135.003906] ata11.00: status: { DRDY }
[115135.003906] ata11.01: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[115135.003906] ata11.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[115135.003906] ata11.03: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[115135.003906] ata11.04: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[115135.003906] ata11.05: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[115135.003906] ata11.15: hard resetting link
[115137.199223] ata11.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
[115137.203173] ata11.00: hard resetting link
[115137.527173] ata11.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[115137.527173] ata11.01: hard resetting link
[115137.845619] ata11.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115137.845619] ata11.02: hard resetting link
[115138.165222] ata11.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115138.165226] ata11.03: hard resetting link
[115138.488248] ata11.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115138.488248] ata11.04: hard resetting link
[115138.808248] ata11.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115138.808248] ata11.05: hard resetting link
[115139.130406] ata11.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
[115139.242365] ata11.00: failed to IDENTIFY (I/O error, err_mask=0x11)
[115139.242369] ata11.00: revalidation failed (errno=-5)
[115139.242398] ata11.15: hard resetting link
[115139.242400] ata11: controller in dubious state, performing PORT_RST
[115141.474099] ata11.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
[115141.474071] ata11.00: hard resetting link
[115141.795928] ata11.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[115141.795928] ata11.01: hard resetting link
[115142.115928] ata11.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115142.115932] ata11.02: hard resetting link
[115142.435928] ata11.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115142.435928] ata11.03: hard resetting link
[115142.763091] ata11.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115142.763091] ata11.04: hard resetting link
[115143.079449] ata11.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115143.079453] ata11.05: hard resetting link
[115143.403091] ata11.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
[115143.514319] ata11.00: failed to IDENTIFY (I/O error, err_mask=0x11)
[115143.514322] ata11.00: revalidation failed (errno=-5)
[115143.514351] ata11.15: hard resetting link
[115143.514353] ata11: controller in dubious state, performing PORT_RST
[115145.746903] ata11.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
[115145.748120] ata11.00: hard resetting link
[115146.070875] ata11.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[115146.070875] ata11.01: hard resetting link
[115146.388295] ata11.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115146.388295] ata11.02: hard resetting link
[115146.709654] ata11.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115146.709658] ata11.03: hard resetting link
[115147.032813] ata11.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115147.032817] ata11.04: hard resetting link
[115147.355108] ata11.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[115147.355112] ata11.05: hard resetting link
[115147.673730] ata11.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
[115147.683449] ata11.00: configured for UDMA/100
[115147.694876] ata11.01: configured for UDMA/100
[115147.708279] ata11.02: configured for UDMA/100
[115148.077686] ata11.03: configured for UDMA/100
[115148.446001] ata11.04: configured for UDMA/100
[115148.446362] ata11: EH complete
[115148.446523] sd 10:0:0:0: [sdu] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446531] sd 10:0:0:0: [sdu] Write Protect is off
[115148.446533] sd 10:0:0:0: [sdu] Mode Sense: 00 3a 00 00
[115148.446546] sd 10:0:0:0: [sdu] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446560] sd 10:1:0:0: [sdv] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446568] sd 10:1:0:0: [sdv] Write Protect is off
[115148.446569] sd 10:1:0:0: [sdv] Mode Sense: 00 3a 00 00
[115148.446582] sd 10:1:0:0: [sdv] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446596] sd 10:2:0:0: [sdw] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446603] sd 10:2:0:0: [sdw] Write Protect is off
[115148.446604] sd 10:2:0:0: [sdw] Mode Sense: 00 3a 00 00
[115148.446617] sd 10:2:0:0: [sdw] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446631] sd 10:3:0:0: [sdx] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446638] sd 10:3:0:0: [sdx] Write Protect is off
[115148.446639] sd 10:3:0:0: [sdx] Mode Sense: 00 3a 00 00
[115148.446652] sd 10:3:0:0: [sdx] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446666] sd 10:4:0:0: [sdy] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446673] sd 10:4:0:0: [sdy] Write Protect is off
[115148.446674] sd 10:4:0:0: [sdy] Mode Sense: 00 3a 00 00
[115148.446687] sd 10:4:0:0: [sdy] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446700] sd 10:0:0:0: [sdu] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446707] sd 10:0:0:0: [sdu] Write Protect is off
[115148.446708] sd 10:0:0:0: [sdu] Mode Sense: 00 3a 00 00
[115148.446721] sd 10:0:0:0: [sdu] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446734] sd 10:1:0:0: [sdv] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446741] sd 10:1:0:0: [sdv] Write Protect is off
[115148.446742] sd 10:1:0:0: [sdv] Mode Sense: 00 3a 00 00
[115148.446755] sd 10:1:0:0: [sdv] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446767] sd 10:2:0:0: [sdw] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446774] sd 10:2:0:0: [sdw] Write Protect is off
[115148.446776] sd 10:2:0:0: [sdw] Mode Sense: 00 3a 00 00
[115148.446788] sd 10:2:0:0: [sdw] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446801] sd 10:3:0:0: [sdx] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446808] sd 10:3:0:0: [sdx] Write Protect is off
[115148.446809] sd 10:3:0:0: [sdx] Mode Sense: 00 3a 00 00
[115148.446822] sd 10:3:0:0: [sdx] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[115148.446834] sd 10:4:0:0: [sdy] 1953525168 512-byte hardware sectors (1000205 MB)
[115148.446841] sd 10:4:0:0: [sdy] Write Protect is off
[115148.446843] sd 10:4:0:0: [sdy] Mode Sense: 00 3a 00 00
[115148.446855] sd 10:4:0:0: [sdy] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Thanks,
Tim

* Re: Frozen drives when using SiI3726
  2008-12-23 10:02 Frozen drives when using SiI3726 Tim Nufire
@ 2008-12-29  9:13 ` Tejun Heo
  2008-12-29 19:20   ` Grant Grundler
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2008-12-29  9:13 UTC (permalink / raw)
  To: Tim Nufire; +Cc: linux-ide

Hi,

Tim Nufire wrote:
> Hello,
> 
> I'm building a server using 9 SiI3726 based port multiplier backplanes
> connected to cards using SiI3132 (PCI Express) and SiI3124 (PCI). The
> drives are configured into 5 RAID6 groups of 9 drives each such that
> each array has 1 drive from each backplane. During the initial RAID
> synchronization one of the backplanes failed and restarted (see dmesg
> output below). While this did not disrupt the RAID groups this time, the
> reset took about 25 seconds and could easily have caused one or more
> drives to fail.
> 
> Is there anything I can do to prevent failures like this?

The reset was triggered by a timeout, which probably took around 30
secs or more, so the array probably experienced a disruption of about
a minute.  The failure latency is a bit unfortunate at the
moment.  :-(
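
As a rough cross-check, the duration of the error-handling phase can be
read off the dmesg timestamps. The sketch below is only an illustration:
the helper name eh_duration is made up here, the sample lines are copied
from the log earlier in the thread, and the command timeout that
preceded the first logged error is not visible in the log, so it is not
included in the result.

```python
import re

# Sample lines copied from the dmesg output earlier in this thread.
LOG = """\
[115135.002342] ata11.00: failed to read SCR 1 (Emask=0x40)
[115135.003906] ata11.15: hard resetting link
[115139.242400] ata11: controller in dubious state, performing PORT_RST
[115143.514353] ata11: controller in dubious state, performing PORT_RST
[115148.446362] ata11: EH complete
"""

# "[seconds.microseconds] message", as printed by the kernel log.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def eh_duration(log_text):
    """Seconds from the first timestamped line to the 'EH complete' line.

    Hypothetical helper for illustration; returns None if no
    'EH complete' line is present.
    """
    first = None
    for line in log_text.splitlines():
        m = TIMESTAMP.match(line)
        if not m:
            continue
        stamp = float(m.group(1))
        if first is None:
            first = stamp
        if m.group(2).endswith("EH complete"):
            return stamp - first
    return None


print(f"error handling took about {eh_duration(LOG):.1f} s")
```

For the excerpt above this comes to roughly 13.4 seconds between the
first error line and "EH complete"; the preceding timeout has to be
added on top of that to estimate the total stall seen by the array.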

Also, the timeout is one of the most generic failure modes there is.
It can be triggered by virtually anything, including transmission
failure, power quality issues, drive problems and whatnot.  So, it's
impossible to tell what went wrong with the provided information.  It
could be a one-time fluke - e.g. bad sectors which developed during
storage and shipping, so that the RAID sync makes the firmware think
for a long time about what to do with them - or something more
systematic - e.g. a slightly bad connection on the backplane side or
sub-par power which chokes slightly when all drives are pulling juice
out of it.

Unfortunately, the only way to debug this would be to keep an eye on
whether such failures repeat and, if so, when and where - whether they
always happen on the same chassis, slot or drive (by exchanging
drives), etc...
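
One way to keep track of where failures occur is to tally the libata
"exception" lines per port and link over time. The following is a
minimal sketch, not an established tool: the function name
tally_exceptions is hypothetical, the regular expression is written
against the message format seen in the log above, and it assumes
unwrapped lines as dmesg prints them.

```python
import re
from collections import Counter

# Matches e.g. "[115135.002424] ata11.00: exception Emask 0x100 ..."
EXCEPTION = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(ata\d+)\.(\d+): exception Emask (0x[0-9a-fA-F]+)"
)


def tally_exceptions(lines):
    """Count exception reports per (port, link) pair.

    Hypothetical helper: 'port' is the host port name (ata11) and
    'link' is the number after the dot (00..05 for the drives here).
    """
    counts = Counter()
    for line in lines:
        m = EXCEPTION.match(line)
        if m:
            port, link, _emask = m.groups()
            counts[(port, link)] += 1
    return counts


# Sample lines taken from the log in this thread, unwrapped.
SAMPLE = [
    "[115135.002366] ata11.15: exception Emask 0x4 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "[115135.002424] ata11.00: exception Emask 0x100 SAct 0x5f SErr 0x0 action 0x6 frozen",
    "[115135.002530] ata11.00: status: { DRDY }",
    "[115135.003906] ata11.01: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen",
]

for (port, link), n in sorted(tally_exceptions(SAMPLE).items()):
    print(f"{port}.{link}: {n} exception(s)")
```

Running this over saved logs after each incident, and again after
swapping drives between slots, would show whether the reports follow a
particular backplane, slot or drive.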

Please let us know when you find out more.

Happy new year.

-- 
tejun

* Re: Frozen drives when using SiI3726
  2008-12-29  9:13 ` Tejun Heo
@ 2008-12-29 19:20   ` Grant Grundler
  2009-01-02  3:28     ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Grant Grundler @ 2008-12-29 19:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Tim Nufire, linux-ide

On Mon, Dec 29, 2008 at 1:13 AM, Tejun Heo <tj@kernel.org> wrote:
> Hi,
>
> Tim Nufire wrote:
>> Hello,
>>
>> I'm building a server using 9 SiI3726 based port multiplier backplanes
>> connected to cards using SiI3132 (PCI Express) and SiI3124 (PCI). The
>> drives are configured into 5 RAID6 groups of 9 drives each such that
>> each array has 1 drive from each backplane. During the initial RAID
>> synchronization one of the backplanes failed and restarted (see dmesg
>> output below). While this did not disrupt the RAID groups this time, the
>> reset took about 25 seconds and could easily have caused one or more
>> drives to fail.
>>
>> Is there anything I can do to prevent failures like this?
>
> The reset was triggered by a timeout, which probably took around 30
> secs or more, so the array probably experienced a disruption of about
> a minute.  The failure latency is a bit unfortunate at the
> moment.  :-(
>
> Also, the timeout is one of the most generic failure modes there is.
> It can be triggered by virtually anything, including transmission
> failure, power quality issues, drive problems and whatnot.  So, it's
> impossible to tell what went wrong with the provided information.

Tejun,
The only other possible major issue not listed above that I can think
of is the SiI3726 firmware rev. The data sheet (*), under "EEPROM
Specifications" on page 20, says the firmware is versioned, but it
doesn't specify how to read the version.
Can you ask Silicon Image to publish how to read the firmware rev?

(*) Data sheet available from
http://www.siliconimage.com/docs/SiI-DS-0121-C1.pdf

Also SATA_PMP_GSCR_REV  (Spec compliance, not FW rev) seems
to be used twice:

include/linux/ata.h:
#define sata_pmp_gscr_rev(gscr)         (((gscr)[SATA_PMP_GSCR_REV] >> 8) & 0xff)

drivers/ata/libata-pmp.c:sata_pmp_spec_rev_str()

Would it make sense to rename one or the other so they match?
The difference is one returns a string and the other a u8 value.
Both take the same parameter.

thanks,
grant

>  It
> could be a one-time fluke - e.g. bad sectors which developed during
> storage and shipping, so that the RAID sync makes the firmware think
> for a long time about what to do with them - or something more
> systematic - e.g. a slightly bad connection on the backplane side or
> sub-par power which chokes slightly when all drives are pulling juice
> out of it.
>
> Unfortunately, the only way to debug this would be to keep an eye on
> whether such failures repeat and, if so, when and where - whether they
> always happen on the same chassis, slot or drive (by exchanging
> drives), etc...
>
> Please let us know when you find out more.
>
> Happy new year.
>
> --
> tejun

* Re: Frozen drives when using SiI3726
  2008-12-29 19:20   ` Grant Grundler
@ 2009-01-02  3:28     ` Tejun Heo
  0 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2009-01-02  3:28 UTC (permalink / raw)
  To: Grant Grundler; +Cc: Tim Nufire, linux-ide

Hello, Grant.

Grant Grundler wrote:
>> Also, the timeout is one of the most generic failure modes there is.
>> It can be triggered by virtually anything, including transmission
>> failure, power quality issues, drive problems and whatnot.  So, it's
>> impossible to tell what went wrong with the provided information.
> 
> Tejun,
> The only other possible major issue not listed above that I can think
> of is the SiI3726 firmware rev. The data sheet (*), under "EEPROM
> Specifications" on page 20, says the firmware is versioned, but it
> doesn't specify how to read the version.

I haven't seen firmware-related transmission failures yet, but it's not
like I have lots of 3726s with different firmware revisions.

> Can you ask Silicon Image to publish how to read the firmware rev?

IIRC, you can read it from the SteelVine management utility.  You'll
probably need a Windows installation though.

> Also SATA_PMP_GSCR_REV  (Spec compliance, not FW rev) seems
> to be used twice:
> 
> include/linux/ata.h:
> #define sata_pmp_gscr_rev(gscr)         (((gscr)[SATA_PMP_GSCR_REV] >> 8) & 0xff)
> 
> drivers/ata/libata-pmp.c:sata_pmp_spec_rev_str()
> 
> Would it make sense to rename one or the other so they match?
> The difference is one returns a string and the other a u8 value.
> Both take the same parameter.

No, spec_rev_str() returns the spec compliance string (lower bits)
while gscr_rev() returns the revision level of the port multiplier
(whatever that means).
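
To make the distinction concrete, here is a small userspace sketch of
the two decodings, written in Python rather than kernel C. The upper
byte extraction follows the sata_pmp_gscr_rev() macro quoted above. The
register index (1) and the bit-to-string mapping for the low bits are
assumptions recalled from the kernel source of that era, not confirmed
by this thread, and the register value used is made up.

```python
# Assumed index of the revision register within the GSCR array.
SATA_PMP_GSCR_REV = 1


def pmp_gscr_rev(gscr):
    """Revision level of the port multiplier: bits 15:8 of GSCR[REV].

    Mirrors the sata_pmp_gscr_rev() macro quoted in the thread.
    """
    return (gscr[SATA_PMP_GSCR_REV] >> 8) & 0xFF


def pmp_spec_rev_str(gscr):
    """Spec compliance string derived from the low bits of GSCR[REV].

    The bit assignments below are an assumption for illustration.
    """
    rev = gscr[SATA_PMP_GSCR_REV]
    if rev & (1 << 3):
        return "1.2"
    if rev & (1 << 2):
        return "1.1"
    if rev & (1 << 1):
        return "1.0"
    return "<unknown>"


# Made-up register contents: only index 1 matters for this example.
gscr = [0x0, 0x1706, 0x0]

print(f"revision level: {pmp_gscr_rev(gscr):#04x}")
print(f"spec compliance: {pmp_spec_rev_str(gscr)}")
```

Both helpers read the same register but different bit fields, which is
why the two kernel names do not describe the same quantity.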

Thanks.

-- 
tejun
