Re: [Fwd: Re: libata , Silicon Image 3124]

linux-ide.vger.kernel.org archive mirror

* Re: [Fwd: Re: libata , Silicon Image 3124]
       [not found] <473A5B03.2020808@gmail.com>
@ 2008-01-06 10:55 ` Kasimir Müller
  2008-01-08  3:47   ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Kasimir Müller @ 2008-01-06 10:55 UTC
  To: Tejun Heo, IDE/ATA development list, christian.kuehn

[-- Attachment #1: Type: text/plain, Size: 4728 bytes --]

Hi Tejun,

Old communication appended below.

I wish you a Happy Xmas and a successful New Year.

I spent some time over Christmas investigating the problem further. I
bought a new 500GB disk and put all data on this disk.
This is also continuously watched by nagios and cacti.
Findings:
1.) All 5 disks in the external case, connected via port multiplier and
sil24 card, have excellent health status according to smartd.
2.) I get no(!!!!) errors at all if I use the disks as single drives or
with lvm. I verified this by copying large amounts of data (100-200GB)
with rsync and cp -av, and by running bonnie++ on single drives and
simultaneously on various combinations of drives.
3.) I get the errors as soon as I use raid. Same errors with raid0 (2
disks), raid1 (2 disks) and raid5 (3 disks), in any combination of drives.
4.) The errors usually appear first during mkfs (same with ext3 and
reiserfs), then again after writing about 10-50 GB to the raid, and
then repeat at 5 to 10 minute intervals depending on disk activity.
5.) I used kernel 2.6.23.1 with your latest patch: same result.
6.) I used kernel 2.6.24-rc6: same result.
7.) During the tests I recorded md5 sums for all files: no data
corruption (!!!), so maybe I can live with it.
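
The md5 check in point 7 can be sketched roughly as follows. This is only an illustration of the idea, not the procedure actually used in the tests: the function names (`md5_of_file`, `build_manifest`, `compare_manifests`, `demo`) and the temporary directories are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_manifests(source, copy):
    """Return sorted paths that are missing from, or differ in, the copy."""
    return sorted(p for p, d in source.items() if copy.get(p) != d)


def demo():
    """Build two small trees, corrupt one file in the copy, and compare."""
    with tempfile.TemporaryDirectory() as src, \
            tempfile.TemporaryDirectory() as dst:
        for root in (src, dst):
            with open(os.path.join(root, "a.bin"), "wb") as f:
                f.write(b"x" * 4096)
            with open(os.path.join(root, "b.bin"), "wb") as f:
                f.write(b"y" * 4096)
        # Flip the last byte of one file in the copy only.
        with open(os.path.join(dst, "b.bin"), "wb") as f:
            f.write(b"y" * 4095 + b"z")
        return compare_manifests(build_manifest(src), build_manifest(dst))
```

In practice one would build the manifest on the source disk before copying and on the raid after copying; an empty result from `compare_manifests` means no file differs.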


linux:/var/log # cat /proc/interrupts
           CPU0
  0:   12457604    XT-PIC-XT        timer
  1:          8    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  4:     186738    XT-PIC-XT        serial
  5:   13328470    XT-PIC-XT        sata_via, ehci_hcd:usb5, VIA8237, fcpci, eth0
  6:          5    XT-PIC-XT        floppy
  8:          0    XT-PIC-XT        rtc
  9:          0    XT-PIC-XT        acpi
 10:    8019753    XT-PIC-XT        sata_sil24
 11:        114    XT-PIC-XT        uhci_hcd:usb1, uhci_hcd:usb2
 14:     468588    XT-PIC-XT        libata
 15:          0    XT-PIC-XT        libata, uhci_hcd:usb3, uhci_hcd:usb4
NMI:          0
LOC:   12457724
ERR:          1
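
To see at a glance which interrupt lines are shared, a listing like the one above can be parsed with a short script. This is a minimal sketch that assumes the single-CPU, single chip-column layout shown in this message; newer kernels print additional columns. `parse_interrupts`, `shared_irqs` and `irqs_for` are hypothetical helper names.

```python
SAMPLE = """\
           CPU0
  0:   12457604    XT-PIC-XT        timer
  1:          8    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  4:     186738    XT-PIC-XT        serial
  5:   13328470    XT-PIC-XT        sata_via, ehci_hcd:usb5, VIA8237, fcpci, eth0
  6:          5    XT-PIC-XT        floppy
  8:          0    XT-PIC-XT        rtc
  9:          0    XT-PIC-XT        acpi
 10:    8019753    XT-PIC-XT        sata_sil24
 11:        114    XT-PIC-XT        uhci_hcd:usb1, uhci_hcd:usb2
 14:     468588    XT-PIC-XT        libata
 15:          0    XT-PIC-XT        libata, uhci_hcd:usb3, uhci_hcd:usb4
NMI:          0
LOC:   12457724
ERR:          1
"""


def parse_interrupts(text):
    """Return {irq: (total_count, [handler, ...])} for numbered IRQ lines."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # one count column per CPU in the header
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        # Assumed layout: <ncpu counts> <chip name> <comma-separated handlers>
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        total = sum(int(count) for count in fields[:ncpu])
        handlers = [h.strip() for h in fields[ncpu + 1].split(",")]
        table[int(label)] = (total, handlers)
    return table


def shared_irqs(table):
    """Return {irq: handlers} for lines with more than one handler."""
    return {irq: h for irq, (_total, h) in table.items() if len(h) > 1}


def irqs_for(table, driver):
    """Return the sorted IRQ numbers on which the given driver is registered."""
    return sorted(irq for irq, (_total, h) in table.items() if driver in h)
```

With the listing from this message, `shared_irqs` reports IRQs 5, 11 and 15, and `irqs_for(table, "sata_sil24")` returns only IRQ 10.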


A complete log of today's kernel errors is attached as a gzip file.

In the opensuse mailing-list some people reported related errors.
http://lists.opensuse.org/opensuse-de/2007-12/msg00939.html

Can you make something of this? Do you need any more information?
(Please be specific, because I am not a Linux guru.)

Yours sincerely

Kasimir Mueller

> kjm@kasimir.mueller.de didn't work.  Forwarding to the original address.
>   
The correct address is kjm@kasimir-mueller.de
>   
>
> ------------------------------------------------------------------------
>
> Subject: Re: libata , Silicon Image 3124
> From: Tejun Heo <htejun@gmail.com>
> Date: Wed, 14 Nov 2007 11:10:57 +0900
> To: Kasimir Mueller <kjm@kasimir.mueller.de>
> CC: IDE/ATA development list <linux-ide@vger.kernel.org>
>
> Return-Path: <htejun@gmail.com>
> X-Original-To: htejun@gmail.com
> Delivered-To: tj@htj.dyndns.org
> Received: from [127.0.0.2] (htj.dyndns.org [127.0.0.2]) by htj.dyndns.org
>  (Postfix) with ESMTP id 50E1813A81CF; Wed, 14 Nov 2007 11:10:57 +0900
>  (KST)
> Message-ID: <473A5931.4040704@gmail.com>
> User-Agent: Thunderbird 2.0.0.6 (X11/20070801)
> MIME-Version: 1.0
> References: <4738A231.5010700@kasimir-mueller.de> <4738FC4C.7050103@gmail.com>
>  <473A11B4.5030409@kasimir.mueller.de>
> In-Reply-To: <473A11B4.5030409@kasimir.mueller.de>
> X-Enigmail-Version: 0.95.3
> Content-Type: text/plain; charset=ISO-8859-15
> Content-Transfer-Encoding: 8bit
>
>
> [linux-ide added back. please don't drop cc list]
>
> Kasimir Mueller wrote:
>   
>> Tejun Heo wrote:
>>     
>>> Kasimir Müller wrote:
>>>  
>>>       
>>>> Nov 12 19:28:42 linux kernel: ata6.02: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
>>>> Nov 12 19:28:42 linux kernel:          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>     
>>>>         
>>> That's flush timing out, which isn't good.
>>>
>>> 1. Are the errors localized to ata6.02?
>>>   
>>>       
>> I think it's ata6.01
>>     
>
> The above message says 6.02 tho.  Anyways, if the errors are localized
> to one drive, please swap that drive with another drive.  Do the errors
> follow the drive or stay with the slot?
>
>   
>>> 2. Is FLUSH always involved (ie. is cmd always ea)?
>>>   
>>>       
>> yes
>>     
>
> Hmmm... okay.  That sounds like a dying drive to me.  What does
> 'smartctl -a /dev/sdX' say?
>
>   
>>> 3. If you disable NCQ, what do errors look like?
>>>   
>>>       
>> I put all disk drives on the NCQ blacklist, with no change
>>     
>
> So, NCQ isn't the problem.
>
> ATM, it sounds like a dying drive to me.  Please double check the errors
> are localized to one drive, check smartctl log and try to verify as
> written above.
>
> Thanks.
>
>   
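
The cmd/res pair quoted above can be decoded mechanically. The sketch below covers only the ATA opcodes named in this thread, the common ATA status-register bits, and the first nine AC_ERR_* bits from libata's Emask field; the regular expressions and the `decode` helper are assumptions made for this illustration, not part of libata.

```python
import re

# ATA opcodes for the commands mentioned in this thread.
ATA_COMMANDS = {
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xE5: "CHECK POWER MODE",
    0xB0: "SMART",
}

# ATA status register bits (first byte of the "res" field).
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x08, "DRQ"), (0x01, "ERR"),
]

# libata error mask bits as printed after "Emask".
EMASK_BITS = [
    (1 << 0, "AC_ERR_DEV"), (1 << 1, "AC_ERR_HSM"),
    (1 << 2, "AC_ERR_TIMEOUT"), (1 << 3, "AC_ERR_MEDIA"),
    (1 << 4, "AC_ERR_ATA_BUS"), (1 << 5, "AC_ERR_HOST_BUS"),
    (1 << 6, "AC_ERR_SYSTEM"), (1 << 7, "AC_ERR_INVALID"),
    (1 << 8, "AC_ERR_OTHER"),
]

# Hypothetical patterns matching the two log lines as they appear above.
CMD_RE = re.compile(r"(ata\d+(?:\.\d+)?): cmd ([0-9a-f]{2})/")
RES_RE = re.compile(r"res ([0-9a-f]{2})/.*Emask (0x[0-9a-f]+)")


def names_for(value, table):
    """Return the names of all bits from table that are set in value."""
    return [name for bit, name in table if value & bit]


def decode(cmd_line, res_line):
    """Decode one libata 'cmd' line and its matching 'res' line."""
    cmd = CMD_RE.search(cmd_line)
    res = RES_RE.search(res_line)
    if not cmd or not res:
        raise ValueError("not a libata cmd/res pair")
    opcode = int(cmd.group(2), 16)
    return {
        "device": cmd.group(1),
        "command": ATA_COMMANDS.get(opcode, "unknown (0x%02x)" % opcode),
        "status": names_for(int(res.group(1), 16), STATUS_BITS),
        "errors": names_for(int(res.group(2), 16), EMASK_BITS),
    }


CMD_LINE = ("Nov 12 19:28:42 linux kernel: ata6.02: cmd "
            "ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0")
RES_LINE = ("Nov 12 19:28:42 linux kernel:          res "
            "40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)")
```

Running `decode(CMD_LINE, RES_LINE)` on the pair from this thread identifies the device as ata6.02, the command as FLUSH CACHE EXT and the error as AC_ERR_TIMEOUT, which matches the "(timeout)" text the kernel prints.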


[-- Attachment #2: kernel.log.today.gz --]
[-- Type: application/x-gzip, Size: 5442 bytes --]


* Re: [Fwd: Re: libata , Silicon Image 3124]
  2008-01-06 10:55 ` [Fwd: Re: libata , Silicon Image 3124] Kasimir Müller
@ 2008-01-08  3:47   ` Tejun Heo
  2008-01-09 17:31     ` Kasimir Müller
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2008-01-08  3:47 UTC
  To: Kasimir Müller; +Cc: IDE/ATA development list, christian.kuehn

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

Kasimir Müller wrote:
> Hi Tejun,
> 
> Old communication appended below.
> 
> I wish you a Happy Xmas and a successful New Year.
> 
> I spent some time over Christmas investigating the problem further. I
> bought a new 500GB disk and put all data on this disk.
> This is also continuously watched by nagios and cacti.
> Findings:
> 1.) All 5 disks in the external case, connected via port multiplier and
> sil24 card, have excellent health status according to smartd.
> 2.) I get no(!!!!) errors at all if I use the disks as single drives or
> with lvm. I verified this by copying large amounts of data (100-200GB)
> with rsync and cp -av, and by running bonnie++ on single drives and
> simultaneously on various combinations of drives.
> 3.) I get the errors as soon as I use raid. Same errors with raid0 (2
> disks), raid1 (2 disks) and raid5 (3 disks), in any combination of drives.
> 4.) The errors usually appear first during mkfs (same with ext3 and
> reiserfs), then again after writing about 10-50 GB to the raid, and
> then repeat at 5 to 10 minute intervals depending on disk activity.
> 5.) I used kernel 2.6.23.1 with your latest patch: same result.
> 6.) I used kernel 2.6.24-rc6: same result.
> 7.) During the tests I recorded md5 sums for all files: no data
> corruption (!!!), so maybe I can live with it.

Please apply the attached patch on top of 2.6.24-rc6 and report whether
anything changes.

Thanks.

-- 
tejun

[-- Attachment #2: propagate-timeout-to-host-link.patch --]
[-- Type: text/x-patch, Size: 1213 bytes --]

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index f0124a8..74269ed 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1733,11 +1733,15 @@ static void ata_eh_link_autopsy(struct ata_link *link)
 		ehc->i.action &= ~ATA_EH_PERDEV_MASK;
 	}
 
-	/* consider speeding down */
+	/* propagate timeout to host link */
+	if ((all_err_mask & AC_ERR_TIMEOUT) && !ata_is_host_link(link))
+		ap->link.eh_context.i.err_mask |= AC_ERR_TIMEOUT;
+
+	/* record error and consider speeding down */
 	dev = ehc->i.dev;
-	if (!dev && ata_link_max_devices(link) == 1 &&
-	    ata_dev_enabled(link->device))
-		dev = link->device;
+	if (!dev && ((ata_link_max_devices(link) == 1 &&
+		      ata_dev_enabled(link->device))))
+	    dev = link->device;
 
 	if (dev)
 		ehc->i.action |= ata_eh_speed_down(dev, is_io, all_err_mask);
@@ -1759,8 +1763,14 @@ void ata_eh_autopsy(struct ata_port *ap)
 {
 	struct ata_link *link;
 
-	__ata_port_for_each_link(link, ap)
+	ata_port_for_each_link(link, ap)
 		ata_eh_link_autopsy(link);
+
+	/* Autopsy of fanout ports can affect host link autopsy.
+	 * Perform host link autopsy last.
+	 */
+	if (ap->nr_pmp_links)
+		ata_eh_link_autopsy(&ap->link);
 }
 
 /**


* Re: [Fwd: Re: libata , Silicon Image 3124]
  2008-01-08  3:47   ` Tejun Heo
@ 2008-01-09 17:31     ` Kasimir Müller
  2008-01-10  4:08       ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Kasimir Müller @ 2008-01-09 17:31 UTC
  To: Tejun Heo; +Cc: IDE/ATA development list, christian.kuehn

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

On 08.01.2008 04:47, Tejun Heo wrote:
> Kasimir Müller wrote:
>   
>> Hi Tejun,
>>
>> Old communication appended below.
>>
>> I wish you a Happy Xmas and a successful New Year.
>>
>> I spent some time over Christmas investigating the problem further. I
>> bought a new 500GB disk and put all data on this disk.
>> This is also continuously watched by nagios and cacti.
>> Findings:
>> 1.) All 5 disks in the external case, connected via port multiplier and
>> sil24 card, have excellent health status according to smartd.
>> 2.) I get no(!!!!) errors at all if I use the disks as single drives or
>> with lvm. I verified this by copying large amounts of data (100-200GB)
>> with rsync and cp -av, and by running bonnie++ on single drives and
>> simultaneously on various combinations of drives.
>> 3.) I get the errors as soon as I use raid. Same errors with raid0 (2
>> disks), raid1 (2 disks) and raid5 (3 disks), in any combination of drives.
>> 4.) The errors usually appear first during mkfs (same with ext3 and
>> reiserfs), then again after writing about 10-50 GB to the raid, and
>> then repeat at 5 to 10 minute intervals depending on disk activity.
>> 5.) I used kernel 2.6.23.1 with your latest patch: same result.
>> 6.) I used kernel 2.6.24-rc6: same result.
>> 7.) During the tests I recorded md5 sums for all files: no data
>> corruption (!!!), so maybe I can live with it.
>>     
>
> Please apply the attached patch on top of 2.6.24-rc6 and report whether
> anything changes.
>
> Thanks.
>
>   
Hi Tejun,
thanks very much again for your time and readiness to help.

I applied the patch as you proposed without problems.

I attach the kernel logs of the past few days. As far as I can see, with
the patched kernel the errors appear only under heavy load.

If you want, I could provide a (temporary) ssh login to my machine.
Maybe it would make it easier for you to debug the problem.
If you want to use it, I would need your public ssh key and your
preferred login name, and I would then send you the login details.

Thanks again,
yours Kasimir.




[-- Attachment #2: kernel.log.gz --]
[-- Type: application/x-gzip, Size: 99041 bytes --]


* Re: [Fwd: Re: libata , Silicon Image 3124]
  2008-01-09 17:31     ` Kasimir Müller
@ 2008-01-10  4:08       ` Tejun Heo
  0 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2008-01-10  4:08 UTC
  To: Kasimir Müller; +Cc: IDE/ATA development list, christian.kuehn

Hello,

Kasimir Müller wrote:
> thanks very much again for your time and readiness to help.
> 
> I applied the patch as you proposed without problems.
> 
> I attach the kernel logs of the past few days. As far as I can see, with
> the patched kernel the errors appear only under heavy load.
> 
> If you want, I could provide a (temporary) ssh login to my machine.
> Maybe it would make it easier for you to debug the problem.
> If you want to use it, I would need your public ssh key and your
> preferred login name, and I would then send you the login details.

Your log is pretty interesting.  Lots of timeouts are on FLUSH and
previously other non-data commands like CHECK POWER or SMART, which is
very unusual.  I wonder what's going on.  Yeap, I wanna log in and take
a look.  I'll send you my pubkey in a private mail.

Thanks.

-- 
tejun
