* Re: [Fwd: Re: libata , Silicon Image 3124]
       [not found] <473A5B03.2020808@gmail.com>
@ 2008-01-06 10:55 ` Kasimir Müller
  2008-01-08  3:47   ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Kasimir Müller @ 2008-01-06 10:55 UTC
To: Tejun Heo, IDE/ATA development list, christian.kuehn

[-- Attachment #1: Type: text/plain, Size: 4728 bytes --]

Hi Tejun,

Old communication appended below.

I wish you a Happy Xmas and a successful New Year.

I spent some time during Christmas to further investigate the problem. I
bought a new 500GB disk and put all data on this disk.
This is also continuously watched by nagios and cacti.
Then:
1.) All 5 disks in the external case connected via Portmapper and sil24
    card have excellent health-status with smartd.
2.) I get no(!!!!) errors at all if I use the disks as single drives or
    with lvm. I verified this by copying large amounts of data (100-200GB)
    with rsync, cp -av and running bonnie++, single and simultaneously,
    to various combinations of drives.
3.) I get the errors as soon as I use raid. Same errors with raid0 (2
    disks), raid1 (2 disks) and raid5 (3 disks) in any combination of the
    drives.
4.) The errors usually appear first during mkfs (same with ext3 and
    reiserfs) and then after writing about 10-50 GB to the raid, and then
    repeat at 5 to 10 minute intervals according to the disk activity.
5.) I used kernel 2.6.23.1 with your latest patch: same result.
6.) I used kernel 2.6.24-rc6: same result.
7.) During the tests I marked all files with md5-sums: no data
    corruption (!!!), so maybe I can live with it.
linux:/var/log # cat /proc/interrupts
           CPU0
  0:   12457604    XT-PIC-XT        timer
  1:          8    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  4:     186738    XT-PIC-XT        serial
  5:   13328470    XT-PIC-XT        sata_via, ehci_hcd:usb5, VIA8237, fcpci, eth0
  6:          5    XT-PIC-XT        floppy
  8:          0    XT-PIC-XT        rtc
  9:          0    XT-PIC-XT        acpi
 10:    8019753    XT-PIC-XT        sata_sil24
 11:        114    XT-PIC-XT        uhci_hcd:usb1, uhci_hcd:usb2
 14:     468588    XT-PIC-XT        libata
 15:          0    XT-PIC-XT        libata, uhci_hcd:usb3, uhci_hcd:usb4
NMI:          0
LOC:   12457724
ERR:          1

The complete log of kernel errors from today is appended as gz.

In the opensuse mailing-list some people reported related errors:
http://lists.opensuse.org/opensuse-de/2007-12/msg00939.html

Can you make something out of this? Do you need any more information
(please detail, because I am not a linux-guru)?

Yours sincerely
Kasimir Mueller

> kjm@kasimir.mueller.de didn't work.  Forwarding to the original address.
> it is kjm@kasimir-mueller.de
>
> ------------------------------------------------------------------------
>
> Subject: Re: libata , Silicon Image 3124
> From: Tejun Heo <htejun@gmail.com>
> Date: Wed, 14 Nov 2007 11:10:57 +0900
> To: Kasimir Mueller <kjm@kasimir.mueller.de>
> CC: IDE/ATA development list <linux-ide@vger.kernel.org>
> Return-Path: <htejun@gmail.com>
> X-Original-To: htejun@gmail.com
> Delivered-To: tj@htj.dyndns.org
> Received: from [127.0.0.2] (htj.dyndns.org [127.0.0.2]) by htj.dyndns.org
>   (Postfix) with ESMTP id 50E1813A81CF; Wed, 14 Nov 2007 11:10:57 +0900 (KST)
> Message-ID: <473A5931.4040704@gmail.com>
> User-Agent: Thunderbird 2.0.0.6 (X11/20070801)
> MIME-Version: 1.0
> References: <4738A231.5010700@kasimir-mueller.de> <4738FC4C.7050103@gmail.com>
>   <473A11B4.5030409@kasimir.mueller.de>
> In-Reply-To: <473A11B4.5030409@kasimir.mueller.de>
> X-Enigmail-Version: 0.95.3
> Content-Type: text/plain; charset=ISO-8859-15
> Content-Transfer-Encoding: 8bit
>
> [linux-ide added back.  please don't drop cc list]
>
> Kasimir Mueller wrote:
>> Tejun Heo wrote:
>>> Kasimir Müller wrote:
>>>> Nov 12 19:28:42 linux kernel: ata6.02: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
>>>> Nov 12 19:28:42 linux kernel:          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>
>>> That's flush timing out, which isn't good.
>>>
>>> 1. Are the errors localized to ata6.02?
>>
>> I think it's ata6.01
>
> The above message says 6.02 tho.  Anyways, if the errors are localized
> to one drive, please swap that drive with another drive.  Do the errors
> follow the drive or stay with the slot?
>
>>> 2. Is FLUSH always involved (ie. is cmd always ea)?
>>
>> yes
>
> Hmmm... okay.  That sounds like a dying drive to me.  What does
> 'smartctl -a /dev/sdX' say?
>
>>> 3. If you disable NCQ, what do errors look like?
>>
>> I put all disk-drives in the blacklist, without a change
>
> So, NCQ isn't the problem.
>
> ATM, it sounds like a dying drive to me.  Please double check the errors
> are localized to one drive, check smartctl log and try to verify as
> written above.
>
> Thanks.

[-- Attachment #2: kernel.log.today.gz --]
[-- Type: application/x-gzip, Size: 5442 bytes --]
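For anyone triaging similar logs, the two checks asked for in the quoted exchange (which device is affected, and whether the command is always `ea`) can be sketched as a small C helper. This is only an illustrative sketch, not part of libata: `struct ata_log_cmd`, `parse_ata_cmd_line` and `is_flush_ext` are hypothetical names, and the line layout is assumed from the two log lines quoted in this message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one "ataP.DD: cmd ..." line from the kernel log. */
struct ata_log_cmd {
	unsigned int port;    /* P in ataP.DD */
	unsigned int dev;     /* DD in ataP.DD (link number behind the port) */
	unsigned int opcode;  /* first byte after "cmd", e.g. 0xea */
};

/* ATA opcode for FLUSH CACHE EXT, the "ea" referred to in the thread. */
#define ATA_OP_FLUSH_EXT 0xea

/*
 * Parse a line such as
 *   "... kernel: ata6.02: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 ..."
 * Returns 1 and fills *out on success, 0 if the line does not match.
 */
static int parse_ata_cmd_line(const char *line, struct ata_log_cmd *out)
{
	const char *p;
	unsigned int port, dev, opcode;

	assert(line != NULL && out != NULL);

	/* Try every "ata" substring until one has the form "ataP.DD: cmd XX/". */
	for (p = strstr(line, "ata"); p != NULL; p = strstr(p + 1, "ata")) {
		if (sscanf(p, "ata%u.%u: cmd %x/", &port, &dev, &opcode) == 3) {
			out->port = port;
			out->dev = dev;
			out->opcode = opcode;
			return 1;
		}
	}
	return 0;
}

/* True if the parsed command is FLUSH CACHE EXT. */
static int is_flush_ext(const struct ata_log_cmd *c)
{
	return c->opcode == ATA_OP_FLUSH_EXT;
}
```

Feeding each `cmd` line of the kernel log through `parse_ata_cmd_line` and grouping the results by `port` and `dev` would show whether the errors are confined to a single device such as ata6.02.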
* Re: [Fwd: Re: libata , Silicon Image 3124]
  2008-01-06 10:55 ` [Fwd: Re: libata , Silicon Image 3124] Kasimir Müller
@ 2008-01-08  3:47   ` Tejun Heo
  2008-01-09 17:31     ` Kasimir Müller
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2008-01-08  3:47 UTC
To: Kasimir Müller; +Cc: IDE/ATA development list, christian.kuehn

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

Kasimir Müller wrote:
> Hi Tejun,
>
> Old communication appended below.
>
> I wish you a Happy Xmas and a successful New Year.
>
> I spent some time during Christmas to further investigate the problem. I
> bought a new 500GB disk and put all data on this disk.
> This is also continuously watched by nagios and cacti.
> Then:
> 1.) All 5 disks in the external case connected via Portmapper and sil24
>     card have excellent health-status with smartd.
> 2.) I get no(!!!!) errors at all if I use the disks as single drives or
>     with lvm. I verified this by copying large amounts of data (100-200GB)
>     with rsync, cp -av and running bonnie++, single and simultaneously,
>     to various combinations of drives.
> 3.) I get the errors as soon as I use raid. Same errors with raid0 (2
>     disks), raid1 (2 disks) and raid5 (3 disks) in any combination of the
>     drives.
> 4.) The errors usually appear first during mkfs (same with ext3 and
>     reiserfs) and then after writing about 10-50 GB to the raid, and then
>     repeat at 5 to 10 minute intervals according to the disk activity.
> 5.) I used kernel 2.6.23.1 with your latest patch: same result.
> 6.) I used kernel 2.6.24-rc6: same result.
> 7.) During the tests I marked all files with md5-sums: no data
>     corruption (!!!), so maybe I can live with it.

Please apply the attached patch on top of 2.6.24-rc6 and report whether
anything changes.

Thanks.
-- 
tejun

[-- Attachment #2: propagate-timeout-to-host-link.patch --]
[-- Type: text/x-patch, Size: 1213 bytes --]

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index f0124a8..74269ed 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1733,11 +1733,15 @@ static void ata_eh_link_autopsy(struct ata_link *link)
 		ehc->i.action &= ~ATA_EH_PERDEV_MASK;
 	}
 
-	/* consider speeding down */
+	/* propagate timeout to host link */
+	if ((all_err_mask & AC_ERR_TIMEOUT) && !ata_is_host_link(link))
+		ap->link.eh_context.i.err_mask |= AC_ERR_TIMEOUT;
+
+	/* record error and consider speeding down */
 	dev = ehc->i.dev;
-	if (!dev && ata_link_max_devices(link) == 1 &&
-	    ata_dev_enabled(link->device))
-		dev = link->device;
+	if (!dev && ((ata_link_max_devices(link) == 1 &&
+		      ata_dev_enabled(link->device))))
+		dev = link->device;
 
 	if (dev)
 		ehc->i.action |= ata_eh_speed_down(dev, is_io, all_err_mask);
@@ -1759,8 +1763,14 @@ void ata_eh_autopsy(struct ata_port *ap)
 {
 	struct ata_link *link;
 
-	__ata_port_for_each_link(link, ap)
+	ata_port_for_each_link(link, ap)
 		ata_eh_link_autopsy(link);
+
+	/* Autopsy of fanout ports can affect host link autopsy.
+	 * Perform host link autopsy last.
+	 */
+	if (ap->nr_pmp_links)
+		ata_eh_link_autopsy(&ap->link);
 }
 
 /**
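The ordering change in this patch can be illustrated outside the kernel. What follows is a simplified user-space sketch, not the libata code: `struct link`, `struct port`, `link_autopsy`, `port_autopsy` and `MAX_PMP_LINKS` are hypothetical stand-ins, and only the timeout bit is modelled. It shows why the host link is examined last: a timeout seen on a fan-out link is OR-ed into the host link's error mask, so the host link autopsy only sees it if it runs after all fan-out links have been processed.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_TIMEOUT   (1u << 2)  /* same bit as "Emask 0x4 (timeout)" in the logs */
#define MAX_PMP_LINKS 15         /* assumed upper bound for this sketch */

/* Hypothetical, heavily reduced stand-in for a link's error state. */
struct link {
	unsigned int err_mask;  /* errors recorded for this link */
	int saw_timeout;        /* result of the autopsy */
};

/* Hypothetical stand-in for a port with a host link and fan-out links. */
struct port {
	struct link host;                /* link between controller and multiplier */
	struct link pmp[MAX_PMP_LINKS];  /* fan-out links behind the multiplier */
	size_t nr_pmp_links;
};

/* Examine one link; propagate a fan-out timeout to the host link. */
static void link_autopsy(struct port *ap, struct link *l)
{
	int is_host = (l == &ap->host);

	if ((l->err_mask & ERR_TIMEOUT) && !is_host)
		ap->host.err_mask |= ERR_TIMEOUT;

	l->saw_timeout = (l->err_mask & ERR_TIMEOUT) != 0;
}

/* Fan-out links first, host link last, so propagated bits are seen. */
static void port_autopsy(struct port *ap)
{
	size_t i;

	assert(ap != NULL && ap->nr_pmp_links <= MAX_PMP_LINKS);

	for (i = 0; i < ap->nr_pmp_links; i++)
		link_autopsy(ap, &ap->pmp[i]);
	link_autopsy(ap, &ap->host);
}
```

If the host link were examined first, as with the iteration order the patch replaces, a timeout on a fan-out link would be recorded on the host link only after the host link had already been examined.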
* Re: [Fwd: Re: libata , Silicon Image 3124]
  2008-01-08  3:47   ` Tejun Heo
@ 2008-01-09 17:31     ` Kasimir Müller
  2008-01-10  4:08       ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Kasimir Müller @ 2008-01-09 17:31 UTC
To: Tejun Heo; +Cc: IDE/ATA development list, christian.kuehn

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

On 08.01.2008 04:47, Tejun Heo wrote:
> Kasimir Müller wrote:
>> Hi Tejun,
>>
>> Old communication appended below.
>>
>> I wish you a Happy Xmas and a successful New Year.
>>
>> I spent some time during Christmas to further investigate the problem. I
>> bought a new 500GB disk and put all data on this disk.
>> This is also continuously watched by nagios and cacti.
>> Then:
>> 1.) All 5 disks in the external case connected via Portmapper and sil24
>>     card have excellent health-status with smartd.
>> 2.) I get no(!!!!) errors at all if I use the disks as single drives or
>>     with lvm. I verified this by copying large amounts of data (100-200GB)
>>     with rsync, cp -av and running bonnie++, single and simultaneously,
>>     to various combinations of drives.
>> 3.) I get the errors as soon as I use raid. Same errors with raid0 (2
>>     disks), raid1 (2 disks) and raid5 (3 disks) in any combination of the
>>     drives.
>> 4.) The errors usually appear first during mkfs (same with ext3 and
>>     reiserfs) and then after writing about 10-50 GB to the raid, and then
>>     repeat at 5 to 10 minute intervals according to the disk activity.
>> 5.) I used kernel 2.6.23.1 with your latest patch: same result.
>> 6.) I used kernel 2.6.24-rc6: same result.
>> 7.) During the tests I marked all files with md5-sums: no data
>>     corruption (!!!), so maybe I can live with it.
>
> Please apply the attached patch on top of 2.6.24-rc6 and report whether
> anything changes.
>
> Thanks.

Hi Tejun,

thanks very much again for your time and readiness to help.

I applied the patch as you proposed without problem.

I attach the kernel logs of the past few days. As far as I can see, with
the patched kernel errors appear only under heavy load.

If you want, I could provide a (temporary) ssh-login to my machine.
Maybe it would make it easier for you to debug the problem.
If you want to use it, I need your public ssh-certificate and your
login-name and I would send the login-details to you.

Thanks again,
yours Kasimir.

[-- Attachment #2: kernel.log.gz --]
[-- Type: application/x-gzip, Size: 99041 bytes --]
* Re: [Fwd: Re: libata , Silicon Image 3124]
  2008-01-09 17:31     ` Kasimir Müller
@ 2008-01-10  4:08       ` Tejun Heo
  0 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2008-01-10  4:08 UTC
To: Kasimir Müller; +Cc: IDE/ATA development list, christian.kuehn

Hello,

Kasimir Müller wrote:
> thanks very much again for your time and readiness to help.
>
> I applied the patch as you proposed without problem.
>
> I attach the kernel logs of the past few days. As far as I can see, with
> the patched kernel errors appear only under heavy load .
>
> If you want, I could provide a (temporary) ssh-login to my machine.
> Maybe it would make it easier for you to debug the problem.
> If you want to use it, I need your public ssh-certificate and your
> login-name and I would send the login-details to you.

Your log is pretty interesting.  Lots of timeouts are on FLUSH and
previously other non-data commands like CHECK POWER or SMART, which is
very unusual.  I wonder what's going on.

Yeap, I wanna log in and take a look.  I'll send you my pubkey in a
private mail.

Thanks.

-- 
tejun
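A tally of timed-out commands by opcode makes the pattern described in this reply easier to see. The sketch below is illustrative only: `ata_cmd_name` and `count_opcode` are hypothetical helpers, the opcode values are the standard ATA ones (0xEA FLUSH CACHE EXT, 0xE7 FLUSH CACHE, 0xE5 CHECK POWER MODE, 0xB0 SMART), and the input array is assumed to have been extracted from the kernel log by other means.

```c
#include <assert.h>
#include <stddef.h>

/* Map a few ATA command opcodes to readable names. */
static const char *ata_cmd_name(unsigned int opcode)
{
	switch (opcode) {
	case 0xea: return "FLUSH CACHE EXT";
	case 0xe7: return "FLUSH CACHE";
	case 0xe5: return "CHECK POWER MODE";
	case 0xb0: return "SMART";
	default:   return "other";
	}
}

/* Count how many of the n entries in ops equal the given opcode. */
static size_t count_opcode(const unsigned int *ops, size_t n,
			   unsigned int opcode)
{
	size_t i, hits = 0;

	assert(ops != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (ops[i] == opcode)
			hits++;
	return hits;
}
```

Calling `count_opcode` once per opcode of interest and printing the result next to `ata_cmd_name(opcode)` gives a per-command count of timeouts.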