Linux ATA/IDE development
 help / color / mirror / Atom feed
* Re: [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then
       [not found] <bug-14518-10286@http.bugzilla.kernel.org/>
@ 2009-10-31  7:17 ` Andrew Morton
  2009-10-31  7:27   ` Martin Steigerwald
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Andrew Morton @ 2009-10-31  7:17 UTC (permalink / raw)
  To: Martin; +Cc: bugzilla-daemon, bugme-daemon, linux-ide, Jens Axboe


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 31 Oct 2009 06:59:07 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14518
> 
>            Summary: I/O appears to get stuck on certain rsync backup job
>                     and system clock halts then
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.31.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Block Layer
>         AssignedTo: axboe@kernel.dk
>         ReportedBy: Martin@Lichtvoll.de
>         Regression: Yes
> 
> 
> I sorted this to generic block problems, might be an issue with the sata_sil
> driver tough.

Yup, it looks more like an ata issue.

Or it might be that the errors are due to real hardware problems and
triggered not-well-tested code paths in block/scsi/VFS/etc.

Could someone take a look please?


> Happening with and a similar 2.6.31.3 kernel:
> shambhala:~> cat /proc/version
> Linux version 2.6.31.5-tp42-toi-3.0.1-04850-g4eddd0d (martin@shambhala) (gcc
> version 4.3.4 (Debian 4.3.4-5) ) #5 PREEMPT Sun Oct 25 18:39:35 CET 2009
> 
> Yes, the kernel contains TuxOnIce, but I believe thats completely unrelated to
> this issue. I think this is a regression, but since it appears that this does
> not happen on my usual desktop workloads, but it only happened during a rsync
> backup comparison with its original, I am not sure.
> 
> Summery: The machine appeared to got stuck in I/O for three times. Two has been
> accompanied with UDMA CRC errors and thus might be explainable from that.
> Although I believe even in that case the machine should respond to something.
> Why is it that I/O problems with a eSATA drive can make to *complete system*
> unusable even while the internal laptop harddisk is just spinning fine? IMHO on
> a decend operating system this should not happen. On the third occasion there
> have been no errors related to this incident in the log.
> 
> This happened on a ThinkPad T42 with:
> shambhala:~> lspci -nn
> 00:00.0 Host bridge [0600]: Intel Corporation 82855PM Processor to I/O
> Controller [8086:3340] (rev 03)
> 00:01.0 PCI bridge [0604]: Intel Corporation 82855PM Processor to AGP
> Controller [8086:3341] (rev 03)
> 00:1d.0 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM
> (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 [8086:24c2] (rev 01)
> 00:1d.1 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM
> (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 [8086:24c4] (rev 01)
> 00:1d.2 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM
> (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 [8086:24c7] (rev 01)
> 00:1d.7 USB Controller [0c03]: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2
> EHCI Controller [8086:24cd] (rev 01)
> 00:1e.0 PCI bridge [0604]: Intel Corporation 82801 Mobile PCI Bridge
> [8086:2448] (rev 81)
> 00:1f.0 ISA bridge [0601]: Intel Corporation 82801DBM (ICH4-M) LPC Interface
> Bridge [8086:24cc] (rev 01)
> 00:1f.1 IDE interface [0101]: Intel Corporation 82801DBM (ICH4-M) IDE
> Controller [8086:24ca] (rev 01)
> 00:1f.3 SMBus [0c05]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M)
> SMBus Controller [8086:24c3] (rev 01)
> 00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801DB/DBL/DBM
> (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller [8086:24c5] (rev 01)
> 00:1f.6 Modem [0703]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M)
> AC'97 Modem Controller [8086:24c6] (rev 01)
> 01:00.0 VGA compatible controller [0300]: ATI Technologies Inc RV350 [Mobility
> Radeon 9600 M10] [1002:4e50]
> 02:00.0 CardBus bridge [0607]: Texas Instruments PCI4520 PC card Cardbus
> Controller [104c:ac46] (rev 01)
> 02:00.1 CardBus bridge [0607]: Texas Instruments PCI4520 PC card Cardbus
> Controller [104c:ac46] (rev 01)
> 02:01.0 Ethernet controller [0200]: Intel Corporation 82540EP Gigabit Ethernet
> Controller (Mobile) [8086:101e] (rev 03)
> 02:02.0 Network controller [0280]: Intel Corporation PRO/Wireless 2200BG
> [Calexico2] Network Connection [8086:4220] (rev 05)
> 03:00.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3512
> [SATALink/SATARaid] Serial ATA Controller [1095:3512] (rev 01)
> 
> (the latter is a PCMCIA card I use for accessing my eSATA devices)
> 
> The first two times
> ===================
> 
> First there where problems on a backup. I got UDMA CRC on one harddisk:
> 
> Oct 15 09:37:16 shambhala kernel: ata3: SATA link up 1.5 Gbps (SStatus 113
> SControl 310)
> Oct 15 09:37:16 shambhala kernel: ata3.00: ATA-8: Hitachi HTS545050B9A300,
> PB4OC60G, max UDMA/133
> Oct 15 09:37:16 shambhala kernel: ata3.00: 976773168 sectors, multi 0: LBA48
> NCQ (depth 0/32)
> Oct 15 09:37:16 shambhala kernel: ata3.00: configured for UDMA/100
> Oct 15 09:37:16 shambhala kernel: scsi 2:0:0:0: Direct-Access     ATA     
> Hitachi HTS54505 PB4O PQ: 0 ANSI: 5
> Oct 15 09:37:16 shambhala kernel: sd 2:0:0:0: Attached scsi generic sg2 type 0
> Oct 15 09:37:16 shambhala kernel: sd 2:0:0:0: [sdb] 976773168 512-byte logical
> blocks: (500 GB/465 GiB)
> Oct 15 09:37:16 shambhala kernel: sd 2:0:0:0: [sdb] Write Protect is off
> Oct 15 09:37:16 shambhala kernel: sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> Oct 15 09:37:16 shambhala kernel: sd 2:0:0:0: [sdb] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> Oct 15 09:37:16 shambhala kernel: sdb: sdb1 sdb2 sdb3
> Oct 15 09:37:16 shambhala kernel: sd 2:0:0:0: [sdb] Attached SCSI disk
> 
> This is my 500 GB 2,5 inch eSATA disk (500 GB). I copied from to 3 3,5 inch
> eSATA disk (1TB).
> 
> Like these:
> Oct 15 09:45:16 shambhala kernel: ata3: drained 32768 bytes to clear DRQ.
> Oct 15 09:45:16 shambhala kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr
> 0x280000 action 0x6 frozen
> Oct 15 09:45:16 shambhala kernel: ata3: SError: { 10B8B BadCRC }
> Oct 15 09:45:16 shambhala kernel: ata3.00: cmd
> 25/00:08:5d:bf:8f/00:01:13:00:00/e0 tag 0 dma 135168 in
> Oct 15 09:45:16 shambhala kernel:         res
> ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x2 (HSM violation)
> Oct 15 09:45:16 shambhala kernel: ata3.00: status: { Busy }
> Oct 15 09:45:16 shambhala kernel: ata3.00: error: { ICRC UNC IDNF ABRT }
> Oct 15 09:45:16 shambhala kernel: ata3: hard resetting link
> Oct 15 09:45:17 shambhala kernel: ata3: SATA link up 1.5 Gbps (SStatus 113
> SControl 310)
> Oct 15 09:45:17 shambhala kernel: ata3.00: configured for UDMA/100
> Oct 15 09:45:17 shambhala kernel: ata3: EH complete
> Oct 15 09:46:29 shambhala kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr
> 0x280000 action 0x6 frozen
> Oct 15 09:46:29 shambhala kernel: ata3: SError: { 10B8B BadCRC }
> Oct 15 09:46:29 shambhala kernel: ata3.00: cmd
> 25/00:08:5d:0f:c6/00:01:14:00:00/e0 tag 0 dma 135168 in
> Oct 15 09:46:29 shambhala kernel:         res
> 40/00:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x4 (timeout)
> Oct 15 09:46:29 shambhala kernel: ata3.00: status: { DRDY }
> Oct 15 09:46:29 shambhala kernel: ata3: hard resetting link
> Oct 15 09:46:29 shambhala kernel: ata3: SATA link up 1.5 Gbps (SStatus 113
> SControl 310)
> Oct 15 09:46:29 shambhala kernel: ata3.00: configured for UDMA/100
> Oct 15 09:46:29 shambhala kernel: ata3: EH complete
> Oct 15 09:47:43 shambhala kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr
> 0x280000 action 0x6 frozen
> Oct 15 09:47:43 shambhala kernel: ata3: SError: { 10B8B BadCRC }
> Oct 15 09:47:43 shambhala kernel: ata3.00: cmd
> 25/00:08:5d:9b:f5/00:01:0e:00:00/e0 tag 0 dma 135168 in
> Oct 15 09:47:43 shambhala kernel:         res
> 40/00:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x4 (timeout)
> Oct 15 09:47:43 shambhala kernel: ata3.00: status: { DRDY }
> Oct 15 09:47:43 shambhala kernel: ata3: hard resetting link
> Oct 15 09:47:43 shambhala kernel: ata3: SATA link up 1.5 Gbps (SStatus 113
> SControl 310)
> Oct 15 09:47:43 shambhala kernel: ata3.00: configured for UDMA/100
> Oct 15 09:47:43 shambhala kernel: ata3: EH complete
> Oct 15 09:48:57 shambhala kernel: ata3.00: limiting speed to UDMA/66:PIO4
> Oct 15 09:48:57 shambhala kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr
> 0x280000 action 0x6 frozen
> Oct 15 09:48:57 shambhala kernel: ata3: SError: { 10B8B BadCRC }
> Oct 15 09:48:57 shambhala kernel: ata3.00: cmd
> c8/00:f8:65:4a:0d/00:00:00:00:00/ef tag 0 dma 126976 in
> Oct 15 09:48:57 shambhala kernel:         res
> 40/00:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x4 (timeout)
> Oct 15 09:48:57 shambhala kernel: ata3.00: status: { DRDY }
> 
> I let the backup continue and it finished after a long time. libata reduced DMA
> speeds several times during that backup sessions. I have never seen this before
> when working with eSATA disks. Complete log available on request.
> 
> This might have had hardware related reasons: The Delock eSATA casing seems to
> have become flaky. I had sudden power losses on the harddisks recently. There
> seemed to be contact problems with the power cable at least. I replaced the
> casing and cables a few days ago.
> 
> As this has happened initially - the first time - I had the KDE session locked
> and the machine did not respond to any key press or mouse movement. It did not
> unlock the KDE session and it it not switch to Ctrl-Alt-Fx to switch to a TTY.
> On the second try this happened, I unplugged the PCMCIA eSATA controller and
> the machine suddenly reacted again. Later I found out that the time got stuck.
> The clock was going several hours to late.
> 
> On the second time I let KDE unlocked and it completed the backup after a long
> time.
> 
> Now with the new case for that I updated that backup and all went well this
> time.
> 
> 
> The third time
> ==============
> 
> Then I wanted to make sure that the backup on the last time was correct. Thus I
> dared a:
> 
> shambhala:~> rsync -a -c --acls --xattrs --sparse --hard-links --del -P
> /mnt/amazon-daten/ /mnt/zauberwald-daten/
> sending incremental file list                                                   
> 
> For some hours before going to bed this appeared to work nicely. rsync just
> compared checksums of old and new files.
> 
> Then I went to back and in the morning again my laptop didn't react to
> anything. I have it quite some time - about 30 seconds but it didn't unlock KDE
> nor switch to a TTY.
> 
> As I want to avoid a complete reboot I then unplugged both SATA disks. Nothing.
> Then I removed the PCMCIA SATA card and suddenly the machine was usable again.
> 
> Of cause the rsync complained:
> 
> file has vanished: "/mnt/amazon-daten/VirtualBox-Images/OpenSUSE 11.1.vdi"      
> file has vanished: "/mnt/amazon-daten/VirtualBox-Images/Kubuntu Jaunty
> Template.vdi"                             
> file has vanished: "/mnt/amazon-daten/VirtualBox-Images/Debian Lenny.vdi"       
> file has vanished:
> "/mnt/amazon-daten/VirtualBox-Images/Debian-Etch-Template.vdi"                  
> file has vanished: "/mnt/amazon-daten/VirtualBox-Images/Superkonqi.vdi"         
> file has vanished: "/mnt/amazon-daten/VirtualBox-Images/OpenSUSE 11.1
> Template.vdi"                              
> Spiele/ScummVM Spiele/Indiana Joney - Fate of Atlantis/                         
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***                        
> somedirectory                                                                   
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***                        
> Spiele/ScummVM Spiele/Monkey Island 1/                                          
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***                        
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***                        
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***                        
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***                        
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)                                                      
> *** Skipping any contents from this failed directory ***
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> Spiele/ScummVM Spiele/brokensword1/
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> somedirectory
> rsync: recv_generator: mkdir "/mnt/zauberwald-daten/somedirectory" failed:
> Read-only file system (30)
> *** Skipping any contents from this failed directory ***
> rsync: readdir("/mnt/zauberwald-daten/VirtualBox-Images"): Input/output error
> (5)
> VirtualBox-Images/Fedora-11.vdi
>        32768   0%    0.00kB/s    0:00:00
> Message from syslogd@shambhala at Oct 31 00:49:52 ...
>  kernel:journal commit I/O error
>   4768080384 100%   65.60MB/s    0:01:09 (xfer#1, to-check=1/61674)
> rsync: read errors mapping "/mnt/amazon-daten/VirtualBox-Images/Fedora-11.vdi":
> Input/output error (5)
> VirtualBox-Images/Kubuntu Jaunty.vdi
>   5098381824 100%   68.16MB/s    0:01:11 (xfer#2, to-check=0/61674)
> rsync: read errors mapping "/mnt/amazon-daten/VirtualBox-Images/Kubuntu
> Jaunty.vdi": Input/output error (5)
> IO error encountered -- skipping file deletion
> rsync: mkstemp "/mnt/zauberwald-daten/VirtualBox-Images/.Fedora-11.vdi.sV8k6c"
> failed: Read-only file system (30)
> rsync: mkstemp "/mnt/zauberwald-daten/VirtualBox-Images/.Kubuntu
> Jaunty.vdi.fEIcS4" failed: Read-only file system (30)
> 
> sent 9870164565 bytes  received 3602 bytes  1441426.53 bytes/sec
> total size is 178703414972  speedup is 18.11
> rsync error: some files/attrs were not transferred (see previous errors) (code
> 23) at main.c(1057) [sender=3.0.6]
> 
> This brought some more UDMA CRC errors into the SMART LOG of my 500 GB eSATA
> drive. Good that this is only an old age attribute. Anyway, both drives are
> rather new and used to work flawlessly up to these incidents. SMART status is
> "PASSED" on both drives and the 1TB 3.5 inch drive completed a long self test
> just fine recently. I will schedule one more test on that drive and also one of
> the 500 GB drive just to make sure, drives are okay.
> 
> I looked into syslog, but this only showed the errors from my unplug and eSATA
> controller removal action. There is not a single sign of an error before that
> incident. Complete log available on request. Or I might just decide to attach
> it right after completing this test.
> 
> The machine appeared to have been completely frozen:
> 
> shambhala:~> date
> Sa 31. Okt 01:14:00 CET 2009
> shambhala:~> /etc/init.d/openntpd stop
> Stopping openntpd: ntpd.
> shambhala:~> ntpdate de.pool.ntp.org
> 31 Oct 07:19:55 ntpdate[21953]: step time server 188.40.77.71 offset
> 21931.103558 sec
> shambhala:~> /etc/init.d/openntpd start
> Starting openntpd: ntpd.
> shambhala:~> date
> Sa 31. Okt 07:20:05 CET 2009
> 
> Yes, I checked the machine this morning, but it still thought it was still in
> the night. 01:14 was just about 30 or 45 minutes after I went to bed. And the
> rsync job went fine for several hours prior to that. KDE session was blanked
> several times but I was always able to unblank it and see that rsync was
> spinning on its checksum verify job. As long as it was running in didn't copy a
> single file so the checksum appeared to be correct.
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then
  2009-10-31  7:17 ` [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then Andrew Morton
@ 2009-10-31  7:27   ` Martin Steigerwald
  2009-10-31  7:31   ` Martin Steigerwald
  2009-10-31 11:48   ` Alan Cox
  2 siblings, 0 replies; 6+ messages in thread
From: Martin Steigerwald @ 2009-10-31  7:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: bugzilla-daemon, bugme-daemon, linux-ide, Jens Axboe, Ingo Molnar

[-- Attachment #1: Type: Text/Plain, Size: 2037 bytes --]

Am Samstag 31 Oktober 2009 schrieb Andrew Morton:
> (switched to email.  Please respond via emailed reply-to-all, not via
>  the bugzilla web interface).
> 
> On Sat, 31 Oct 2009 06:59:07 GMT bugzilla-daemon@bugzilla.kernel.org 
wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=14518
> >
> >            Summary: I/O appears to get stuck on certain rsync backup
> > job and system clock halts then
> >            Product: IO/Storage
> >            Version: 2.5
> >     Kernel Version: 2.6.31.5
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Block Layer
> >         AssignedTo: axboe@kernel.dk
> >         ReportedBy: Martin@Lichtvoll.de
> >         Regression: Yes
> >
> >
> > I sorted this to generic block problems, might be an issue with the
> > sata_sil driver tough.
> 
> Yup, it looks more like an ata issue.
> 
> Or it might be that the errors are due to real hardware problems and
> triggered not-well-tested code paths in block/scsi/VFS/etc.
> 
> Could someone take a look please?

I think I better add one information about a special tuning I applied to 
CFS. CC'ing Ingo Molnar for feedback on that one:

shambhala:~> tail -2 /etc/sysfs.conf
# Scheduler-Tuning
kernel/debug/sched_features = NO_NEW_FAIR_SLEEPERS

I am not sure whether this could interfere with the workload I described.

I believed I had some more settings in /etc/sysfs.conf, but apparently to 
due an editing error on this one has been left.

The idea for is that coming from a scheduler tuning thread that originated 
from the initial Brain Fuck Scheduler posting by Con Koliva.

AFAIK NO_NEW_FAIR_SLEEPERS is disabled on current 2.6.32-rc kernels as 
well.

Ingo might that be a contribution to the issues seen here?

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then
  2009-10-31  7:17 ` [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then Andrew Morton
  2009-10-31  7:27   ` Martin Steigerwald
@ 2009-10-31  7:31   ` Martin Steigerwald
  2009-10-31 11:48   ` Alan Cox
  2 siblings, 0 replies; 6+ messages in thread
From: Martin Steigerwald @ 2009-10-31  7:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: bugzilla-daemon, bugme-daemon, linux-ide, Jens Axboe, Ingo Molnar

[-- Attachment #1: Type: Text/Plain, Size: 1691 bytes --]

Am Samstag 31 Oktober 2009 schrieb Andrew Morton:
> (switched to email.  Please respond via emailed reply-to-all, not via
>  the bugzilla web interface).
> 
> On Sat, 31 Oct 2009 06:59:07 GMT bugzilla-daemon@bugzilla.kernel.org 
wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=14518
> >
> >            Summary: I/O appears to get stuck on certain rsync backup
> > job and system clock halts then
> >            Product: IO/Storage
> >            Version: 2.5
> >     Kernel Version: 2.6.31.5
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Block Layer
> >         AssignedTo: axboe@kernel.dk
> >         ReportedBy: Martin@Lichtvoll.de
> >         Regression: Yes
> >
> >
> > I sorted this to generic block problems, might be an issue with the
> > sata_sil driver tough.
> 
> Yup, it looks more like an ata issue.
> 
> Or it might be that the errors are due to real hardware problems and
> triggered not-well-tested code paths in block/scsi/VFS/etc.
> 
> Could someone take a look please?

Okay, I missed these as well:

shambhala:/etc> grep -A 3 "Sched" sysctl.conf
# Scheduler-Tuning
kernel.sched_latency_ns = 5000000
kernel.sched_min_granularity_ns = 100000
kernel.sched_wakeup_granularity_ns = 100000

Its confusing that some scheduler settings are in /sys and some are in 
/proc ;).

Ingo, might these have an impact on the issues seen? I can undo them.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then
  2009-10-31  7:17 ` [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then Andrew Morton
  2009-10-31  7:27   ` Martin Steigerwald
  2009-10-31  7:31   ` Martin Steigerwald
@ 2009-10-31 11:48   ` Alan Cox
  2009-10-31 12:22     ` Martin Steigerwald
  2 siblings, 1 reply; 6+ messages in thread
From: Alan Cox @ 2009-10-31 11:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin, bugzilla-daemon, bugme-daemon, linux-ide, Jens Axboe

> Yup, it looks more like an ata issue.

I see no ATA issue here


> > 25/00:08:5d:bf:8f/00:01:13:00:00/e0 tag 0 dma 135168 in
> > Oct 15 09:45:16 shambhala kernel:         res
> > ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x2 (HSM violation)
> > Oct 15 09:45:16 shambhala kernel: ata3.00: status: { Busy }
> > Oct 15 09:45:16 shambhala kernel: ata3.00: error: { ICRC UNC IDNF ABRT }

Looks like the drive fell off the bus entirely

> > I let the backup continue and it finished after a long time. libata reduced DMA
> > speeds several times during that backup sessions. I have never seen this before
> > when working with eSATA disks. Complete log available on request.

Desperately trying to get the drive to work reliably.

> > the machine suddenly reacted again. Later I found out that the time got stuck.
> > The clock was going several hours to late.

If the clock gets stuck for some reason then the block layer and ATA
timeouts are not going to work so the clock is probably the root cause.
Clock stopped sounds like an IRQ jam, and given removing the card fixed
it then possibly the drive jammed the IRQ on.

> > This brought some more UDMA CRC errors into the SMART LOG of my 500 GB eSATA
> > drive. Good that this is only an old age attribute. Anyway, both drives are

CRC errors are just logs of messages failing to get across uncorrupted -
its a sign of bad cables/power/adapters/ using SATA devices with eSATA
and not eSATA devices and the like. It's not really a sign of drive
problems.

I would say you had two problems

#1 Your eSATA cabling/power is flaky
#2 the Cardbus Sil3512 controller somehow got stuck asserting an
interrupt that wasn't cleared.

Needs the Sil3512 person to look at it. Even with flaky cabling it should
have either recovered cleanly or dropped the device.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then
  2009-10-31 11:48   ` Alan Cox
@ 2009-10-31 12:22     ` Martin Steigerwald
  2009-10-31 14:35       ` Alan Cox
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Steigerwald @ 2009-10-31 12:22 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, bugzilla-daemon, bugme-daemon, linux-ide,
	Jens Axboe

[-- Attachment #1: Type: Text/Plain, Size: 2417 bytes --]

Am Samstag 31 Oktober 2009 schrieb Alan Cox:

[...]
> > > the machine suddenly reacted again. Later I found out that the time
> > > got stuck. The clock was going several hours to late.
> 
> If the clock gets stuck for some reason then the block layer and ATA
> timeouts are not going to work so the clock is probably the root cause.
> Clock stopped sounds like an IRQ jam, and given removing the card fixed
> it then possibly the drive jammed the IRQ on.
> 
> > > This brought some more UDMA CRC errors into the SMART LOG of my 500
> > > GB eSATA drive. Good that this is only an old age attribute.
> > > Anyway, both drives are
> 
> CRC errors are just logs of messages failing to get across uncorrupted
>  - its a sign of bad cables/power/adapters/ using SATA devices with
>  eSATA and not eSATA devices and the like. It's not really a sign of
>  drive problems.
> 
> I would say you had two problems
> 
> #1 Your eSATA cabling/power is flaky

I easily believe that for the first two occurences. As I said that eSATA 
case / cabling turned out to be quite flaky later on.

But on the third try I completely replaced it. Only thing that is 
unchanged is the 1 GB eSATA drive. But then on the third case I did not 
see *any* errors in the log at all until I disconnected both drives and 
removed the PCMCIA eSATA controller.

As of my knowing it should all be eSATA cables. I used the cables that 
where delivered with the eSATA cases.

> #2 the Cardbus Sil3512 controller somehow got stuck asserting an
> interrupt that wasn't cleared.

What could be the reason for that one? Could it be that the PCMCIA card 
had to many plug / unplug cycles. Contacts look fine tough.

> Needs the Sil3512 person to look at it. Even with flaky cabling it
>  should have either recovered cleanly or dropped the device.

Yes, thats my main concern. Why did it stuck the machine for so long?

Ok, if I can help with some test I try my best to take to time for it. I 
will wait for further instructions / questions.

For now I just assume that the data in the backup is okay and just use 
rsync periodically to update the backup - that seems to work. The data is 
a bit less important than the one on the internal drive so I hope I get 
away with this ;)

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then
  2009-10-31 12:22     ` Martin Steigerwald
@ 2009-10-31 14:35       ` Alan Cox
  0 siblings, 0 replies; 6+ messages in thread
From: Alan Cox @ 2009-10-31 14:35 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Andrew Morton, bugzilla-daemon, bugme-daemon, linux-ide,
	Jens Axboe

> > #2 the Cardbus Sil3512 controller somehow got stuck asserting an
> > interrupt that wasn't cleared.
> 
> What could be the reason for that one? Could it be that the PCMCIA card 
> had to many plug / unplug cycles. Contacts look fine tough.

I have no idea.
> 
> > Needs the Sil3512 person to look at it. Even with flaky cabling it
> >  should have either recovered cleanly or dropped the device.
> 
> Yes, thats my main concern. Why did it stuck the machine for so long?

It needs someone with 3512 hardware/knowledge to work on it. 

> Ok, if I can help with some test I try my best to take to time for it. I 
> will wait for further instructions / questions.

Dump the IRQ counter (cat /proc/interrupts) somewhere before each rsync
and if it locks up again then when you remove the card take a second
copy. If its a stuck IRQ then the one IRQ will show up stupendously large
numbers.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-10-31 14:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <bug-14518-10286@http.bugzilla.kernel.org/>
2009-10-31  7:17 ` [Bugme-new] [Bug 14518] New: I/O appears to get stuck on certain rsync backup job and system clock halts then Andrew Morton
2009-10-31  7:27   ` Martin Steigerwald
2009-10-31  7:31   ` Martin Steigerwald
2009-10-31 11:48   ` Alan Cox
2009-10-31 12:22     ` Martin Steigerwald
2009-10-31 14:35       ` Alan Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox