linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* 2.6.36: Dropped interrupts in ata_piix
@ 2010-10-25 18:13 Mike Waychison
  2010-10-26  9:58 ` Tejun Heo
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Waychison @ 2010-10-25 18:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linux SCSI List

Hi Tejun,

I'm having problems reliably booting 2.6.36 on one of my development 
systems whereby it looks like the ata_piix driver isn't acknowledging 
interrupts.  I went through a bit of the recent history here, and it 
seems that things clear up for me if I revert the following two commits 
in my tree:

1c5afdf7 "libata-sff: separate out BMDMA init"
c3b28894 "libata-sff: separate out BMDMA irq handler"

I usually don't get a trace, but I did get this blurted out once on the 
console:


kinit: Mounted root (ext2 filesystem) readonly.
INIT: version 2.78 booting
[    5.419165] irq 20: nobody cared (try booting with the "irqpoll" option)
[    5.420140] Pid: 0, comm: kworker/0:1 Not tainted 2.6.36-smp-mikew 
#5gca29cdd
[    5.420140] Call Trace:
[    5.420140]  <IRQ>  [<ffffffff810b207f>] __report_bad_irq+0x3d/0x8c
[    5.420140]  [<ffffffff810b21e6>] note_interrupt+0x118/0x17e
[    5.420140]  [<ffffffff810b29be>] handle_fasteoi_irq+0xa7/0xcc
[    5.420140]  [<ffffffff81032da5>] handle_irq+0x24/0x2f
[    5.420140]  [<ffffffff81453744>] do_IRQ+0x5c/0xc3
[    5.420140]  [<ffffffff8144d853>] ret_from_intr+0x0/0xa
[    5.420140]  <EOI>  [<ffffffff81037a64>] ? mwait_idle+0x93/0x9b
[    5.420140]  [<ffffffff81037a0a>] ? mwait_idle+0x39/0x9b
[    5.420140]  [<ffffffff8102faee>] cpu_idle+0x63/0xd5
[    5.420140]  [<ffffffff8193d340>] start_secondary+0x192/0x196
[    5.420140] handlers:
[    5.420140] [<ffffffff812e26e6>] (ata_bmdma_interrupt+0x0/0x17)
[    5.420140] Disabling IRQ #20
[   34.720103] ata1: lost interrupt (Status 0x51)
[   34.724569] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 
frozen
[   34.731612] ata1.00: BMDMA stat 0x26, BMDMA stat 0x0, BMDMA stat 0x0, 
BMDMA stat 0x0, BMDMA stat 0x0
[   34.740750] ata1.00: failed command: READ DMA
[   34.745115] ata1.00: cmd c8/00:a0:f7:78:09/00:00:00:00:00/e0 tag 0 
dma 81920 in
[   34.745116]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 
0x24 (host bus error)
[   34.760490] ata1.00: status: { DRDY }
[   34.764180] ata1: soft resetting link
[   35.143059] ata1.00: configured for UDMA/133
[   35.147332] ata1.00: device reported invalid CHS sector 0
[   35.152730] ata1: EH complete

As you can see above, something looks to be wrong with ata_bmdma_interrupt.

Have you seen this problem before?

Thanks,

Mike Waychison

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: 2.6.36: Dropped interrupts in ata_piix
@ 2010-11-29 11:41 Mikko Korkalo
  0 siblings, 0 replies; 5+ messages in thread
From: Mikko Korkalo @ 2010-11-29 11:41 UTC (permalink / raw)
  To: linux-scsi; +Cc: mikew

Hi,

I am replying to the message here:
http://www.spinics.net/lists/linux-scsi/msg47723.html
(I just registered to the mailing list, so I manually added the Reply-To
header, hopefully it's correct)

It seems I have very similar symptoms with 2.6.36, I can reproduce them
easily in a live system, I could even grant shell access if necessary.
It didn't happen at all with stock ubuntu 8.04 LTS kernel
(2.6.32-25-generic-pae).

Currently, 1 out of 3 SATA hard drives are suffering from this.
I left a dd command overnight to read the drive contents to /dev/null.
One drive didn't finish while all the other drives finished with normal
speeds.
I have attached dmesg and smartctl messages.

Should I try and revert the mentioned patches, or something else?
Anyone able to provide direct links to these patches, I don't normally
involve in kernel development so I'm kind of new to this.

Here are my logs. SMART doesn't report any errors, but I see a lot of
errors in dmesg.

$ sudo smartctl -a /dev/sdc
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EADS-00S2B0
Serial Number:    WD-WCAVY0359972
Firmware Version: 04.05G04
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 29 12:18:20 2010 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (43200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   157   133   021    Pre-fail  Always       -       9108
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1198
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11337
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   172   172   000    Old_age   Always       -       86640
194 Temperature_Celsius     0x0022   109   095   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1279
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5200         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

$ dmesg|tail -30
[51214.053097] ata4.00: failed command: READ DMA EXT
[51214.055504] ata4.00: cmd 25/00:00:bf:0d:38/00:02:94:00:00/e0 tag 0 dma 262144 in
[51214.055507]          res 40/00:00:00:00:00/84:01:09:00:00/00 Emask 0x24 (host bus error)
[51214.060435] ata4.00: status: { DRDY }
[51214.062824] ata4: soft resetting link
[51214.339049] ata4.00: configured for UDMA/33
[51214.339059] ata4.00: device reported invalid CHS sector 0
[51214.339117] ata4: EH complete
[51253.024028] ata4: lost interrupt (Status 0x51)
[51253.024050] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[51253.026504] ata4.00: BMDMA stat 0x26, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0
[51253.029131] ata4.00: failed command: READ DMA EXT
[51253.031523] ata4.00: cmd 25/00:00:bf:49:38/00:01:94:00:00/e0 tag 0 dma 131072 in
[51253.031526]          res 40/00:00:00:00:00/84:01:09:00:00/00 Emask 0x24 (host bus error)
[51253.036463] ata4.00: status: { DRDY }
[51253.038863] ata4: soft resetting link
[51253.339048] ata4.00: configured for UDMA/33
[51253.339055] ata4.00: device reported invalid CHS sector 0
[51253.339066] ata4: EH complete
[51288.008028] ata4: lost interrupt (Status 0x51)
[51288.008050] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[51288.010468] ata4.00: BMDMA stat 0x26, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0
[51288.013096] ata4.00: failed command: READ DMA EXT
[51288.015506] ata4.00: cmd 25/00:00:bf:5f:38/00:01:94:00:00/e0 tag 0 dma 131072 in
[51288.015509]          res 40/00:00:00:00:00/84:01:09:00:00/00 Emask 0x24 (host bus error)
[51288.020517] ata4.00: status: { DRDY }
[51288.022906] ata4: soft resetting link
[51288.339041] ata4.00: configured for UDMA/33
[51288.339049] ata4.00: device reported invalid CHS sector 0
[51288.339059] ata4: EH complete

My kernel is stock except for BFS scheduler patches. I don't see how that could cause this.

Best Regards
Mikko Korkalo



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-11-29 11:49 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-25 18:13 2.6.36: Dropped interrupts in ata_piix Mike Waychison
2010-10-26  9:58 ` Tejun Heo
2010-10-26 18:08   ` Mike Waychison
2010-10-27  2:33     ` Mike Waychison
  -- strict thread matches above, loose matches on Subject: below --
2010-11-29 11:41 Mikko Korkalo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).