On 03/26/2009 12:20 AM, James Bottomley wrote:
> On Wed, 2009-03-25 at 02:06 -0400, Jeff Garzik wrote:
>> Niel Lambrechts wrote:
>>> On 03/25/2009 03:30 AM, Theodore Tso wrote:
>>>> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
>>>>> Hi,
>>>>>
>>>>> After upgrading to 2.6.29 I get the below errors after resuming from
>>>>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>>>>> in 2.6.28.9-pae, but do not get any errors there.
>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>> errors are about I/O errors.
>>>>
>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>>>
>>>> This looks more like a hibernation problem than an ext4 problem.
>>>> Looks like the hard drive is being left in some inconsistent state
>>>> after resuming from hibernation.
>>>>
>>>>      	   	       		   	   - Ted
>>> Thanks for the info Theodore, this is definitely looks like some type of
>>> regression in 2.6.29, as the problem is not evident when I s2disk using
>>> 2.6.28.9, even after multiple suspend/resume cycles.
>>>
>>> I found some 'ATA bus errors' and 'SError' messages in
>>> /var/log/messages, so I've attached the messages from both 2.6.29 and
>>> 2.6.28 for comparison.
>> Well, here is the interpretation of messages:
>>
>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>> ata1: SError: { PHYRdyChg CommWake }
>> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"), 
>> which requires us to abort a bunch of queued commands:
>>
>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>>          res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>> [...]
>>> ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
>>>          res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>>   ...through the 14th command (tag 13).
>>
>>> Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
>>> Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>>
>> SATA link is reset, and ACPI is re-run.
>>
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
>>> Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
>>> Mar 24 21:29:14 linux-7vph kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
>>> Mar 24 21:29:14 linux-7vph kernel:         0c 10 b3 07 
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
>>> Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
>> The SCSI subsystem aborts each of the queued commands.
> 
> No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
> sense data for each of the outstanding I/Os
> 
> The only place these are generated is in ata_sense_to_error() which only
> occurs if there's some type of ata error.
> 
> If I had to theorise, I'd say the system suspended with commands
> outstanding to the device.  On resume, the device gets reset and returns
> some type of ATA error which gets translated to ABORTED COMMAND which
> causes a failure.
> 
> In the mid layer, we translate ABORTED_COMMAND into a retry until the
> command runs out of them ... could it be there's a race readying the
> device and we run through the retries before it can accept the command?

I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role in
2.6.29, perhaps due to the async. changes that was mentioned?


In particular, I notice some the following before "EH complete":
ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
ata1: irq_stat 0x00400040, connection status changed

but before this last message there is a
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

Why would the connection status change if a moment ago it was "link up"?

Niel