linux-ide.vger.kernel.org archive mirror
From: Levente Kurusa <levex@linux.com>
To: Robert Hancock <hancockrwd@gmail.com>
Cc: "linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>
Subject: Re: [PATCH] BIOS SATA legacy mode failure
Date: Sat, 28 Sep 2013 19:46:40 +0200
Message-ID: <52471600.4090908@linux.com>
In-Reply-To: <CADLC3L1HdD4xV64AWN0bO-2GizUMhwfwY5PZsLZN0eQ=4yFyXA@mail.gmail.com>

On 2013-09-28 06:55, Robert Hancock wrote:
> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
>> On 2013-09-25 08:31, Robert Hancock wrote:
>>
>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>
>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>
>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>> dmesg:
>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>
>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>>>>>>                                   ehc->i.action, frozen, tries_buf);
>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>>>>>>> +               /**
>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>> +                 */
>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>               } else {
>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>               u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>
>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of speed_downs */
>>>>>>>>>>>> +       int                     exce_cnt; /* Number of exceptions that happened */
>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>        };
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>>>>>>>> infinite loop but will just prevent that device from functioning at
>>>>>>>>>>> all. It would be better if we could figure out what was actually
>>>>>>>>>>> going wrong.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>> switched to legacy/IDE/compatibility mode, and they didn't have
>>>>>>>>>> this problem. Of course, they could have been set to AHCI mode, and
>>>>>>>>>> there the kernel would boot normally. Feels strange, but so far I
>>>>>>>>>> was only able to reproduce the problem with a Toshiba MK8052GSX. On
>>>>>>>>>> the topic of my patch, I still don't see why a device which fails
>>>>>>>>>> so terribly that it reports 3 exceptions shouldn't be disabled.
>>>>>>>>>> Like in this case, it could cause infinite loops.
>>>>>>>>>
>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>> wouldn't want to disable the device, like an error that just
>>>>>>>>> happens sporadically and works on retry, or a device you're trying
>>>>>>>>> to recover data from.
>>>>>>>>>
>>>>>>>> What do you think if I edit the patch so that, when an operation
>>>>>>>> successfully completes, it resets exce_cnt to zero? I might as well
>>>>>>>> add a module_param that sets the maximum value of exce_cnt, with
>>>>>>>> zero as an option to never disable the device. Please don't get me
>>>>>>>> wrong, I don't want to force this patch, I just want to learn how
>>>>>>>> all this works, and in the process try to make it better. :-)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That would be better, but I think you're still going to have an issue
>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>> inappropriately.
>>>>>>>
>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>> If someone in userspace wants to keep trying to read from that device,
>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>> that process is only hurting itself. If the system fails to boot from
>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>> serialized.
>>>>>>>
>>>>>>
>>>>>> I have been booting up with the initramfs from Ubuntu 13.04,
>>>>>> and I have also tried to boot with the Ubuntu install CD. Neither could
>>>>>> continue the boot process. I'm gonna spend the weekend trying to figure
>>>>>> out where and why the interrupts don't happen, and whether it is a
>>>>>> routing issue or a hardware issue. I highly doubt the latter, since
>>>>>> Windows XP SP2 was able to boot up without errors.
>>>>>
>>>>>
>>>>>
>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>> contents of /proc/interrupts?
>>>>>
>>>> As I said before, I am not able to get to the shell, without my 'symptom
>>>> cure'. With my patch I get the following dmesg output, with
>>>> some of my debug messages turned off:
>>>> http://pastebin.com/5eb5G3Dx
>>>> /proc/interrupts is here:
>>>> http://pastebin.com/84CJey2D
>>>> After yesterday's research, I have arrived at ata_piix.c. That file
>>>> looks like the real culprit, as my netbook's controller is an Intel
>>>> ICH7M one. The values I am getting from the device are very different
>>>> from those that are expected.
>>>>
>>>> Things I have noticed, but ignored in dmesg:
>>>> There is a stack dump, because nobody cared about IRQ#20. I have ignored
>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do with
>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>> with /dev/sda, which works fine.
>>>
>>>
>>> I think it is likely related to the problem. The kernel thinks this
>>> controller is on IRQ 16, but apparently something is raising
>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>> 16. It seems quite likely that this is actually the ATA controller.
>>>
>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>> different which might mask the problem. Do you know what IRQ Device
>>> Manager reported for this controller in Windows? And was it using any
>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>
>>
>> Hmm, according to WinXP's Device Manager, this controller
>> listens on IRQ#20, and therefore it is using the I/O APIC.
>> Now, one question remains: where is the error that mismaps
>> the controller?
>> I have created a simple patch which seems to fix this:
>> ---
>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>                                                  piix_map_db_table[ent->driver_data]);
>>
>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>> +               pdev->irq = 20;
>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>          if (rc)
>>                  return rc;
>>
>> However, I am more than sure that this is not the way
>> to solve this problem. Do you have any idea where
>> the ideal place to implement a fix would be?
>> According to the specs of the ICH7M, which is essentially the
>> same as the ICH6M, we need to check which interrupt pin
>> the SATA controller uses, then check which IRQ line that pin
>> is routed to on the I/O APIC, and derive the IRQ number
>> from those findings.
>>
>> Specs of ICH7:
>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>> Device 31 Interrupt Route Register: Chapter 7.1.46
>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>
>> The SATA controller is always Device 31.
>
> It would appear that something is messing up with the ACPI IRQ routing
> on this machine that's causing us to think the controller is on the
> wrong IRQ. CCing the linux-acpi list to see if anyone has some
> additional debugging suggestions. I suspect that dumping the DSDT is
> likely the first step though. If you can get IASL installed, you can
> do something like:
>
> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
> iasl -d dsdt.aml
>
> That should spit out a dsdt.dsl file which would hopefully have the
> info needed to figure out what's going on.
>

Here is the disassembled DSDT table:
http://pastebin.com/LWNVht9H
The SATA controller is at line 5206.
I also disassembled the SSDT, but nothing interesting was there:
http://pastebin.com/fus5sxU8

I disabled ACPI IRQ routing with acpi=noirq, and the system
booted successfully, with the controller ending up on IRQ#3.
This makes me think that this is the BIOS's fault.
I think it would be possible to create a DMI check
and forcibly set the irq to 20 if the DMI matches.
Any comments on this?
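
To make the idea concrete, here is a rough sketch of the matching logic in
plain user-space C, so that it compiles on its own. Everything in it is an
illustration rather than real driver code: struct fake_pci_dev, struct
irq_quirk, apply_irq_quirk() and the DMI strings are hypothetical names and
placeholders. Only the PCI ID (8086:27C4) and IRQ 20 come from the test patch
above. In the kernel, the matching would presumably be done with a
struct dmi_system_id table and dmi_check_system() instead of strcmp().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the few struct pci_dev fields used here. */
struct fake_pci_dev {
	unsigned short vendor;
	unsigned short device;
	unsigned int irq;
};

/* Hypothetical quirk entry: one affected machine plus the IRQ to force. */
struct irq_quirk {
	const char *sys_vendor;   /* DMI system vendor to match */
	const char *product_name; /* DMI product name to match */
	unsigned short vendor;    /* PCI vendor ID of the controller */
	unsigned short device;    /* PCI device ID of the controller */
	unsigned int irq;         /* IRQ to use when everything matches */
};

/* Placeholder DMI strings; the real ones would be read from the netbook. */
static const struct irq_quirk irq_quirks[] = {
	{ "EXAMPLE VENDOR", "EXAMPLE NETBOOK", 0x8086, 0x27C4, 20 },
};

/*
 * Rewrite dev->irq only if both the machine (DMI strings) and the
 * controller (PCI IDs) match a quirk entry. Returns true if a quirk
 * was applied, false if the device was left untouched.
 */
static bool apply_irq_quirk(struct fake_pci_dev *dev,
			    const char *sys_vendor, const char *product_name)
{
	size_t i;

	for (i = 0; i < sizeof(irq_quirks) / sizeof(irq_quirks[0]); i++) {
		const struct irq_quirk *q = &irq_quirks[i];

		if (dev->vendor == q->vendor && dev->device == q->device &&
		    strcmp(sys_vendor, q->sys_vendor) == 0 &&
		    strcmp(product_name, q->product_name) == 0) {
			dev->irq = q->irq;
			return true;
		}
	}
	return false;
}
```

The point of keying on the DMI strings as well as the PCI ID is that other
machines with the same ICH7M controller, where the ACPI routing is correct,
would keep whatever IRQ they were given.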

-- 
Regards,
Levente Kurusa

