From: Levente Kurusa <levex@linux.com>
To: Robert Hancock <hancockrwd@gmail.com>
Cc: "linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>
Subject: Re: [PATCH] BIOS SATA legacy mode failure
Date: Sun, 22 Sep 2013 09:13:35 +0200
Message-ID: <523E989F.5040800@linux.com>
In-Reply-To: <CADLC3L3WCMWc4kuJ1-_GbFinEyCABuuh3Fonh641SptsfYDaeA@mail.gmail.com>
On 2013-09-21 19:04, Robert Hancock wrote:
> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>> dmesg:
>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>> ata3: soft resetting link
>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>> ata3: EH complete
>>>>>>>>
>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>> ehc->i.action, frozen, tries_buf);
>>>>>>>> if (desc)
>>>>>>>> ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>> + ehc->i.dev->exce_cnt ++;
>>>>>>>> + ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n",
>>>>>>>> + ehc->i.dev->exce_cnt);
>>>>>>>> + /**
>>>>>>>> + * The device is failing terribly,
>>>>>>>> + * disable it to prevent damage.
>>>>>>>> + */
>>>>>>>> + if(ehc->i.dev->exce_cnt > 2)
>>>>>>>> + ata_dev_disable(ehc->i.dev);
>>>>>>>> } else {
>>>>>>>> ata_link_err(link, "exception Emask 0x%x "
>>>>>>>> "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>> --- a/include/linux/libata.h
>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>> u8 devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>
>>>>>>>> /* error history */
>>>>>>>> - int spdn_cnt;
>>>>>>>> + int spdn_cnt; /* Number of speed_downs */
>>>>>>>> + int exce_cnt; /* Number of exceptions that happenned */
>>>>>>>> /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>> struct ata_ering ering;
>>>>>>>> };
>>>>>>>>
>>>>>>>
>>>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>>>> infinite loop but will just prevent that device from functioning at
>>>>>>> all. It would be better if we could figure out what was actually going
>>>>>>> wrong.
>>>>>>>
>>>>>>>
>>>>>> I have tested the problem with three different computers, all switched
>>>>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>>>>> course, they could have been set to AHCI mode, and there the kernel
>>>>>> would boot normally. Feels strange, but so far I was only able to
>>>>>> reproduce the problem with a Toshiba MK8052GSX. On the topic of my
>>>>>> patch, I still don't see why a device which fails so terribly that it
>>>>>> reports 3 exceptions shouldn't be disabled. Like in this case, it could
>>>>>> cause infinite loops.
>>>>>
>>>>>
>>>>>
>>>>> The problem is that this could happen in some cases when you wouldn't
>>>>> want to disable the device, like an error that just happens
>>>>> sporadically and works on retry, or a device you're trying to recover
>>>>> data from.
>>>>>
>>>> What do you think if I edit the patch in a way, that when an operation
>>>> successfully completes, it resets exce_cnt to zero. Might as well add a
>>>> module_param, which can set the maximum value of exce_cnt, while having
>>>> zero as an option to never disable the device. Please don't think me
>>>> wrong, I don't want to force this patch, I just want to learn how all
>>>> this works, and in the process try to make it better. :-)
>>>
>>>
>>> That would be better, but I think you're still going to have an issue
>>> with what magic number to pick to avoid disabling devices
>>> inappropriately.
>>>
>>> Conceptually, disabling the device doesn't really make sense anyway.
>>> If someone in userspace wants to keep trying to read from that device,
>>> why would you stop them because of some arbitrary judgement? The
>>> kernel itself isn't "locked up" during this process, anything not
>>> blocked on I/O to that device should be able to continue running, so
>>> that process is only hurting itself. If the system fails to boot from
>>> another device due to this, this would likely point out some kind of
>>> problem in userspace or the distro boot process being overly
>>> serialized.
>>>
>>
>> I have been booting up with the initramfs from ubuntu 13.04,
>> and I have also tried to boot with the ubuntu install cd. They couldn't
>> continue the boot process. I'm gonna spend the weekend trying to figure
>> out where and why the interrupts don't happen. Whether it be a routing
>> or a hardware issue, which I highly doubt due to the fact that Windows
>> XP SP2 was able to boot up without errors.
>
> Are you able to get out full dmesg output from a boot attempt and the
> contents of /proc/interrupts?
>
As I said before, I cannot get to a shell without my 'symptom cure'.
With my patch applied, I get the following dmesg output (some of my
debug messages are turned off):
http://pastebin.com/5eb5G3Dx
/proc/interrupts is here:
http://pastebin.com/84CJey2D
After yesterday's research, I ended up at ata_piix.c. That file looks
like the real culprit, as my netbook's controller is an Intel ICH7M.
The values I am getting from the device are very different from those
that are expected.
Things I had noticed in dmesg, but ignored:
There is a stack dump because nobody cared about IRQ#20. I ignored it
because it is the EHCI IRQ, and I suppose it has nothing to do with
ATA. The problem is with ata3 (/dev/sdc), while the IRQ happens
with /dev/sda, which works fine.
Things I had not noticed before, but do now:
The ACPI errors at ~0.1329
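
For reference, the reset-on-success idea quoted above (clear the
counter when a command completes, with a configurable limit where zero
means "never disable") can be modelled in plain userspace C. This is
only a rough sketch under assumptions: the names exc_policy,
exc_policy_init, exc_policy_success and exc_policy_exception are
hypothetical and do not exist in libata, and the limit would presumably
come from a module parameter in a real patch.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in libata. */
struct exc_policy {
	unsigned int limit;	/* 0 means "never disable the device" */
	unsigned int count;	/* exceptions since the last success */
	bool disabled;		/* set once the limit has been exceeded */
};

static void exc_policy_init(struct exc_policy *p, unsigned int limit)
{
	p->limit = limit;
	p->count = 0;
	p->disabled = false;
}

/* Call when a command completes successfully: the failure streak ends. */
static void exc_policy_success(struct exc_policy *p)
{
	p->count = 0;
}

/*
 * Call on each exception. Returns true once the device should be
 * disabled, i.e. after more than 'limit' exceptions in a row.
 */
static bool exc_policy_exception(struct exc_policy *p)
{
	p->count++;
	if (p->limit != 0 && p->count > p->limit)
		p->disabled = true;
	return p->disabled;
}
```

With a limit of 2 this matches the "exce_cnt > 2" check in the quoted
patch, except that a sporadic error followed by a successful retry no
longer counts towards disabling the device. It does not address the
question of which limit is appropriate.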
--
Regards,
Levente Kurusa