From: Levente Kurusa
Subject: Re: [PATCH] BIOS SATA legacy mode failure
Date: Sun, 22 Sep 2013 09:13:35 +0200
Message-ID: <523E989F.5040800@linux.com>
References: <522C1AC5.4080105@linux.com> <522E9982.2060504@gmail.com> <52347C24.8060102@linux.com> <523887BC.50704@linux.com> <523D4C4C.5070400@linux.com>
Reply-To: levex@linux.com
List-Id: linux-ide@vger.kernel.org
To: Robert Hancock
Cc: "linux-ide@vger.kernel.org"

On 2013-09-21 19:04, Robert Hancock wrote:
> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa wrote:
>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>> dmesg:
>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>> ata3: soft resetting link
>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>> ata3: EH complete
>>>>>>>>
>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>>                              ehc->i.action, frozen, tries_buf);
>>>>>>>>                  if (desc)
>>>>>>>>                          ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>> +                ehc->i.dev->exce_cnt ++;
>>>>>>>> +                ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>>> +                /**
>>>>>>>> +                 * The device is failing terribly,
>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>> +                 */
>>>>>>>> +                if(ehc->i.dev->exce_cnt > 2)
>>>>>>>> +                        ata_dev_disable(ehc->i.dev);
>>>>>>>>          } else {
>>>>>>>>                  ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>                               "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>> --- a/include/linux/libata.h
>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>          u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>
>>>>>>>>          /* error history */
>>>>>>>> -        int                     spdn_cnt;
>>>>>>>> +        int                     spdn_cnt; /* Number of speed_downs */
>>>>>>>> +        int                     exce_cnt; /* Number of exceptions that happenned */
>>>>>>>>          /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>>          struct ata_ering        ering;
>>>>>>>>  };
>>>>>>>>
>>>>>>>
>>>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>>>> infinite loop but will just prevent that device from functioning at
>>>>>>> all.
>>>>>>> It would be better if we could figure out what was actually going
>>>>>>> wrong.
>>>>>>>
>>>>>>>
>>>>>> I have tested the problem with three different computers, all switched
>>>>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>>>>> course, they could have been set to AHCI mode, and there the kernel
>>>>>> would boot normally. Feels strange, but so far I was only able to
>>>>>> reproduce the problem with a Toshiba MK8052GSX. On the topic of my
>>>>>> patch, I still don't see why a device which fails so terribly that it
>>>>>> reports 3 exceptions shouldn't be disabled. Like in this case, it could
>>>>>> cause infinite loops.
>>>>>
>>>>>
>>>>>
>>>>> The problem is that this could happen in some cases when you wouldn't
>>>>> want to disable the device, like an error that just happens
>>>>> sporadically and works on retry, or a device you're trying to recover
>>>>> data from.
>>>>>
>>>> What do you think if I edit the patch in a way, that when an operation
>>>> successfully completes, it resets exce_cnt to zero. Might as well add a
>>>> module_param, which can set the maximum value of exce_cnt, while having
>>>> zero as an option to never disable the device. Please don't think me
>>>> wrong, I don't want to force this patch, I just want to learn how all
>>>> this works, and in the process try to make it better. :-)
>>>
>>>
>>> That would be better, but I think you're still going to have an issue
>>> with what magic number to pick to avoid disabling devices
>>> inappropriately.
>>>
>>> Conceptually, disabling the device doesn't really make sense anyway.
>>> If someone in userspace wants to keep trying to read from that device,
>>> why would you stop them because of some arbitrary judgement? The
>>> kernel itself isn't "locked up" during this process, anything not
>>> blocked on I/O to that device should be able to continue running, so
>>> that process is only hurting itself. If the system fails to boot from
>>> another device due to this, this would likely point out some kind of
>>> problem in userspace or the distro boot process being overly
>>> serialized.
>>>
>>
>> I have been booting up with the initramfs from ubuntu 13.04,
>> and I have also tried to boot with the ubuntu install cd. They couldn't
>> continue the boot process. I'm gonna spend the weekend trying to figure
>> out where and why the interrupts don't happen. Whether it be a routing
>> or a hardware issue, which I highly doubt due to the fact that Windows
>> XP SP2 was able to boot up without errors.
>
> Are you able to get out full dmesg output from a boot attempt and the
> contents of /proc/interrupts?
>
As I said before, I am not able to get to the shell without my 'symptom
cure'.
With my patch I get the following dmesg output, with some of my debug
messages turned off:
http://pastebin.com/5eb5G3Dx

/proc/interrupts is here:
http://pastebin.com/84CJey2D

After yesterday's research, I have come to ata_piix.c. That file looks
like the real culprit, as my netbook's controller is an Intel ICH7M one.
The values I am getting from the device are very different from those
that are expected.

Things I have noticed, but ignored, in dmesg:
There is a stack dump, because nobody cared about IRQ#20. I have ignored
this because it is the EHCI IRQ, and I suppose it has nothing to do with
ata. The problem is with ata3 or /dev/sdc, while the IRQ happens with
/dev/sda, which works fine.

Things I had not noticed before, but have now:
The ACPI errors at ~0.1329

-- 
Regards,
Levente Kurusa
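
P.S. For reference, the revised policy floated in the quoted discussion
(reset the counter when an operation completes successfully, make the
limit configurable, and treat zero as "never disable") could be sketched
in plain userspace C roughly as below. This is only an illustration of
the counting logic, not libata code: struct sim_device, max_exceptions
and both functions are made-up names, and max_exceptions merely stands
in for the module parameter mentioned above. It does not answer the
objection about which limit is the right one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-device state; NOT the kernel's struct ata_device. */
struct sim_device {
    int  exce_cnt;   /* exceptions seen since the last successful operation */
    bool disabled;   /* set once the limit has been reached */
};

/*
 * Hypothetical tunable playing the role of the proposed module parameter.
 * 0 means "never disable the device". The default of 3 matches the
 * "exce_cnt > 2" check in the quoted patch.
 */
static int max_exceptions = 3;

/* Called when an operation completes successfully: forget earlier failures. */
static void sim_command_succeeded(struct sim_device *dev)
{
    assert(dev != NULL);
    dev->exce_cnt = 0;
}

/*
 * Called once per reported exception.
 * Returns true if the device is (now) disabled.
 */
static bool sim_exception_reported(struct sim_device *dev)
{
    assert(dev != NULL);
    dev->exce_cnt++;
    if (max_exceptions > 0 && dev->exce_cnt >= max_exceptions)
        dev->disabled = true;
    return dev->disabled;
}
```

With the default limit, a device that fails twice, succeeds once, and
then fails again is not disabled until it has failed three times in a
row; with max_exceptions set to 0 it is never disabled.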