Re: ide: Linux reports drive diagnostic failures on boot

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: John Snow <jsnow@redhat.com>
To: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>,
	qemu-devel <qemu-devel@nongnu.org>,
	alxndr@bu.edu
Subject: Re: ide: Linux reports drive diagnostic failures on boot
Date: Thu, 15 Oct 2020 17:10:27 -0400	[thread overview]
Message-ID: <6da8dc36-baa8-b9df-6dfc-67191c88eeb4@redhat.com> (raw)
In-Reply-To: <1dbfe85a-10d9-55a4-6eb3-a328de62fbe3@ilande.co.uk>

On 10/15/20 4:17 PM, Mark Cave-Ayland wrote:
> On 13/10/2020 19:39, John Snow wrote:
> 
>> On 10/13/20 6:59 AM, Mark Cave-Ayland wrote:
>>> During my latest OpenBIOS boot tests I've noticed the following IDE 
>>> diagnostics failure message appearing in dmesg at Linux boot time 
>>> when booting from CDROM on both SPARC64 and PPC:
>>>
>> Sorry for the inconvenience.
> 
> Well it wasn't too bad - in my case the kernel was able to recover so it 
> wasn't a complete showstopper for my tests. It seemed worth mentioning 
> in case this causes other failures elsewhere though.
> 
>>> [    9.347342] scsi host0: pata_cmd64x
>>> [    9.369055] scsi host1: pata_cmd64x
>>> [    9.371622] ata1: PATA max UDMA/33 cmd 0x1fe02008000 ctl 
>>> 0x1fe02008080 bmdma 0x1fe02008200 irq 8
>>> [    9.372805] ata2: PATA max UDMA/33 cmd 0x1fe02008100 ctl 
>>> 0x1fe02008180 bmdma 0x1fe02008208 irq 8
>>> [    9.711740] ata2.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100
>>> [    9.712591] ata2.00: Drive reports diagnostics failure. This may 
>>> indicate a drive
>>> [    9.713256] ata2.00: fault or invalid emulation. Contact drive 
>>> vendor for information.
>>> [    9.737677] ata2.00: configured for UDMA/33
>>> [    9.790179] scsi 1:0:0:0: CD-ROM            QEMU     QEMU DVD-ROM 
>>> 2.5+ PQ: 0 ANSI: 5
>>> [   10.381172] hme 0000:01:01.1 enp1s1f1: renamed from eth0
>>> [   10.508819] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw 
>>> xa/form2 tray
>>> [   10.509805] cdrom: Uniform CD-ROM driver Revision: 3.20
>>>
>>>
>>> A session with git bisect points to the following commit:
>>>
>>> 55adb3c45620c31f29978f209e2a44a08d34e2da is the first bad commit
>>> commit 55adb3c45620c31f29978f209e2a44a08d34e2da
>>> Author: John Snow <jsnow@redhat.com>
>>> Date:   Fri Jul 24 01:23:00 2020 -0400
>>>
>>>      ide: cancel pending callbacks on SRST
>>>
>>>      The SRST implementation did not keep up with the rest of IDE; it is
>>>      possible to perform a weak reset on an IDE device to remove the 
>>> BSY/DRQ
>>>      bits, and then issue writes to the control/device registers 
>>> which can
>>>      cause chaos with the state machine.
>>>
>>>      Fix that by actually performing a real reset.
>>>
>>>      Reported-by: Alexander Bulekov <alxndr@bu.edu>
>>>      Fixes: https://bugs.launchpad.net/qemu/+bug/1878253
>>>      Fixes: https://bugs.launchpad.net/qemu/+bug/1887303
>>>      Fixes: https://bugs.launchpad.net/qemu/+bug/1887309
>>>      Signed-off-by: John Snow <jsnow@redhat.com>
>>>
>>> :040000 040000 70a7c1cfbafb22fa815d3ae4d7ed075ac3918188 
>>> 3f37762f20e9ca9d2874eaf819d7175a1dca1dd9 M      hw
>>>
>>>
>>> John/Alexander: any chance you could take a look at this? The message 
>>> above is really simple to reproduce under qemu-system-sparc64 using 
>>> http://cdimage.debian.org/cdimage/ports/9.0/sparc64/iso-cd/debian-9.0-sparc64-NETINST-1.iso 
>>> and the following command line:
>>>
>>> ./qemu-system-sparc64 \
>>>      -cdrom debian-9.0-sparc64-NETINST-1.iso \
>>>      -m 256 \
>>>      -nographic \
>>>      -boot d
>>>
>>>
>>> ATB,
>>>
>>> Mark.
>>>
>>
>> Shucks.
>>
>> This patch happened because the old SRST code reset the IDE state 
>> machine (cleared the status register) but then didn't cancel any of 
>> the pending callbacks, so it was possible to shuffle the state machine 
>> off the rails onto junk data. Obviously bad.
>>
>> Now, SRST actually cancels the callbacks which I thought would have 
>> been safe, but it's possible that doing a "real" reset here is 
>> touching more registers than it ought to.
>>
>> Let's take a look at the linux source code ...
>>
>> /* Let the user know. We don't want to disallow opens for
>>     rescue purposes, or in case the vendor is just a blithering
>>     idiot. Do this after the dev_config call as some controllers
>>     with buggy firmware may want to avoid reporting false device
>>     bugs */
>>
>> Ah, always a nice day to be called an idiot. Thank you for your 
>> service, Alan Cox.
>>
>> This message gets printed when ATA_HORKAGE_DIAGNOSTIC is set. 
>> libata-sff.c suggests this happens when the error register* comes back 
>> 0x00 after an SRST.
>>
>> (*I think that tf.feature is only feature on writes, but error on 
>> reads. Same address.)
>>
>> Now, ide_reset -- which we use for initialization and resets both 
>> always sets the error register to 0x00. libata thinks that 0x00 means 
>> a failed diagnostics test, though.
>>
>> This ought to be covered by ATA8-APT. Section 9.2, Software reset 
>> protocol.
>>
>> Cliff notes:
>>
>> - Host writes to SRST and waits for 5 μs.
>> - Both devices obey the SRST write, regardless of drive selection.
>> - Each device clears their registers and sets BSY (within 400ns.)
>> - Host clears SRST and waits for at least 2ms.
>> - Host polls devices, waiting for BSY to be 0.
>>
>>
>> device0 can set bit7 in the error register to 0 or 1, depending on the 
>> presence or absence of device1 and how it behaves following a 
>> diagnostic test.
>>
>>
>> Device 1 is absent: bit7 is cleared.
>> Device 1 is present:
>>    - If PDIAG- is asserted, bit7 is cleared.
>>    - If PDIAG- is not asserted within 31 seconds, bit7 is set.
>>
>> Then, ah:
>>
>> The EXECUTE DEVICE DIAGNOSTICS diagnostic code shall be placed in bits 
>> (6:0) of the Error register (See Clause 0). The device shall set the 
>> signature values (See Clause 0). The content of the Features register 
>> is undefined.
>>
>> I got this pretty wrong. I'm seeing a few problems:
>>
>> 1. I thought SRST was triggered on the falling edge, but that's not 
>> entirely true. BSY should be set immediately and the SRST can begin as 
>> soon as possible. The device does not seem to have any interaction 
>> with the SRST bit being cleared from what I can tell.
>>
>> 2. We aren't running the diagnostic command, actually. That should fix 
>> this particular case. The old version of the code had an open-coded 
>> version of this, but it wasn't clear at the time this is what it was 
>> doing.
>>
>> 3. There are likely other things to handle relating to the 
>> presence/absence of device1 that we have never done for either version 
>> of the code.
>>
>>
>>
>> Try this patch:
>>
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 693b352d5e..98cea7ad45 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -2254,10 +2254,8 @@ static void ide_perform_srst(IDEState *s)
>>       /* Cancel PIO callback, reset registers/signature, etc */
>>       ide_reset(s);
>>
>> -    if (s->drive_kind == IDE_CD) {
>> -        /* ATAPI drives do not set READY or SEEK */
>> -        s->status = 0x00;
>> -    }
>> +    /* perform diagnostic */
>> +    cmd_exec_dev_diagnostic(s, WIN_DIAGNOSE);
>>   }
>>
>>   static void ide_bus_perform_srst(void *opaque)
>> @@ -2282,9 +2280,7 @@ void ide_ctrl_write(void *opaque, uint32_t addr, 
>> uint32_t val)
>>
>>       /* Device0 and Device1 each have their own control register,
>>        * but QEMU models it as just one register in the controller. */
>> -    if ((bus->cmd & IDE_CTRL_RESET) &&
>> -        !(val & IDE_CTRL_RESET)) {
>> -        /* SRST triggers on falling edge */
>> +    if (!(bus->cmd & IDE_CTRL_RESET) && (val & IDE_CTRL_RESET)) {
>>           for (i = 0; i < 2; i++) {
>>               s = &bus->ifs[i];
>>               s->status |= BUSY_STAT;
> 
> I've just given this a quick boot test on qemu-system-sparc64 (both HD 
> and CD) and it seems to fix the problem here - thanks!
> 
> 
> ATB,
> 
> Mark.
> 

Thanks for testing! I'll spin this into a patch proper and add a T-B 
line for you here.

--js

     prev parent reply	other threads:[~2020-10-15 21:12 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-13 10:59 ide: Linux reports drive diagnostic failures on boot Mark Cave-Ayland
2020-10-13 18:39 ` John Snow
2020-10-15 20:17   ` Mark Cave-Ayland
2020-10-15 21:10     ` John Snow [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6da8dc36-baa8-b9df-6dfc-67191c88eeb4@redhat.com \
    --to=jsnow@redhat.com \
    --cc=alxndr@bu.edu \
    --cc=mark.cave-ayland@ilande.co.uk \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).