From: John Snow <jsnow@redhat.com>
To: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>,
qemu-devel <qemu-devel@nongnu.org>,
alxndr@bu.edu
Subject: Re: ide: Linux reports drive diagnostic failures on boot
Date: Thu, 15 Oct 2020 17:10:27 -0400 [thread overview]
Message-ID: <6da8dc36-baa8-b9df-6dfc-67191c88eeb4@redhat.com> (raw)
In-Reply-To: <1dbfe85a-10d9-55a4-6eb3-a328de62fbe3@ilande.co.uk>
On 10/15/20 4:17 PM, Mark Cave-Ayland wrote:
> On 13/10/2020 19:39, John Snow wrote:
>
>> On 10/13/20 6:59 AM, Mark Cave-Ayland wrote:
>>> During my latest OpenBIOS boot tests I've noticed the following IDE
>>> diagnostics failure message appearing in dmesg at Linux boot time
>>> when booting from CDROM on both SPARC64 and PPC:
>>>
>> Sorry for the inconvenience.
>
> Well it wasn't too bad - in my case the kernel was able to recover so it
> wasn't a complete showstopper for my tests. It seemed worth mentioning
> in case this causes other failures elsewhere though.
>
>>> [ 9.347342] scsi host0: pata_cmd64x
>>> [ 9.369055] scsi host1: pata_cmd64x
>>> [ 9.371622] ata1: PATA max UDMA/33 cmd 0x1fe02008000 ctl
>>> 0x1fe02008080 bmdma 0x1fe02008200 irq 8
>>> [ 9.372805] ata2: PATA max UDMA/33 cmd 0x1fe02008100 ctl
>>> 0x1fe02008180 bmdma 0x1fe02008208 irq 8
>>> [ 9.711740] ata2.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100
>>> [ 9.712591] ata2.00: Drive reports diagnostics failure. This may
>>> indicate a drive
>>> [ 9.713256] ata2.00: fault or invalid emulation. Contact drive
>>> vendor for information.
>>> [ 9.737677] ata2.00: configured for UDMA/33
>>> [ 9.790179] scsi 1:0:0:0: CD-ROM QEMU QEMU DVD-ROM
>>> 2.5+ PQ: 0 ANSI: 5
>>> [ 10.381172] hme 0000:01:01.1 enp1s1f1: renamed from eth0
>>> [ 10.508819] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw
>>> xa/form2 tray
>>> [ 10.509805] cdrom: Uniform CD-ROM driver Revision: 3.20
>>>
>>>
>>> A session with git bisect points to the following commit:
>>>
>>> 55adb3c45620c31f29978f209e2a44a08d34e2da is the first bad commit
>>> commit 55adb3c45620c31f29978f209e2a44a08d34e2da
>>> Author: John Snow <jsnow@redhat.com>
>>> Date: Fri Jul 24 01:23:00 2020 -0400
>>>
>>> ide: cancel pending callbacks on SRST
>>>
>>> The SRST implementation did not keep up with the rest of IDE; it is
>>> possible to perform a weak reset on an IDE device to remove the
>>> BSY/DRQ
>>> bits, and then issue writes to the control/device registers
>>> which can
>>> cause chaos with the state machine.
>>>
>>> Fix that by actually performing a real reset.
>>>
>>> Reported-by: Alexander Bulekov <alxndr@bu.edu>
>>> Fixes: https://bugs.launchpad.net/qemu/+bug/1878253
>>> Fixes: https://bugs.launchpad.net/qemu/+bug/1887303
>>> Fixes: https://bugs.launchpad.net/qemu/+bug/1887309
>>> Signed-off-by: John Snow <jsnow@redhat.com>
>>>
>>> :040000 040000 70a7c1cfbafb22fa815d3ae4d7ed075ac3918188
>>> 3f37762f20e9ca9d2874eaf819d7175a1dca1dd9 M hw
>>>
>>>
>>> John/Alexander: any chance you could take a look at this? The message
>>> above is really simple to reproduce under qemu-system-sparc64 using
>>> http://cdimage.debian.org/cdimage/ports/9.0/sparc64/iso-cd/debian-9.0-sparc64-NETINST-1.iso
>>> and the following command line:
>>>
>>> ./qemu-system-sparc64 \
>>> -cdrom debian-9.0-sparc64-NETINST-1.iso \
>>> -m 256 \
>>> -nographic \
>>> -boot d
>>>
>>>
>>> ATB,
>>>
>>> Mark.
>>>
>>
>> Shucks.
>>
>> This patch happened because the old SRST code reset the IDE state
>> machine (cleared the status register) but then didn't cancel any of
>> the pending callbacks, so it was possible to shuffle the state machine
>> off the rails onto junk data. Obviously bad.
>>
>> Now, SRST actually cancels the callbacks which I thought would have
>> been safe, but it's possible that doing a "real" reset here is
>> touching more registers than it ought to.
>>
>> Let's take a look at the linux source code ...
>>
>> /* Let the user know. We don't want to disallow opens for
>> rescue purposes, or in case the vendor is just a blithering
>> idiot. Do this after the dev_config call as some controllers
>> with buggy firmware may want to avoid reporting false device
>> bugs */
>>
>> Ah, always a nice day to be called an idiot. Thank you for your
>> service, Alan Cox.
>>
>> This message gets printed when ATA_HORKAGE_DIAGNOSTIC is set.
>> libata-sff.c suggests this happens when the error register* comes back
>> 0x00 after an SRST.
>>
>> (*I think that tf.feature is only feature on writes, but error on
>> reads. Same address.)
>>
>> Now, ide_reset -- which we use for initialization and resets both
>> always sets the error register to 0x00. libata thinks that 0x00 means
>> a failed diagnostics test, though.
>>
>> This ought to be covered by ATA8-APT. Section 9.2, Software reset
>> protocol.
>>
>> Cliff notes:
>>
>> - Host writes to SRST and waits for 5 μs.
>> - Both devices obey the SRST write, regardless of drive selection.
>> - Each device clears their registers and sets BSY (within 400ns.)
>> - Host clears SRST and waits for at least 2ms.
>> - Host polls devices, waiting for BSY to be 0.
>>
>>
>> device0 can set bit7 in the error register to 0 or 1, depending on the
>> presence or absence of device1 and how it behaves following a
>> diagnostic test.
>>
>>
>> Device 1 is absent: bit7 is cleared.
>> Device 1 is present:
>> - If PDIAG- is asserted, bit7 is cleared.
>> - If PDIAG- is not asserted within 31 seconds, bit7 is set.
>>
>> Then, ah:
>>
>> The EXECUTE DEVICE DIAGNOSTICS diagnostic code shall be placed in bits
>> (6:0) of the Error register (See Clause 0). The device shall set the
>> signature values (See Clause 0). The content of the Features register
>> is undefined.
>>
>> I got this pretty wrong. I'm seeing a few problems:
>>
>> 1. I thought SRST was triggered on the falling edge, but that's not
>> entirely true. BSY should be set immediately and the SRST can begin as
>> soon as possible. The device does not seem to have any interaction
>> with the SRST bit being cleared from what I can tell.
>>
>> 2. We aren't running the diagnostic command, actually. That should fix
>> this particular case. The old version of the code had an open-coded
>> version of this, but it wasn't clear at the time this is what it was
>> doing.
>>
>> 3. There are likely other things to handle relating to the
>> presence/absence of device1 that we have never done for either version
>> of the code.
>>
>>
>>
>> Try this patch:
>>
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 693b352d5e..98cea7ad45 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -2254,10 +2254,8 @@ static void ide_perform_srst(IDEState *s)
>> /* Cancel PIO callback, reset registers/signature, etc */
>> ide_reset(s);
>>
>> - if (s->drive_kind == IDE_CD) {
>> - /* ATAPI drives do not set READY or SEEK */
>> - s->status = 0x00;
>> - }
>> + /* perform diagnostic */
>> + cmd_exec_dev_diagnostic(s, WIN_DIAGNOSE);
>> }
>>
>> static void ide_bus_perform_srst(void *opaque)
>> @@ -2282,9 +2280,7 @@ void ide_ctrl_write(void *opaque, uint32_t addr,
>> uint32_t val)
>>
>> /* Device0 and Device1 each have their own control register,
>> * but QEMU models it as just one register in the controller. */
>> - if ((bus->cmd & IDE_CTRL_RESET) &&
>> - !(val & IDE_CTRL_RESET)) {
>> - /* SRST triggers on falling edge */
>> + if (!(bus->cmd & IDE_CTRL_RESET) && (val & IDE_CTRL_RESET)) {
>> for (i = 0; i < 2; i++) {
>> s = &bus->ifs[i];
>> s->status |= BUSY_STAT;
>
> I've just given this a quick boot test on qemu-system-sparc64 (both HD
> and CD) and it seems to fix the problem here - thanks!
>
>
> ATB,
>
> Mark.
>
Thanks for testing! I'll spin this into a patch proper and add a T-B
line for you here.
--js
prev parent reply other threads:[~2020-10-15 21:12 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-13 10:59 ide: Linux reports drive diagnostic failures on boot Mark Cave-Ayland
2020-10-13 18:39 ` John Snow
2020-10-15 20:17 ` Mark Cave-Ayland
2020-10-15 21:10 ` John Snow [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6da8dc36-baa8-b9df-6dfc-67191c88eeb4@redhat.com \
--to=jsnow@redhat.com \
--cc=alxndr@bu.edu \
--cc=mark.cave-ayland@ilande.co.uk \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).