[Qemu-devel] NMI handling


* [Qemu-devel] NMI handling
From: Artyom Tarasenko @ 2010-06-21 18:20 UTC
  To: Blue Swirl; +Cc: qemu-devel

2010/5/25 Blue Swirl <blauwirbel@gmail.com>:
>>> About bugs, IIRC NetBSD 3.x crash could be related to IOMMU.
>>
>> What indicates that? It happens where the disk sizes are normally
>> reported, so it could be a scsi/dma/irq/fpu issue as well.
>
> IIRC the DVMA address was 0xfc004000, but the mapped entries were for
> 0xfc000000 to 0xfc003fff.

Hmm. It happens in all NetBSD versions from 1.6 to 3.1 inclusive.
Which is probably a sign that the problem resides in qemu and not in
NetBSD.

It looks like we have multiple problems here: they start with the
0xfc004000 access (which can theoretically be expected on real
hardware too) as you pointed out, but what happens afterwards is
strange too:

- In the current qemu implementation we have a screaming NMI which
NetBSD cannot clear. This happens because the NMI in qemu is literally
non-maskable, while on real hardware it can be masked with the
'mask all' flag. I'll send a patch for it.

- With the masking patch, the NMI is no longer screaming but is still
perceived as spurious. This may be OK if NetBSD (1.6-3.1) doesn't have
a moduleerr_handler set. I don't see it set, although a
moduleerr_handler variable is defined. It would be nice if someone with
NetBSD kernel knowledge could comment on this.

- The current implementation of NMI pending clearing in qemu may be
incomplete: if the source is not cleared, the pending NMI bit can be
set again right after the user writes to the clear pending register.
If I read page 39 of the Sun4m System Architecture correctly,
points (2) and (6) suggest that once a pending NMI is cleared in a CPU,
that CPU doesn't see the NMI until the next one arrives. While this
makes sense, especially in SMP configurations, I don't see how it is
level-sensitive in this case. If 'another broadcast' means a new
external event, the NMI behavior seems mixed: turned on by edge, and
off by level. But the documentation says 'level-sensitive' everywhere.
Ideas?
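
To make the question concrete, here is a minimal toy model in C. Every
name in it is hypothetical; it is not qemu's interrupt controller code
and it does not claim to match the Sun4m documentation. It only
contrasts two readings: a purely level-sensitive pending bit, which is
re-asserted immediately while the source stays high, and an
edge-latched one, which stays clear until a new event arrives. A
'mask all' flag is applied on top of either.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; not taken from qemu or from real hardware. */
struct toy_intctl {
    bool source_level;     /* state of the external NMI source line */
    bool pending;          /* pending NMI bit as seen by the CPU */
    bool mask_all;         /* 'mask all' flag */
    bool level_sensitive;  /* which reading of the spec is modelled */
};

/* A new external event: the source goes high and the bit becomes pending. */
static void toy_raise_source(struct toy_intctl *s)
{
    assert(s != NULL);
    s->source_level = true;
    s->pending = true;
}

/* The source is cleared. Only a level-sensitive bit follows it down. */
static void toy_lower_source(struct toy_intctl *s)
{
    assert(s != NULL);
    s->source_level = false;
    if (s->level_sensitive) {
        s->pending = false;
    }
}

/* Software writes to the 'clear pending' register. */
static void toy_clear_pending(struct toy_intctl *s)
{
    assert(s != NULL);
    s->pending = false;
    /* Level-sensitive: the bit comes straight back while the source is high. */
    if (s->level_sensitive && s->source_level) {
        s->pending = true;
    }
}

/* The CPU takes the NMI only if it is pending and not masked. */
static bool toy_cpu_sees_nmi(const struct toy_intctl *s)
{
    assert(s != NULL);
    return s->pending && !s->mask_all;
}
```

In the level-sensitive variant, toy_clear_pending() has no visible
effect until the source is lowered, which is the "screaming" behavior;
in the edge-latched variant the bit stays clear until the next
toy_raise_source().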

-- 
Regards,
Artyom Tarasenko

solaris/sparc under qemu blog: http://tyom.blogspot.com/


* [Qemu-devel] Re: NMI handling
From: Artyom Tarasenko @ 2010-07-26 16:53 UTC
  To: Blue Swirl; +Cc: qemu-devel

2010/6/21 Artyom Tarasenko <atar4qemu@googlemail.com>:
> 2010/5/25 Blue Swirl <blauwirbel@gmail.com>:
>>>> About bugs, IIRC NetBSD 3.x crash could be related to IOMMU.
>>>
>>> What indicates that? It happens where the disk sizes are normally
>>> reported, so it could be a scsi/dma/irq/fpu issue as well.
>>
>> IIRC the DVMA address was 0xfc004000, but the mapped entries were for
>> 0xfc000000 to 0xfc003fff.

That is under OpenBIOS. Even less is mapped with OBP, and much less if
the network card is disabled.

> It looks like we have multiple problems here: they start with the
> 0xfc004000 access (which can theoretically be expected on real
> hardware too) as you pointed out, but what happens afterwards is
> strange too:
>
> - In the current qemu implementation we have a screaming NMI which
> NetBSD cannot clear. This happens because the NMI in qemu is literally
> non-maskable, while on real hardware it can be masked with the
> 'mask all' flag. I'll send a patch for it.
>
> - With the masking patch, the NMI is no longer screaming but is still
> perceived as spurious. This may be OK if NetBSD (1.6-3.1) doesn't have
> a moduleerr_handler set.

Or because a SCSI DMA transfer on real hardware never generates an NMI.

In the current implementation, when "select with attention" is
processed, the SCSI controller initiates a DMA transfer and fetches a
CDB. If the DMA fails (not mapped, or not allowed), an NMI is generated.
It is quite a strange design: such an error is an asynchronous event,
and the CPU wouldn't know that the SCSI controller tried to do DMA at
a certain address. It would have been more consistent to send the error
notification to the DMA initiator (the SCSI controller in this case),
not to the CPU.

The offending code in NetBSD 1.6-3.1:

// Under qemu it crashes here, because the DMA page is not valid yet.
NCRCMD(sc, NCRCMD_SELATN | NCRCMD_DMA);
// The page would have been made valid here.
NCRDMA_SETUP(sc, &sc->sc_cmdp, &sc->sc_cmdlen, 0, &dmasize);
NCRDMA_GO(sc);

In the working versions (before 1.6 and after 4.0) the code looks like this:

NCRDMA_SETUP(sc, &sc->sc_cmdp, &sc->sc_cmdlen, 0, &dmasize);
//...
NCRCMD(sc, NCRCMD_SELATN | NCRCMD_DMA);
NCRDMA_GO(sc);

After debugging the code on real hardware, it looks like qemu has
multiple problems in the scsi/dma/iommu layer.

I modified NCRDMA_SETUP so that it performed the DMA transfer without
mapping the page. In this case NetBSD 3.1 shows the following error (on
a real SS-20):

dma0: error: csr=a4440212<ERR,DRAINING=0,IEN,ENDMA,BURST=1,FASTER,ALOADED>
esp0: DMA error; resetting
dma0: error: csr=a4440212<ERR,DRAINING=0,IEN,ENDMA,BURST=1,FASTER,ALOADED>

no NMI.
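
A minimal sketch of that behavior, assuming a toy IOMMU with a single
mapped range and a toy DMA engine (all names and the ERR bit value are
hypothetical, not qemu's or NetBSD's definitions): a failed translation
sets an error bit in the DMA controller's own status register and
reports failure to the initiator, and no NMI is raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error bit; the real CSR layout is not modelled here. */
#define TOY_DMA_CSR_ERR UINT32_C(0x2)

/* Toy IOMMU: one contiguous mapped DVMA range, bounds inclusive. */
struct toy_iommu {
    uint32_t first;
    uint32_t last;
};

/* Toy DMA engine state. */
struct toy_dma {
    uint32_t csr;         /* status register read back by the driver */
    unsigned nmi_raised;  /* counts NMIs; this model never raises one */
};

/* Returns true if the DVMA address falls inside the mapped range. */
static bool toy_iommu_is_mapped(const struct toy_iommu *m, uint32_t dvma)
{
    assert(m != NULL);
    return dvma >= m->first && dvma <= m->last;
}

/*
 * Attempt a DMA access on behalf of a device. On a translation failure
 * the error is latched in the DMA CSR and returned to the initiator;
 * the CPU is not interrupted with an NMI.
 */
static bool toy_dma_access(struct toy_dma *d, const struct toy_iommu *m,
                           uint32_t dvma)
{
    assert(d != NULL);
    if (!toy_iommu_is_mapped(m, dvma)) {
        d->csr |= TOY_DMA_CSR_ERR;
        return false;
    }
    return true;
}
```

With the range from the earlier report (0xfc000000 to 0xfc003fff
mapped), an access to 0xfc004000 fails, the error bit is set, and the
driver can react as esp0 does above by resetting.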

And what is more important, on real hardware "select with
attention" does not initiate DMA (I inserted a delay, waited 2 seconds,
and nothing happened). It has to be started manually.

Any suggestions on how to fix it within the current iommu/dma
architecture? Looks like "select with attention" should register
callbacks?  ( Volunteers? ;-) )

-- 
Regards,
Artyom Tarasenko

solaris/sparc under qemu blog: http://tyom.blogspot.com/


* [Qemu-devel] Re: NMI handling
From: Blue Swirl @ 2010-07-26 17:57 UTC
  To: Artyom Tarasenko; +Cc: qemu-devel

Excellent analysis!

About NMI: IOMMU just raises the qemu_irq provided by sun4m.c. The
interrupt bit number is currently 30, which is Module Error
(asynchronous fault). Maybe this should be 29, MSI (MBus-SBus
Interface) interrupt? That is still NMI though. Could you check what
interrupt bits get active in the interrupt controller master status?
What is in IOMMU AFSR?
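
For checking the master status, a small helper along these lines might
do. It is only a sketch: the two bit numbers are the ones mentioned
above (30 for Module Error, 29 for MSI), and the function and constant
names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as discussed above; the names are hypothetical. */
enum {
    TOY_BIT_MSI = 29,         /* MSI (MBus-SBus Interface) interrupt */
    TOY_BIT_MODULE_ERR = 30   /* Module Error (asynchronous fault) */
};

/* Build a single-bit mask for bit n of a 32-bit status word. */
static uint32_t toy_bit(unsigned n)
{
    assert(n < 32);
    return UINT32_C(1) << n;
}

/* True if the Module Error bit is active in the master status word. */
static bool toy_is_module_error(uint32_t master_status)
{
    return (master_status & toy_bit(TOY_BIT_MODULE_ERR)) != 0;
}

/* True if the MSI bit is active in the master status word. */
static bool toy_is_msi(uint32_t master_status)
{
    return (master_status & toy_bit(TOY_BIT_MSI)) != 0;
}
```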

About select with attention: NCRDMA_GO just tweaks DMA controller, so
ESP shouldn't perform the transfer if DMA is not ready. I think Linux
always pre-programs DMA.

One way to handle this would be to add a qemu_irq signal from DMA to
ESP which tells ESP whether DMA is ready. DMA raises or lowers the
interrupt whenever DMA is enabled or disabled. When the IRQ is
received by ESP, if there is no transfer pending, it just adjusts an
internal flag about DMA status. If there is a transfer pending, it is
started. When ESP handles a command, it should check the internal DMA
flag. If DMA is ready, continue with the transfer immediately like
now. Otherwise, hold the transfer and store parameters to internal
state. I wonder what state bits ESP will show when this happens.
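
The handshake described above could be sketched roughly like this. It
is a toy model under stated assumptions, not code from esp.c or the DMA
controller: the struct, the function names, and the way a "command" is
represented as a plain integer are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ESP-side state for the proposed DMA-ready signal. */
struct toy_esp {
    bool dma_ready;      /* last level seen on the DMA -> ESP line */
    bool xfer_pending;   /* a command is being held until DMA is ready */
    unsigned held_cmd;   /* parameters stored while waiting */
    unsigned last_cmd;   /* last command whose transfer was started */
    unsigned started;    /* number of transfers actually started */
};

/* Start the transfer for a command (stands in for the real DMA work). */
static void toy_esp_start_transfer(struct toy_esp *s, unsigned cmd)
{
    assert(s != NULL);
    s->xfer_pending = false;
    s->last_cmd = cmd;
    s->started++;
}

/* ESP handles a DMA command such as "select with attention". */
static void toy_esp_handle_cmd(struct toy_esp *s, unsigned cmd)
{
    assert(s != NULL);
    if (s->dma_ready) {
        /* DMA is ready: continue immediately, as the code does now. */
        toy_esp_start_transfer(s, cmd);
    } else {
        /* Otherwise hold the transfer and remember its parameters. */
        s->held_cmd = cmd;
        s->xfer_pending = true;
    }
}

/* DMA raised or lowered the ready line (DMA enabled or disabled). */
static void toy_esp_dma_ready_changed(struct toy_esp *s, bool level)
{
    assert(s != NULL);
    s->dma_ready = level;
    /* A held transfer is started as soon as DMA becomes ready. */
    if (level && s->xfer_pending) {
        toy_esp_start_transfer(s, s->held_cmd);
    }
}
```

With this ordering, a guest that issues the command before setting up
DMA (as NetBSD 1.6-3.1 does) would have the transfer held until DMA is
enabled, while a guest that pre-programs DMA sees no change.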
