PCIe cycle sequence when updating the msi-x table

public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed

* PCIe cycle sequence when updating the msi-x table
@ 2023-04-07 16:17 David Laight
  2023-04-07 19:55 ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2023-04-07 16:17 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, linux-pci@vger.kernel.org

The function that updates the MSI-X table currently reads:

static inline void pci_write_msg_msix(struct msi_desc *desc, struct msi_msg *msg)
{
	void __iomem *base = pci_msix_desc_addr(desc);
	u32 ctrl = desc->pci.msix_ctrl;
	bool unmasked = !(ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT);

	if (desc->pci.msi_attrib.is_virtual)
		return;
	/*
	 * The specification mandates that the entry is masked
	 * when the message is modified:
	 *
	 * "If software changes the Address or Data value of an
	 * entry while the entry is unmasked, the result is
	 * undefined."
	 */
	if (unmasked)
		pci_msix_write_vector_ctrl(desc, ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT);

	writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
	writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
	writel(msg->data, base + PCI_MSIX_ENTRY_DATA);

	if (unmasked)
		pci_msix_write_vector_ctrl(desc, ctrl);

	/* Ensure that the writes are visible in the device */
	readl(base + PCI_MSIX_ENTRY_DATA);
}

Now I'm not 100% sure about the cycle ordering rules.
I've not got the spec here, and I recall it isn't that easy to
understand.
I can't remember whether reads are allowed to overtake writes
(to different addresses).
I do remember that reading back the same address was needed to
flush the cpu store buffer on some space cpus.
So it might be that the final readl() won't actually flush
the write to the mask register.
(The same readback of 'the wrong' address also happens elsewhere.)

But there is a bigger problem.
As the comment says writing the address/data while an entry is
unmasked must be avoided (because a mixture of the old and new
values could easily by used for the PCIe write cycle).

But it is also quite likely that that hardware checks the masked
bit before/after reading the address+data.

So masking the interrupt immediately before the update and/or
unmasking immediately after could also cause issues.

I'd suggest adding a readl() of the mask register after the disable
and moving the readl() of the data to before the enable.
The delays inherent in the PCIe reads should be enough to ensure
that the interrupt generation logic doesn't run until all the
writes are complete.

(I'm also going to have to look at our fpga to see if the
global enable/mask bits are actually exported by the pcie block.
The MSI-X logic definitely doesn't have them as inputs.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: PCIe cycle sequence when updating the msi-x table
  2023-04-07 16:17 PCIe cycle sequence when updating the msi-x table David Laight
@ 2023-04-07 19:55 ` Thomas Gleixner
  2023-04-07 22:06   ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2023-04-07 19:55 UTC (permalink / raw)
  To: David Laight, Jason Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	linux-pci@vger.kernel.org

On Fri, Apr 07 2023 at 16:17, David Laight wrote:
> The function that updates the MSI-X table currently reads:
>
> Now I'm not 100% sure about the cycle ordering rules.
> I've not got the spec here, and I recall it isn't that easy to
> understand.
> I can't remember whether reads are allowed to overtake writes
> (to different addresses).
> I do remember that reading back the same address was needed to
> flush the cpu store buffer on some space cpus.
> So it might be that the final readl() won't actually flush
> the write to the mask register.
> (The same readback of 'the wrong' address also happens elsewhere.)

It's guaranteed that all writes to a device have reached the device
before a read from the same device which comes after the writes is
issued. The write and read addresses do not matter.

> But there is a bigger problem.
> As the comment says writing the address/data while an entry is
> unmasked must be avoided (because a mixture of the old and new
> values could easily by used for the PCIe write cycle).
>
> But it is also quite likely that that hardware checks the masked
> bit before/after reading the address+data.
>
> So masking the interrupt immediately before the update and/or
> unmasking immediately after could also cause issues.

No it does not, because the writes are strictly ordered.

So the devices gets:

   1) write to control register with MASKBIT set
   2) write to LOWER_ADDRESS
   3) write to UPPER_ADDRESS
   4) write to ENTRY_DATA
   5) write to control register with MASKBIT cleared

#1 disables the vector and the device is not allowed to use the msg data
from the table entry until the mask bit is cleared again.

If the device gets that wrong then that's a bug in the device and not a
kernel problem.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: PCIe cycle sequence when updating the msi-x table
  2023-04-07 19:55 ` Thomas Gleixner
@ 2023-04-07 22:06   ` David Laight
  2023-04-10  6:50     ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2023-04-07 22:06 UTC (permalink / raw)
  To: 'Thomas Gleixner', Jason Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, linux-pci@vger.kernel.org

From: Thomas Gleixner
> Sent: 07 April 2023 20:56
..
> > But there is a bigger problem.
> > As the comment says writing the address/data while an entry is
> > unmasked must be avoided (because a mixture of the old and new
> > values could easily by used for the PCIe write cycle).
> >
> > But it is also quite likely that that hardware checks the masked
> > bit before/after reading the address+data.
> >
> > So masking the interrupt immediately before the update and/or
> > unmasking immediately after could also cause issues.
> 
> No it does not, because the writes are strictly ordered.
> 
> So the devices gets:
> 
>    1) write to control register with MASKBIT set
>    2) write to LOWER_ADDRESS
>    3) write to UPPER_ADDRESS
>    4) write to ENTRY_DATA
>    5) write to control register with MASKBIT cleared
> 
> #1 disables the vector and the device is not allowed to use the msg data
> from the table entry until the mask bit is cleared again.
> 
> If the device gets that wrong then that's a bug in the device and not a
> kernel problem.

Maybe, but the kernel isn't making it easy for a device
state-engine that has to do four separate reads of an
internal 32-bit memory area.
Adding a short delay between #4 and #5 is likely to avoid
some very hard to debug issues if the hardware reads the
values 'mask last', if it reads them 'mask first' you need
a short delay between #1 and #2.

Anything fpga based is likely to be using a 32bit memory
block for the MSI-X data (possibly even 16bit).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: PCIe cycle sequence when updating the msi-x table
  2023-04-07 22:06   ` David Laight
@ 2023-04-10  6:50     ` Thomas Gleixner
  2023-04-10 12:16       ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2023-04-10  6:50 UTC (permalink / raw)
  To: David Laight, Jason Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	linux-pci@vger.kernel.org

On Fri, Apr 07 2023 at 22:06, David Laight wrote:
> From: Thomas Gleixner
>> Sent: 07 April 2023 20:56
>> So the devices gets:
>> 
>>    1) write to control register with MASKBIT set
>>    2) write to LOWER_ADDRESS
>>    3) write to UPPER_ADDRESS
>>    4) write to ENTRY_DATA
>>    5) write to control register with MASKBIT cleared
>> 
>> #1 disables the vector and the device is not allowed to use the msg data
>> from the table entry until the mask bit is cleared again.
>> 
>> If the device gets that wrong then that's a bug in the device and not a
>> kernel problem.
>
> Maybe, but the kernel isn't making it easy for a device
> state-engine that has to do four separate reads of an
> internal 32-bit memory area.
> Adding a short delay between #4 and #5 is likely to avoid
> some very hard to debug issues if the hardware reads the
> values 'mask last', if it reads them 'mask first' you need
> a short delay between #1 and #2.

Whatever order this reads does not matter. The point is:

  "Mask Bit - When this bit is Set, the Function is prohibited from
   sending a message using this MSI-X Table entry."

So you cannot make this read order dependent at all.

> Anything fpga based is likely to be using a 32bit memory
> block for the MSI-X data (possibly even 16bit).

It's trivial enough to latch the message on unmask into a shadow
register set and let the state engine work from there.

And no, we are not adding random delays to that code just because.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: PCIe cycle sequence when updating the msi-x table
  2023-04-10  6:50     ` Thomas Gleixner
@ 2023-04-10 12:16       ` David Laight
  2023-04-10 17:25         ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2023-04-10 12:16 UTC (permalink / raw)
  To: 'Thomas Gleixner', Jason Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, linux-pci@vger.kernel.org

From: Thomas Gleixner
> Sent: 10 April 2023 07:50
...
> Whatever order this reads does not matter. The point is:
> 
>   "Mask Bit - When this bit is Set, the Function is prohibited from
>    sending a message using this MSI-X Table entry."
> 
> So you cannot make this read order dependent at all.
> 
> > Anything fpga based is likely to be using a 32bit memory
> > block for the MSI-X data (possibly even 16bit).
> 
> It's trivial enough to latch the message on unmask into a shadow
> register set and let the state engine work from there.

I could implement the memory block inside the MSI_X logic
and then detect writes.
Anything else uses a lot more fpga resource.

> And no, we are not adding random delays to that code just because.

I was mostly suggesting one be moved.
(Changing the address/data is hardly a hot path.)

Actually I'm trying to work out what the readbacks in the MSI-X
code are actually for (apart from adding a 'random delay').

If you are masking one of the PCI INTx lines, then a readback
delays the cpu until the physical line is de-asserted so that
the IRQ line into the interrupt controlled is de-asserted and
thus the IRQ to the cpu is also de-asserted.
Even if the cpu commits to the interrupt, the interrupt controller
reports 'no vector' (or vector 7 if it is the 8259).
So you can safely do a readback, enable interrupts and know
the interrupt won't happen.
(Although if you have a real 8259 you need to worry about
the cycle recovery time, even a 286 will break it!)

But MSI-X is different, it is very much edge sensitive.
The target MSI-X logic can decide to raise an interrupt
just before the mask bit it set.
It will then request the target's PCIe interface issue
the write TLP, this is probably a 'posted' write in the
internal logic. The PCIe interface could be busy generating
a long write TLP (or two) before actually issuing write for
the interrupt.
The readback will be pretty much back-to-back with the mask write
(as seen on the target).
So maybe the rad can get completed while the write is still
queued.
Now I can't remember whether the data TLP for reads is allowed
to overtake a posted write request, but it really doesn't matter.
Even if the IRQ write arrives at the root hub before the
read completion they then head off in different directions
through the 'cpu' logic - so the read could easily complete
well before the interrupt gets processed.

In fact an interrupt requested well before the mask bit is set
could be pending in the ioapic waiting for the cpu to enable
interrupts.

So I don't see any reason for those readbacks at all.

I've never looked at the cpu end of MSI-X, but I suspect that
stop using an interrupts you need to mask it there first,
then disable the hardware and later make sure you clear
any lurking pending request (probably before unmasking for
later use).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: PCIe cycle sequence when updating the msi-x table
  2023-04-10 12:16       ` David Laight
@ 2023-04-10 17:25         ` Thomas Gleixner
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2023-04-10 17:25 UTC (permalink / raw)
  To: David Laight, Jason Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	linux-pci@vger.kernel.org

On Mon, Apr 10 2023 at 12:16, David Laight wrote:
> From: Thomas Gleixner
>> Sent: 10 April 2023 07:50
>> It's trivial enough to latch the message on unmask into a shadow
>> register set and let the state engine work from there.
>
> I could implement the memory block inside the MSI_X logic
> and then detect writes.
> Anything else uses a lot more fpga resource.

Whatever it takes there are enough ways to solve that.

>> And no, we are not adding random delays to that code just because.
>
> I was mostly suggesting one be moved.
> (Changing the address/data is hardly a hot path.)

There is _no_ delay to be moved. The read back is not a delay. It's a
functional requirement.

> Actually I'm trying to work out what the readbacks in the MSI-X
> code are actually for (apart from adding a 'random delay').

It's absolutely not random. It's to ensure that _all_ writes are visible
and effective in the device before the function returns. 

> Now I can't remember whether the data TLP for reads is allowed
> to overtake a posted write request, but it really doesn't matter.
> Even if the IRQ write arrives at the root hub before the
> read completion they then head off in different directions
> through the 'cpu' logic - so the read could easily complete
> well before the interrupt gets processed.

The kernel handles the case that an already pending interrupt comes in
on the original vector and CPU. But it must be guaranteed that any
interrupt raised _after_ the mask/update/unmask sequence will arrive at
the new destination.

> In fact an interrupt requested well before the mask bit is set
> could be pending in the ioapic waiting for the cpu to enable
> interrupts.

MSI-X interrupts are never pending in the IOAPIC.

> So I don't see any reason for those readbacks at all.

But you want new read backs or move the existing one, right? Can you
make your mind up?

> I've never looked at the cpu end of MSI-X, but I suspect that
> stop using an interrupts you need to mask it there first,
> then disable the hardware and later make sure you clear
> any lurking pending request (probably before unmasking for
> later use).

You can suspect what you want. The kernel can handle this perfectly
correctly without any of this:

     Startup:
        Assign vector $M on CPU $A in msi_entry
        Unmask in msi_entry

     Interrupt affinity change request:
        New target CPU mask is stored in the interrupt descriptor and
        affinity change request is queued. That's a pure kernel internal
        software operation and does not touch the device.

     Interrupt raised on CPU $A vector $M:
        Handle interrupt
        Mask in msi_entry
        Assign vector $N on CPU $B
        Unmask in msi_entry
        EOI

     Interrupt raised on CPU $B vector $N: 
        Handle interrupt
        Send IPI to CPU$A to cleanup vector $M
        EOI

Affinity for non-remapped MSI/MSI-X is always changed in context of an
interrupt being handled on the original target CPU/vector.

The whole mechanism relies on standard conform devices simply because
it's impossible to handle the following scenario:

     CPU A           Device                    CPU B
     write(entry->address_lo);
     (Changes the destination to CPU B)

                     Writes entry->data to entry->address_lo

                                               Gets a spurious interrupt
                                               on the original vector $N

As a consequence the interrupt would be lost and in the worst case the
device is not longer functional. CPU A cannot access the local APIC IRR
of CPU B to test for this. Even if it could, it would only observe it
when CPU B cannot handle it at that moment.

And no, we are not going to install a magic "maybe random PCIe device
misbehaves handler" on CPU B for doing so. We simply can't and it's not
necessary at all.

There is a "work around" for non-maskable MSI. See msi_set_affinity()
for the gory details. But also that requires that the writes are visible
in the device when the write_msg() function returns.

And no, we are not extending this to MSI-X simply because contrary to
MSI MSI-X is well defined. I'm dealing with MSI-X for 15+ years now and
there hasn't been a single device which violated the standard in that
regard. At least to my knowledge.

Please fix your hardware/firmware to be standard compliant. The kernel
is not there to proliferate sloppy hardware designs. We simply are not
opening that can of worms. Aside of that a device must be compliant to
work independent of the OS and other OSes are neither interested in
horrible hacks just to let hardware people do what they want. Standards
are there for a reason.

Just for the record. entry->address_hi is irrelvant in the above
scenario because address_hi is only in use when interrupt remapping is
enabled (at least on x86). If interrupt remapping is enabled, none of
the above matters as the affinity change happens at the remapping level
and the device MSI[-X] entries never change after setup. The remapping
unit has none of those problems at all.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-04-10 17:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-07 16:17 PCIe cycle sequence when updating the msi-x table David Laight
2023-04-07 19:55 ` Thomas Gleixner
2023-04-07 22:06   ` David Laight
2023-04-10  6:50     ` Thomas Gleixner
2023-04-10 12:16       ` David Laight
2023-04-10 17:25         ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox