Re: [PATCH 1/2] usb: xhci: fix Command Aborting

public inbox for linux-usb@vger.kernel.org
 help / color / mirror / Atom feed

From: Michal Pecio <michal.pecio@gmail.com>
To: "Neronin, Niklas" <niklas.neronin@linux.intel.com>
Cc: mathias.nyman@linux.intel.com, linux-usb@vger.kernel.org,
	Pavankumar Kondeti <quic_pkondeti@quicinc.com>,
	Pavankumar Kondeti <pkondeti@codeaurora.org>
Subject: Re: [PATCH 1/2] usb: xhci: fix Command Aborting
Date: Mon, 23 Mar 2026 12:24:58 +0100	[thread overview]
Message-ID: <20260323122458.0af6b4d0.michal.pecio@gmail.com> (raw)
In-Reply-To: <56606a55-b1cb-4669-9025-8a91d8d258bf@linux.intel.com>

On Mon, 23 Mar 2026 12:25:00 +0200, Neronin, Niklas wrote:
> On 21/03/2026 15.30, Michal Pecio wrote:
> > On Mon, 16 Mar 2026 15:27:19 +0100, Niklas Neronin wrote:  
> >> Aborting the command ring requires setting the Command Abort (CA)
> >> bit in the 64-bit Command Ring Control Register (CRCR).
> >> Historically, several attempts have been made to implement this
> >> correctly, but each introduced its own problems. This patch fixes
> >> the final remaining issue: accidental modification of the Command
> >> Ring Pointer (i'll abbreviate it to CRP).
> >>
> >> Originally [1], the full 64-bit CRCR value was read and written
> >> back after setting CA. This is a bit unnecessary, only RsvdP bits
> >> (5:4) should be read and written back (for future-proofing). All
> >> other writable fields read as zero.
> >>
> >> Later patches attempted to solve an issues, caused by 64-bit writes
> >> being split into two 32-bit writes. Writing the lower 31:0 bits
> >> first immediately stopped the ring (CRR=0), and the following
> >> upper-half write then overwrote part of CRP with zeroes, thus
> >> corrupting the CRP. Patch [2] avoided this by writing only the
> >> lower 31:0 bits with CA set, but that broke controllers that latch
> >> 64-bit registers only when the upper bits are written, as reported
> >> in [3].  
> > 
> > I too have HW which waits for the high DWORD after low DWORD write.
> > 
> > And I have HW which doesn't. If I write 0xdeadbeef to the high DWORD
> > after waiting for CRR=0, some HW will ignore the write and some will
> > IOMMU fault at 0xdeadbeefsomething.
> > 
> > The abort itself takes a few microseconds in my tests.  
> 
> Yes, abort completion itself is usually very fast. The 5 second
> timeout comes directly from the xHCI 1.9 specification, which
> explicitly allows for that duration.
> 
> 4.6.1.2 Aborting a Command:
> "If software doesn't see CRR negated in a timely manner (e.g. longer
>  than 5 seconds), then it should assume that there are larger problems
>  with the xHC and assert HCRST"
> 
> The timeout could probably be reduced to 1 second, but I did not do
> that since I do not have evidence that a shorter timeout would be
> safe across all hardware.

As noted in my later response to patch 2/2, in rare corner cases
ASMedia command abort can take a few seconds and still succeed.

> > Is this race causing problems in the real world?  
> 
> I believe so, given that a fix was proposed specifically to address it:
> https://lore.kernel.org/all/20211008092547.3996295-5-mathias.nyman@linux.intel.com/#t

It does look like a problem which people would be unlikely to discover
without being affected by it, but the patch says nothing.

Hi Pavankumar,

Could you elaborate on the above issue, what was the affected host
controller and what sort of workloads made this race turn bad?

In theory the two DWORD writes could be separated by some NMI/SMI
hitting the CPU or by heavy traffic on the I/O bus.

And if probelms are known to happen here, then what about other users
of xhci_write_64(), for example updating ERDP aka erst_dequeue?

> The current implementation does attempt to change the CRP.
> The sequence is as follows:
>  1. Fetch the next TRB.
>  2. If the TRB is a Link TRB, advance to the next TRB.
>  3. Get the DMA address of that TRB.
>  4. Write the lower DWORD with the new CRP address and the CA bit set.
>  5. Write the upper DWORD with the new CRP address.
> 
> This results in two possible final states, depending on how quickly
> CRR is cleared:
>  1. The CRP is not updated at all.
>  2. Only the upper DWORD of the CRP is updated.

I stand corrected, I wasn't aware of some details. That said,

1. is a valid outcome and not a bug, jumping to the next TRB is not
   necessary, it was a workaround for previous issues.
2. is a bug, but it shouldn't matter because the ring has one segment

But I did have problems on NEC and AMD, which means that something else
is also possible. I think I will need to debug it deeper, but I suspect
that writing high DWORD alone is enough to trigger bit 4:5 truncation
or overwrite low DWORD with some garbage.

> So you believe that when waiting for CRR=0 any software writes to
> CRCR, which are not immediately acted on, i.e. Command Stop/Abort,
> are instead buffered internally?

We all believe that some HW does nothing and waits for high DWORD.

Other HW is seen to begin aborting more or less immediately. However,
until the abort is complete, any subsequent write to high DWORD may
be ignored because we can't change CRP while CRR=1.

Problems begin if CRR=0 before we write high DWORD. CRP bits become
writeable and any write may trigger truncation of bits 4:5.

Note that if the HW interprets this as a single write to high DWORD
separate from the previous abort operation (which it has already
completed) then this is out of spec, because we should write low-high.

It seems that my Etron HC ignores this high DWORD write after CRR=0.
I can write 0xdeadbeef there and it continues working normally.
Or maybe it's broken and should get the disable 64 bits quirk.

But NEC and AMD start getting "mismatched completion" errors, so it
seems that their CRP is corrupted somehow. And I could probably replace
waiting for CRR=0 with usleep(20) and get the same corruption.

I will later look closer what happens here.

Regards,
Michal

next prev parent reply	other threads:[~2026-03-23 11:25 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-16 14:27 [PATCH 0/2] usb: xhci: correct Command Abort bit handling Niklas Neronin
2026-03-16 14:27 ` [PATCH 1/2] usb: xhci: fix Command Aborting Niklas Neronin
2026-03-21 13:30   ` Michal Pecio
2026-03-23 10:25     ` Neronin, Niklas
2026-03-23 11:24       ` Michal Pecio [this message]
2026-03-23 14:00         ` Neronin, Niklas
2026-03-23 21:40           ` Michal Pecio
2026-03-16 14:27 ` [PATCH 2/2] usb: xhci: use writeq() for CA write on 64-bit architectures Niklas Neronin
2026-03-22 11:36   ` Michal Pecio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260323122458.0af6b4d0.michal.pecio@gmail.com \
    --to=michal.pecio@gmail.com \
    --cc=linux-usb@vger.kernel.org \
    --cc=mathias.nyman@linux.intel.com \
    --cc=niklas.neronin@linux.intel.com \
    --cc=pkondeti@codeaurora.org \
    --cc=quic_pkondeti@quicinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox