From: "Neronin, Niklas" <niklas.neronin@linux.intel.com>
To: Michal Pecio <michal.pecio@gmail.com>
Cc: mathias.nyman@linux.intel.com, linux-usb@vger.kernel.org
Subject: Re: [PATCH 1/2] usb: xhci: fix Command Aborting
Date: Mon, 23 Mar 2026 12:25:00 +0200 [thread overview]
Message-ID: <56606a55-b1cb-4669-9025-8a91d8d258bf@linux.intel.com> (raw)
In-Reply-To: <20260321143057.1bf31b1b.michal.pecio@gmail.com>
On 21/03/2026 15.30, Michal Pecio wrote:
> On Mon, 16 Mar 2026 15:27:19 +0100, Niklas Neronin wrote:
>> Aborting the command ring requires setting the Command Abort (CA) bit
>> in the 64-bit Command Ring Control Register (CRCR). Historically,
>> several attempts have been made to implement this correctly, but each
>> introduced its own problems. This patch fixes the final remaining
>> issue: accidental modification of the Command Ring Pointer (i'll
>> abbreviate it to CRP).
>>
>> Originally [1], the full 64-bit CRCR value was read and written back
>> after setting CA. This is a bit unnecessary, only RsvdP bits (5:4)
>> should be read and written back (for future-proofing). All other
>> writable fields read as zero.
>>
>> Later patches attempted to solve an issues, caused by 64-bit writes
>> being split into two 32-bit writes. Writing the lower 31:0 bits first
>> immediately stopped the ring (CRR=0), and the following upper-half
>> write then overwrote part of CRP with zeroes, thus corrupting the CRP.
>> Patch [2] avoided this by writing only the lower 31:0 bits with CA
>> set, but that broke controllers that latch 64-bit registers only when
>> the upper bits are written, as reported in [3].
>
> I too have HW which waits for the high DWORD after low DWORD write.
>
> And I have HW which doesn't. If I write 0xdeadbeef to the high DWORD
> after waiting for CRR=0, some HW will ignore the write and some will
> IOMMU fault at 0xdeadbeefsomething.
>
> The abort itself takes a few microseconds in my tests.
Yes, abort completion itself is usually very fast. The 5 second timeout
comes directly from the xHCI 1.9 specification, which explicitly allows
for that duration.
4.6.1.2 Aborting a Command:
"If software doesn't see CRR negated in a timely manner (e.g. longer
than 5 seconds), then it should assume that there are larger problems
with the xHC and assert HCRST"
The timeout could probably be reduced to 1 second, but I did not do that
since I do not have evidence that a shorter timeout would be safe across
all hardware.
> Is this race causing problems in the real world?
I believe so, given that a fix was proposed specifically to address it:
https://lore.kernel.org/all/20211008092547.3996295-5-mathias.nyman@linux.intel.com/#t
>
>> The current solution [4] attempted to fix this by writing the full
>> 64-bit CRCR with both CA and an updated CRP. This does not work. The
>> patch tries to modify CRP while setting CA, but with CRR=1 all writes
>> to CRP are ignored. Updating CRP requires writing only the CA bit,
>> waiting for the controller to process the abort and clear CRR, and
>> only then writing the full CRP value.
>
> That's not a problem yet because we *don't* want to change anything.
The current implementation does attempt to change the CRP.
The sequence is as follows:
1. Fetch the next TRB.
2. If the TRB is a Link TRB, advance to the next TRB.
3. Get the DMA address of that TRB.
4. Write the lower DWORD with the new CRP address and the CA bit set.
5. Write the upper DWORD with the new CRP address.
This results in two possible final states, depending on how quickly
CRR is cleared:
1. The CRP is not updated at all.
2. Only the upper DWORD of the CRP is updated.
In neither case is the CRP updated as intended by the current
implementation.
>
>> Writing a new CRP after CA clears CRR is also unsafe:
>> * TRBs are 16-byte aligned (bits 3:0 clear)
>> * CRP requires 64-byte alignment (bits 5:0 clear)
>> Writing a TRB pointer into CRP truncates bits 5:4
>
> But this is a problem and I found HW (NEC, AMD) where waiting for
> CRR=0 and writing correct high DWORD causes "mismatched completion"
> errors, which is almost certainly (I haven't debugged) this thing.
So you believe that when waiting for CRR=0 any software writes to CRCR,
which are not immediately acted on, i.e. Command Stop/Abort, are instead
buffered internally?
OR
Do you believe that the HW will update CRP, irrelevant of what CRR is?
If that interpretation is correct, it would mean that Command Stop and
Command Abort cannot be used without also updating the RCS and CRP.
>
>> For a Command Abort to succeed, the CA bit must be set without
>> modifying the CRP. The following sequence ensures this:
>>
>> 1. Write the lower 31:0 bits with only the CA bit set. Since CRR=1,
>> CRP write is ignored.
>>
>> 2. Poll CRR. If CRR becomes 0, the abort succeeded with CRP
>> preserved.
>>
>> 3. If CRR does not clear (timeout), test if controller requires an
>> upper bits write to latch the register. Write the upper 63:32
>> bits (which does not update the CRP because CRR=1).
>
> Unless it just flipped to zero and CRP becomes corrupted as usual.
Yes, that race still exists, but it is significantly less likely.
Thanks,
Niklas
>
>> Then poll CRR again. If CRR becomes 0, it was a latching issue
>> and the abort succeeded with CRP preserved.
>
> Another problem is that this slows down the double-write chips and
> polls them under spinlock, which is completely crazy because it blocks
> xhci_irq() and other IRQs and generally freezes the whole system.
>
> Not sure if there is a way out of this madness without writeq(), quirks
> or a completely out-of-the-box solution, like making it possible to
> restore CRP. Truncation can be dealt with by ensuring that a few TRBs
> before Dequeue are unused and can be made No-Ops at any time.
>
> Regards,
> Michal
>
next prev parent reply other threads:[~2026-03-23 10:25 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-16 14:27 [PATCH 0/2] usb: xhci: correct Command Abort bit handling Niklas Neronin
2026-03-16 14:27 ` [PATCH 1/2] usb: xhci: fix Command Aborting Niklas Neronin
2026-03-21 13:30 ` Michal Pecio
2026-03-23 10:25 ` Neronin, Niklas [this message]
2026-03-23 11:24 ` Michal Pecio
2026-03-23 14:00 ` Neronin, Niklas
2026-03-23 21:40 ` Michal Pecio
2026-03-16 14:27 ` [PATCH 2/2] usb: xhci: use writeq() for CA write on 64-bit architectures Niklas Neronin
2026-03-22 11:36 ` Michal Pecio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56606a55-b1cb-4669-9025-8a91d8d258bf@linux.intel.com \
--to=niklas.neronin@linux.intel.com \
--cc=linux-usb@vger.kernel.org \
--cc=mathias.nyman@linux.intel.com \
--cc=michal.pecio@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox