All of lore.kernel.org
 help / color / mirror / Atom feed
From: Baolu Lu <baolu.lu@linux.intel.com>
To: "Tian, Kevin" <kevin.tian@intel.com>,
	Dmytro Maluka <dmaluka@chromium.org>,
	Jason Gunthorpe <jgg@ziepe.ca>
Cc: David Woodhouse <dwmw2@infradead.org>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
	Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
	Robin Murphy <robin.murphy@arm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Vineeth Pillai (Google)" <vineeth@bitbyteword.org>,
	Aashish Sharma <aashish@aashishsharma.net>,
	Grzegorz Jaszczyk <jaszczyk@chromium.org>,
	"Dong, Chuanxiao" <chuanxiao.dong@intel.com>
Subject: Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
Date: Fri, 9 Jan 2026 14:32:12 +0800	[thread overview]
Message-ID: <2f9e3589-d488-4ffb-9ae9-0d69ac77a8fb@linux.intel.com> (raw)
In-Reply-To: <BN9PR11MB52764C6A9F1774DDAFED41258C85A@BN9PR11MB5276.namprd11.prod.outlook.com>

On 1/8/26 10:09, Tian, Kevin wrote:
>> From: Dmytro Maluka <dmaluka@chromium.org>
>> Sent: Tuesday, January 6, 2026 11:50 PM
>>
>> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
>>> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
>>>> Regarding flushing caches right after that - what for? (BTW the Intel
>>>> driver doesn't do that either.) If we don't do that and as a result the
>>>> HW is using an old entry cached before we cleared the present bit, it
>>>> is not affected by our later modifications anyway.
>>>
>>> You don't know what state the HW fetcher is in. This kind of race is possible:
>>>
>>>       CPU                 FETCHER
>>>                          read present = 1
>>>      present = 0
>>>      mangle qword 1
>>>                          read qword 1
>>>                          < fail - HW sees a corrupted entry >
>>>
>>> The flush is not just a flush but a barrier to synchronize with the HW
>>> that it is done all fetches that may have been dependent on seeing
>>> present = 1.
>>>
>>> So missing a flush after clearing present is possibly a bug today - I
>>> don't remember what guarenteed the atomic size is for Intel IOMMU
>>> though, if the atomic size is the whole entry it is OK since there is
>>> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>>
>> Indeed, may be a bug... In the VT-d spec I don't immediately see a
>> guarantee that context and PASID entries are fetched atomically. (And
>> for PASID entries, which are 512 bits, that seems particularly
>> unlikely.)
>>
> 
> 512bits atomicity is possible, but not on the PASID entry.
> 
> VT-d spec, head of section 9 (Translation Structure Formats):
> 
> "
> This chapter describes the memory-resident structures for DMA and
> interrupt remapping. Hardware must access structure entries that
> are 64-bit or 128-bit atomically. Hardware must update a 512-bit
> Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
> Other than the Posted Interrupt Descriptor (PID), hardware is allowed
> to break access to larger than 128-bit entries into multiple aligned
> 128-bit accesses.
> "
> 
> root entry, scalable root entry, context entry and IRTE are 128bits
> so they are OK.
> 
> scalable context entry are 256bits but only the lower 128bits are
> defined so it's OK for now.
> 
> scalable PASID directory entry is 64bits. ok.
> 
> posted interrupt descriptor is 512bits with atomicity guaranteed.
> 
> but we do have problem on scalable pasid entry which is 512bits.
>    - bits beyond 191 are for future hardware, not a problem now
>    - bits 128-191 are for 1st-stage
>    - bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
> 
> so in theory 1st-stage and nesting are affected by this bug.

Yes. This is a real software bug. The hardware is legally allowed to
tear the pasid table entry read into 4 128-bit chunks. For first-stage
(first-only or nested) translation, chunk 1 (bit 0-127) and chunk 2 (bit
128-191) both contain active configuration data, hardware could possibly
read a entry composed of half-old and half-new data.

The VT-d spec defines software guide for pasid table entry manipulation
in section 6.5.3.3 (Guidance to Software for Invalidations), I think the
Linux driver doesn't handle the handshake between CPU and IOMMU hardware
in the right way.

The correct way should be a clear-flush-update sequence, that is, when
tearing down a present pasid table entry, the software should

1. Clear the Present bit;
2. Invalidate the cache according to section 6.5.3.3;
[now CPU owns the pasid table entry]
3. Update other fields;
4. Set the Present bit;
[now the VT-d hardware owns the pasid table entry].

> 
> In reality:
>    - iommu driver shouldn't receive an attach request on an in-use pasid
>      entry, so the cache should remain cleared (either at initial state or
>      flushed by previous teardown) then hw won't use a partial 1st-stage
>      config after seeing the entry as non-present.

Yes. But the current tear down process is buggy as described above.

> 
>    - replace is already broken, as the entry should not be cleared in the
>      1st place then this bug will be fixed when replace is reworked.

We still have a long way to go before achieving the real replace since
the hardware doesn't guarantee 512-bit atomicity, "hitless" is very
difficult.

> 
> If no oversight (Baolu?), probably we don't need to fix it strictly following
> Jason's pseudo logic at this point. Instead, just rename pasid_clear_entry()
> to pasid_clear_entry_no_flush() for now (with some comment to clarify
> the expectation), and rework the replace path in parallel.
> We may never require a pasid_clear_entry_flush_cache() once hitless> replace is in place. 😊

Perhaps we can first add pasid_entry_clear_present() to fulfill the
clear-flush-update handshake between the software and the IOMMU hardware
while reworking the PASID replace path?

Thanks,
baolu

  reply	other threads:[~2026-01-09  6:32 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 1/5] iommu/vt-d: Sanitize set bits in pasid_set_bits() Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 2/5] iommu/vt-d: Generalize pasid_set_bits() Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 3/5] iommu/vt-d: Ensure memory ordering in context entry updates Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 4/5] iommu/vt-d: Use smp_wmb() before setting context/pasid present bit Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 5/5] iommu/vt-d: Use WRITE_ONCE for setting root table entries Dmytro Maluka
2026-01-05 18:12 ` [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Jason Gunthorpe
2026-01-05 18:54   ` Dmytro Maluka
2026-01-05 19:14     ` Jason Gunthorpe
2026-01-05 20:05       ` Dmytro Maluka
2026-01-06  0:14         ` Jason Gunthorpe
2026-01-06  7:48           ` Tian, Kevin
2026-01-06 14:40             ` Dmytro Maluka
2026-01-08  2:22               ` Tian, Kevin
2026-01-06 13:51           ` Dmytro Maluka
2026-01-06 14:23             ` Jason Gunthorpe
2026-01-06 15:50               ` Dmytro Maluka
2026-01-06 16:45                 ` Jason Gunthorpe
2026-01-06 17:14                   ` Dmytro Maluka
2026-01-08  2:09                 ` Tian, Kevin
2026-01-09  6:32                   ` Baolu Lu [this message]
     [not found]                 ` <BN9PR11MB5276FB0F465DBFB4EE4D742B8C85A@BN9PR11MB5276.namprd11.prod.outlook.com>
2026-01-08  7:00                   ` Tian, Kevin
2026-01-06  3:37   ` Baolu Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2f9e3589-d488-4ffb-9ae9-0d69ac77a8fb@linux.intel.com \
    --to=baolu.lu@linux.intel.com \
    --cc=aashish@aashishsharma.net \
    --cc=chuanxiao.dong@intel.com \
    --cc=dmaluka@chromium.org \
    --cc=dwmw2@infradead.org \
    --cc=iommu@lists.linux.dev \
    --cc=jaszczyk@chromium.org \
    --cc=jgg@ziepe.ca \
    --cc=joro@8bytes.org \
    --cc=kevin.tian@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robin.murphy@arm.com \
    --cc=vineeth@bitbyteword.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.