* [PATCH v18 4/8] rust: page: convert to `Ownable`
From: Andreas Hindborg @ 2026-06-25 10:15 UTC (permalink / raw)
To: Danilo Krummrich, Lorenzo Stoakes, Vlastimil Babka,
Liam R. Howlett, Uladzislau Rezki, Miguel Ojeda, Boqun Feng,
Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
Trevor Gross, Daniel Almeida, Tamir Duberstein, Alexandre Courbot,
Onur Özkan, Lyude Paul, Greg Kroah-Hartman,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas, Rafael J. Wysocki, Dave Ertman, Ira Weiny,
Leon Romanovsky, Paul Moore, Serge Hallyn, David Airlie,
Simona Vetter, Alexander Viro, Jan Kara, Igor Korotin,
Viresh Kumar, Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
Krzysztof Wilczyński, Pavel Tikhomirov, Michal Wilczynski
Cc: Andreas Hindborg, Philipp Stanner, rust-for-linux, linux-kernel,
linux-mm, driver-core, linux-block, linux-security-module,
dri-devel, linux-fsdevel, linux-pm, linux-pci, linux-pwm,
Asahi Lina
In-Reply-To: <20260625-unique-ref-v18-0-4e06b5896d47@kernel.org>
From: Asahi Lina <lina@asahilina.net>
This allows Page references to be returned as borrowed references,
without necessarily owning the struct page.
Remove `BorrowedPage` and update users to use `Owned<Page>`.
Signed-off-by: Asahi Lina <lina@asahilina.net>
[ Andreas: Fix formatting and add a safety comment, update users. ]
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
---
drivers/android/binder/page_range.rs | 10 +--
rust/kernel/alloc/allocator.rs | 19 +++---
rust/kernel/alloc/allocator/iter.rs | 6 +-
rust/kernel/page.rs | 122 +++++++++--------------------------
4 files changed, 46 insertions(+), 111 deletions(-)
diff --git a/drivers/android/binder/page_range.rs b/drivers/android/binder/page_range.rs
index e54a90e62402a..7941eb85b4ef4 100644
--- a/drivers/android/binder/page_range.rs
+++ b/drivers/android/binder/page_range.rs
@@ -33,7 +33,7 @@
sync::{aref::ARef, Mutex, SpinLock},
task::Pid,
transmute::FromBytes,
- types::Opaque,
+ types::{Opaque, Owned},
uaccess::UserSliceReader,
};
@@ -198,7 +198,7 @@ unsafe impl Send for Inner {}
#[repr(C)]
struct PageInfo {
lru: bindings::list_head,
- page: Option<Page>,
+ page: Option<Owned<Page>>,
range: *const ShrinkablePageRange,
}
@@ -206,7 +206,7 @@ impl PageInfo {
/// # Safety
///
/// The caller ensures that writing to `me.page` is ok, and that the page is not currently set.
- unsafe fn set_page(me: *mut PageInfo, page: Page) {
+ unsafe fn set_page(me: *mut PageInfo, page: Owned<Page>) {
// SAFETY: This pointer offset is in bounds.
let ptr = unsafe { &raw mut (*me).page };
@@ -229,13 +229,13 @@ unsafe fn get_page<'a>(me: *const PageInfo) -> Option<&'a Page> {
let ptr = unsafe { &raw const (*me).page };
// SAFETY: The pointer is valid for reading.
- unsafe { (*ptr).as_ref() }
+ unsafe { (*ptr).as_deref() }
}
/// # Safety
///
/// The caller ensures that writing to `me.page` is ok for the duration of 'a.
- unsafe fn take_page(me: *mut PageInfo) -> Option<Page> {
+ unsafe fn take_page(me: *mut PageInfo) -> Option<Owned<Page>> {
// SAFETY: This pointer offset is in bounds.
let ptr = unsafe { &raw mut (*me).page };
diff --git a/rust/kernel/alloc/allocator.rs b/rust/kernel/alloc/allocator.rs
index cd4203f27aed0..c7b9b069cf75d 100644
--- a/rust/kernel/alloc/allocator.rs
+++ b/rust/kernel/alloc/allocator.rs
@@ -169,7 +169,7 @@ unsafe fn realloc(
}
impl Vmalloc {
- /// Convert a pointer to a [`Vmalloc`] allocation to a [`page::BorrowedPage`].
+ /// Convert a pointer to a [`Vmalloc`] allocation to a [`Page`](page::Page) reference.
///
/// # Examples
///
@@ -202,20 +202,17 @@ impl Vmalloc {
///
/// - `ptr` must be a valid pointer to a [`Vmalloc`] allocation.
/// - `ptr` must remain valid for the entire duration of `'a`.
- pub unsafe fn to_page<'a>(ptr: NonNull<u8>) -> page::BorrowedPage<'a> {
+ pub unsafe fn to_page<'a>(ptr: NonNull<u8>) -> &'a page::Page {
// SAFETY: `ptr` is a valid pointer to `Vmalloc` memory.
let page = unsafe { bindings::vmalloc_to_page(ptr.as_ptr().cast()) };
- // SAFETY: `vmalloc_to_page` returns a valid pointer to a `struct page` for a valid pointer
- // to `Vmalloc` memory.
- let page = unsafe { NonNull::new_unchecked(page) };
-
// SAFETY:
- // - `page` is a valid pointer to a `struct page`, given that by the safety requirements of
- // this function `ptr` is a valid pointer to a `Vmalloc` allocation.
- // - By the safety requirements of this function `ptr` is valid for the entire lifetime of
- // `'a`.
- unsafe { page::BorrowedPage::from_raw(page) }
+ // - `vmalloc_to_page` returns a valid, non-null pointer to a `struct page` for a valid
+ // pointer to `Vmalloc` memory, given that by the safety requirements of this function
+ // `ptr` is a valid pointer to a `Vmalloc` allocation.
+ // - By the safety requirements of this function `ptr`, and hence the `struct page`, is
+ // valid for the entire lifetime of `'a`.
+ unsafe { &*page.cast() }
}
}
diff --git a/rust/kernel/alloc/allocator/iter.rs b/rust/kernel/alloc/allocator/iter.rs
index 02fda3ea5cae6..8dcc16ed89893 100644
--- a/rust/kernel/alloc/allocator/iter.rs
+++ b/rust/kernel/alloc/allocator/iter.rs
@@ -9,7 +9,7 @@
ptr::NonNull, //
};
-/// An [`Iterator`] of [`page::BorrowedPage`] items owned by a [`Vmalloc`] allocation.
+/// An [`Iterator`] of [`Page`](page::Page) references owned by a [`Vmalloc`] allocation.
///
/// # Guarantees
///
@@ -28,11 +28,11 @@ pub struct VmallocPageIter<'a> {
size: usize,
/// The current page index of the [`Iterator`].
index: usize,
- _p: PhantomData<page::BorrowedPage<'a>>,
+ _p: PhantomData<&'a page::Page>,
}
impl<'a> Iterator for VmallocPageIter<'a> {
- type Item = page::BorrowedPage<'a>;
+ type Item = &'a page::Page;
fn next(&mut self) -> Option<Self::Item> {
let offset = self.index.checked_mul(page::PAGE_SIZE)?;
diff --git a/rust/kernel/page.rs b/rust/kernel/page.rs
index 8affd8262891b..6dc1c2395acaf 100644
--- a/rust/kernel/page.rs
+++ b/rust/kernel/page.rs
@@ -12,16 +12,16 @@
code::*,
Result, //
},
+ types::{
+ Opaque,
+ Ownable,
+ Owned, //
+ },
uaccess::UserSliceReader, //
};
-use core::{
- marker::PhantomData,
- mem::ManuallyDrop,
- ops::Deref,
- ptr::{
- self,
- NonNull, //
- }, //
+use core::ptr::{
+ self,
+ NonNull, //
};
/// A bitwise shift for the page size.
@@ -65,93 +65,29 @@ pub const fn page_align(addr: usize) -> Option<usize> {
Some(sum & PAGE_MASK)
}
-/// Representation of a non-owning reference to a [`Page`].
-///
-/// This type provides a borrowed version of a [`Page`] that is owned by some other entity, e.g. a
-/// [`Vmalloc`] allocation such as [`VBox`].
-///
-/// # Example
-///
-/// ```
-/// # use kernel::{bindings, prelude::*};
-/// use kernel::page::{BorrowedPage, Page, PAGE_SIZE};
-/// # use core::{mem::MaybeUninit, ptr, ptr::NonNull };
-///
-/// fn borrow_page<'a>(vbox: &'a mut VBox<MaybeUninit<[u8; PAGE_SIZE]>>) -> BorrowedPage<'a> {
-/// let ptr = ptr::from_ref(&**vbox);
-///
-/// // SAFETY: `ptr` is a valid pointer to `Vmalloc` memory.
-/// let page = unsafe { bindings::vmalloc_to_page(ptr.cast()) };
-///
-/// // SAFETY: `vmalloc_to_page` returns a valid pointer to a `struct page` for a valid
-/// // pointer to `Vmalloc` memory.
-/// let page = unsafe { NonNull::new_unchecked(page) };
-///
-/// // SAFETY:
-/// // - `self.0` is a valid pointer to a `struct page`.
-/// // - `self.0` is valid for the entire lifetime of `self`.
-/// unsafe { BorrowedPage::from_raw(page) }
-/// }
-///
-/// let mut vbox = VBox::<[u8; PAGE_SIZE]>::new_uninit(GFP_KERNEL)?;
-/// let page = borrow_page(&mut vbox);
-///
-/// // SAFETY: There is no concurrent read or write to this page.
-/// unsafe { page.fill_zero_raw(0, PAGE_SIZE)? };
-/// # Ok::<(), Error>(())
-/// ```
-///
-/// # Invariants
-///
-/// The borrowed underlying pointer to a `struct page` is valid for the entire lifetime `'a`.
-///
-/// [`VBox`]: kernel::alloc::VBox
-/// [`Vmalloc`]: kernel::alloc::allocator::Vmalloc
-pub struct BorrowedPage<'a>(ManuallyDrop<Page>, PhantomData<&'a Page>);
-
-impl<'a> BorrowedPage<'a> {
- /// Constructs a [`BorrowedPage`] from a raw pointer to a `struct page`.
- ///
- /// # Safety
- ///
- /// - `ptr` must point to a valid `bindings::page`.
- /// - `ptr` must remain valid for the entire lifetime `'a`.
- pub unsafe fn from_raw(ptr: NonNull<bindings::page>) -> Self {
- let page = Page { page: ptr };
-
- // INVARIANT: The safety requirements guarantee that `ptr` is valid for the entire lifetime
- // `'a`.
- Self(ManuallyDrop::new(page), PhantomData)
- }
-}
-
-impl<'a> Deref for BorrowedPage<'a> {
- type Target = Page;
-
- fn deref(&self) -> &Self::Target {
- &self.0
- }
-}
-
-/// Trait to be implemented by types which provide an [`Iterator`] implementation of
-/// [`BorrowedPage`] items, such as [`VmallocPageIter`](kernel::alloc::allocator::VmallocPageIter).
+/// Trait to be implemented by types which provide an [`Iterator`] of [`Page`] references, such as
+/// [`VmallocPageIter`](kernel::alloc::allocator::VmallocPageIter).
pub trait AsPageIter {
/// The [`Iterator`] type, e.g. [`VmallocPageIter`](kernel::alloc::allocator::VmallocPageIter).
- type Iter<'a>: Iterator<Item = BorrowedPage<'a>>
+ type Iter<'a>: Iterator<Item = &'a Page>
where
Self: 'a;
- /// Returns an [`Iterator`] of [`BorrowedPage`] items over all pages owned by `self`.
+ /// Returns an [`Iterator`] of [`Page`] references over all pages owned by `self`.
fn page_iter(&mut self) -> Self::Iter<'_>;
}
-/// A pointer to a page that owns the page allocation.
+/// A `struct page`.
+///
+/// A `Page` is accessed through a shared reference or through an owning [`Owned<Page>`]; the latter
+/// frees the page allocation when it is dropped.
///
/// # Invariants
///
-/// The pointer is valid, and has ownership over the page.
+/// The `Page` is backed by a valid `struct page`.
+#[repr(transparent)]
pub struct Page {
- page: NonNull<bindings::page>,
+ page: Opaque<bindings::page>,
}
// SAFETY: Pages have no logic that relies on them staying on a given thread, so moving them across
@@ -185,19 +121,20 @@ impl Page {
/// # Ok::<(), kernel::alloc::AllocError>(())
/// ```
#[inline]
- pub fn alloc_page(flags: Flags) -> Result<Self, AllocError> {
+ pub fn alloc_page(flags: Flags) -> Result<Owned<Self>, AllocError> {
// SAFETY: Depending on the value of `gfp_flags`, this call may sleep. Other than that, it
// is always safe to call this method.
let page = unsafe { bindings::alloc_pages(flags.as_raw(), 0) };
let page = NonNull::new(page).ok_or(AllocError)?;
- // INVARIANT: We just successfully allocated a page, so we now have ownership of the newly
- // allocated page. We transfer that ownership to the new `Page` object.
- Ok(Self { page })
+ // SAFETY: We just successfully allocated a page, so we now have ownership of the newly
+ // allocated page. We transfer that ownership to the new `Owned<Page>` object.
+ // Since `Page` is transparent, we can cast the pointer directly.
+ Ok(unsafe { Owned::from_raw(page.cast()) })
}
/// Returns a raw pointer to the page.
pub fn as_ptr(&self) -> *mut bindings::page {
- self.page.as_ptr()
+ Opaque::cast_into(&self.page)
}
/// Get the node id containing this page.
@@ -372,10 +309,11 @@ pub unsafe fn copy_from_user_slice_raw(
}
}
-impl Drop for Page {
+impl Ownable for Page {
#[inline]
- fn drop(&mut self) {
- // SAFETY: By the type invariants, we have ownership of the page and can free it.
- unsafe { bindings::__free_pages(self.page.as_ptr(), 0) };
+ unsafe fn release(this: NonNull<Self>) {
+ // SAFETY: By the function safety requirements, we have ownership of the page and can free
+ // it. Since Page is transparent, we can cast the raw pointer directly.
+ unsafe { bindings::__free_pages(this.as_ptr().cast(), 0) };
}
}
--
2.51.2
^ permalink raw reply related
* [PATCH v18 7/8] rust: Add `OwnableRefCounted`
From: Andreas Hindborg @ 2026-06-25 10:15 UTC (permalink / raw)
To: Danilo Krummrich, Lorenzo Stoakes, Vlastimil Babka,
Liam R. Howlett, Uladzislau Rezki, Miguel Ojeda, Boqun Feng,
Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
Trevor Gross, Daniel Almeida, Tamir Duberstein, Alexandre Courbot,
Onur Özkan, Lyude Paul, Greg Kroah-Hartman,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas, Rafael J. Wysocki, Dave Ertman, Ira Weiny,
Leon Romanovsky, Paul Moore, Serge Hallyn, David Airlie,
Simona Vetter, Alexander Viro, Jan Kara, Igor Korotin,
Viresh Kumar, Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
Krzysztof Wilczyński, Pavel Tikhomirov, Michal Wilczynski
Cc: Andreas Hindborg, Philipp Stanner, rust-for-linux, linux-kernel,
linux-mm, driver-core, linux-block, linux-security-module,
dri-devel, linux-fsdevel, linux-pm, linux-pci, linux-pwm,
Oliver Mangold
In-Reply-To: <20260625-unique-ref-v18-0-4e06b5896d47@kernel.org>
From: Oliver Mangold <oliver.mangold@pm.me>
Types implementing one of these traits can safely convert between an
`ARef<T>` and an `Owned<T>`.
This is useful for types which generally are accessed through an `ARef`
but have methods which can only safely be called when the reference is
unique, like e.g. `block::mq::Request::end_ok()`.
Signed-off-by: Oliver Mangold <oliver.mangold@pm.me>
[ Andreas: Fix formatting, update documentation, fix error handling in
examples. ]
Co-developed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
---
rust/kernel/owned.rs | 140 +++++++++++++++++++++++++++++++++++++++++++++--
rust/kernel/sync/aref.rs | 16 +++++-
rust/kernel/types.rs | 1 +
3 files changed, 151 insertions(+), 6 deletions(-)
diff --git a/rust/kernel/owned.rs b/rust/kernel/owned.rs
index e79936c00002c..bb4223c0f725a 100644
--- a/rust/kernel/owned.rs
+++ b/rust/kernel/owned.rs
@@ -14,20 +14,26 @@
pin::Pin,
ptr::NonNull, //
};
+use kernel::{
+ sync::aref::ARef,
+ types::RefCounted, //
+};
use kernel::types::ForeignOwnable;
/// Types that specify their own way of performing allocation and destruction. Typically, this trait
/// is implemented on types from the C side.
///
-/// Implementing this trait allows types to be referenced via the [`Owned<Self>`] pointer type. This
-/// is useful when it is desirable to tie the lifetime of the reference to an owned object, rather
-/// than pass around a bare reference. [`Ownable`] types can define custom drop logic that is
-/// executed when the owned reference [`Owned<Self>`] pointing to the object is dropped.
+/// Implementing this trait allows types to be referenced via the [`Owned<Self>`] pointer type.
+/// - This is useful when it is desirable to tie the lifetime of an object reference to an owned
+/// object, rather than pass around a bare reference.
+/// - [`Ownable`] types can define custom drop logic that is executed when the owned reference
+/// of type [`Owned<_>`] pointing to the object is dropped.
///
/// Note: The underlying object is not required to provide internal reference counting, because it
/// represents a unique, owned reference. If reference counting (on the Rust side) is required,
-/// [`RefCounted`](crate::types::RefCounted) should be implemented.
+/// [`RefCounted`] should be implemented. [`OwnableRefCounted`] should be implemented if conversion
+/// between unique and shared (reference counted) ownership is needed.
///
/// # Examples
///
@@ -239,3 +245,127 @@ unsafe fn borrow_mut<'a>(ptr: *mut kernel::ffi::c_void) -> Self::BorrowedMut<'a>
unsafe { Pin::new_unchecked(inner) }
}
}
+
+/// A trait for objects that can be wrapped in either one of the reference types [`Owned`] and
+/// [`ARef`].
+///
+/// # Examples
+///
+/// A minimal example implementation of [`OwnableRefCounted`], [`Ownable`] and its usage with
+/// [`ARef`] and [`Owned`] looks like this:
+///
+/// ```
+/// # #![expect(clippy::disallowed_names)]
+/// # use core::cell::Cell;
+/// # use core::ptr::NonNull;
+/// # use kernel::alloc::{flags, kbox::KBox, AllocError};
+/// # use kernel::sync::aref::{ARef, RefCounted};
+/// # use kernel::types::{Owned, Ownable, OwnableRefCounted};
+///
+/// // An internally refcounted struct for demonstration purposes.
+/// //
+/// // # Invariants
+/// //
+/// // - `refcount` is always non-zero for a valid object.
+/// // - `refcount` is >1 if there is more than one Rust reference to it.
+/// //
+/// struct Foo {
+/// refcount: Cell<usize>,
+/// }
+///
+/// impl Foo {
+/// fn new() -> Result<Owned<Self>> {
+/// // We are just using a `KBox` here to handle the actual allocation, as our `Foo` is
+/// // not actually a C-allocated object.
+/// // INVARIANT: We initialize `refcount` to 1, satisfying the invariants.
+/// let result = KBox::new(
+/// Foo {
+/// refcount: Cell::new(1),
+/// },
+/// flags::GFP_KERNEL,
+/// )?;
+/// let result = KBox::into_non_null(result);
+/// // SAFETY:
+/// // - We just allocated the `Self`, thus it is valid and we own it.
+/// // - We can transfer this ownership to the `from_raw` method.
+/// Ok(unsafe { Owned::from_raw(result) })
+/// }
+/// }
+///
+/// // SAFETY: We increment and decrement each time the respective function is called and only free
+/// // the `Foo` when the refcount reaches zero.
+/// unsafe impl RefCounted for Foo {
+/// fn inc_ref(&self) {
+/// self.refcount.replace(self.refcount.get() + 1);
+/// }
+///
+/// unsafe fn dec_ref(this: NonNull<Self>) {
+/// // SAFETY: By requirement on calling this function, the refcount is non-zero,
+/// // implying the underlying object is valid.
+/// let refcount = unsafe { &this.as_ref().refcount };
+/// let new_refcount = refcount.get() - 1;
+/// if new_refcount == 0 {
+/// // The `Foo` will be dropped when `KBox` goes out of scope.
+/// // SAFETY: The [`KBox<Foo>`] is still alive as the old refcount is 1. We can pass
+/// // ownership to the [`KBox`] as by requirement on calling this function,
+/// // the `Self` will no longer be used by the caller.
+/// unsafe { KBox::from_raw(this.as_ptr()) };
+/// } else {
+/// refcount.replace(new_refcount);
+/// }
+/// }
+/// }
+///
+/// impl OwnableRefCounted for Foo {
+/// fn try_from_shared(this: ARef<Self>) -> Result<Owned<Self>, ARef<Self>> {
+/// if this.refcount.get() == 1 {
+/// // SAFETY: The `Foo` is still alive and has no other Rust references as the refcount
+/// // is 1.
+/// Ok(unsafe { Owned::from_raw(ARef::into_raw(this)) })
+/// } else {
+/// Err(this)
+/// }
+/// }
+/// }
+///
+/// impl Ownable for Foo {
+/// unsafe fn release(this: NonNull<Self>) {
+/// // SAFETY: Using `dec_ref()` from [`RefCounted`] to release is okay, as the refcount is
+/// // always 1 for an [`Owned<Foo>`].
+/// unsafe { Foo::dec_ref(this) };
+/// }
+/// }
+///
+/// let foo = Foo::new()?;
+/// let foo = ARef::from(foo);
+/// {
+/// let bar = foo.clone();
+/// assert!(Owned::try_from(bar).is_err());
+/// }
+/// assert!(Owned::try_from(foo).is_ok());
+/// # Ok::<(), Error>(())
+/// ```
+pub trait OwnableRefCounted: RefCounted + Ownable + Sized {
+ /// Checks if the [`ARef`] is unique and converts it to an [`Owned`] if that is the case.
+ /// Otherwise it returns again an [`ARef`] to the same underlying object.
+ fn try_from_shared(this: ARef<Self>) -> Result<Owned<Self>, ARef<Self>>;
+
+ /// Converts the [`Owned`] into an [`ARef`].
+ #[inline]
+ fn into_shared(this: Owned<Self>) -> ARef<Self> {
+ // SAFETY: `Owned::into_raw` returns a pointer to a valid `Self`, and the `Owned` owned the
+ // reference count that we now transfer to the new `ARef`.
+ unsafe { ARef::from_raw(Owned::into_raw(this)) }
+ }
+}
+
+impl<T: OwnableRefCounted> TryFrom<ARef<T>> for Owned<T> {
+ type Error = ARef<T>;
+ /// Tries to convert the [`ARef`] to an [`Owned`] by calling
+ /// [`try_from_shared()`](OwnableRefCounted::try_from_shared). In case the [`ARef`] is not
+ /// unique, it returns again an [`ARef`] to the same underlying object.
+ #[inline]
+ fn try_from(b: ARef<T>) -> Result<Owned<T>, Self::Error> {
+ T::try_from_shared(b)
+ }
+}
diff --git a/rust/kernel/sync/aref.rs b/rust/kernel/sync/aref.rs
index d0865aeb9371b..77eb390139079 100644
--- a/rust/kernel/sync/aref.rs
+++ b/rust/kernel/sync/aref.rs
@@ -23,6 +23,10 @@
ops::Deref,
ptr::NonNull, //
};
+use kernel::types::{
+ OwnableRefCounted,
+ Owned, //
+};
/// Types that are internally reference counted.
///
@@ -35,7 +39,10 @@
/// Note: Implementing this trait allows types to be wrapped in an [`ARef<Self>`]. It requires an
/// internal reference count and provides only shared references. If unique references are required
/// [`Ownable`](crate::types::Ownable) should be implemented which allows types to be wrapped in an
-/// [`Owned<Self>`](crate::types::Owned).
+/// [`Owned<Self>`](crate::types::Owned). Implementing the trait
+/// [`OwnableRefCounted`] allows to convert between unique and
+/// shared references (i.e. [`Owned<Self>`](crate::types::Owned) and
+/// [`ARef<Self>`](crate::types::Owned)).
///
/// # Safety
///
@@ -188,6 +195,13 @@ fn from(b: &T) -> Self {
}
}
+impl<T: OwnableRefCounted> From<Owned<T>> for ARef<T> {
+ #[inline]
+ fn from(b: Owned<T>) -> Self {
+ T::into_shared(b)
+ }
+}
+
impl<T: RefCounted> Drop for ARef<T> {
fn drop(&mut self) {
// SAFETY: The type invariants guarantee that the `ARef` owns the reference we're about to
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 5ef763717e59a..6aa760952cb63 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -18,6 +18,7 @@
pub use crate::{
owned::{
Ownable,
+ OwnableRefCounted,
Owned, //
},
sync::aref::{
--
2.51.2
^ permalink raw reply related
* RE: [PATCH V2 1/8] PCI: imx6: Add skip_pwrctrl_off flag support
From: Sherry Sun @ 2026-06-25 7:25 UTC (permalink / raw)
To: Frank Li (OSS)
Cc: Sherry Sun (OSS), robh@kernel.org, krzk+dt@kernel.org,
conor+dt@kernel.org, Frank Li, s.hauer@pengutronix.de,
kernel@pengutronix.de, festevam@gmail.com, Amitkumar Karwar,
Neeraj Sanjay Kale, marcel@holtmann.org, luiz.dentz@gmail.com,
Hongxing Zhu, l.stach@pengutronix.de, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, bhelgaas@google.com,
brgl@kernel.org, imx@lists.linux.dev, linux-pci@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, devicetree@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-bluetooth@vger.kernel.org,
linux-pm@vger.kernel.org
In-Reply-To: <ajwOvZUlOEQzmjsu@SMW015318>
> Subject: Re: [PATCH V2 1/8] PCI: imx6: Add skip_pwrctrl_off flag support
>
> On Wed, Jun 24, 2026 at 07:09:26AM +0000, Sherry Sun wrote:
> > > Subject: Re: [PATCH V2 1/8] PCI: imx6: Add skip_pwrctrl_off flag
> > > support
> > >
> > > On Tue, Jun 23, 2026 at 11:07:28AM +0800, Sherry Sun (OSS) wrote:
> > > > From: Sherry Sun <sherry.sun@nxp.com>
> > > >
> > > > Use dw_pcie_rp::skip_pwrctrl_off to avoid powering off devices
> > > > during suspend to preserve wakeup capability of the devices and
> > > > also not to power on the devices in the init path.
> > > > This allows controller power-off to be skipped when some devices(e.g.
> > > > M.2 cards key E without auxiliary power) required to support PCIe
> > > > L2 link state and wake-up mechanisms.
> > > >
> > > > Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
> > > > ---
> > > > drivers/pci/controller/dwc/pci-imx6.c | 36
> > > > +++++++++++++++++----------
> > > > 1 file changed, 23 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/drivers/pci/controller/dwc/pci-imx6.c
> > > > b/drivers/pci/controller/dwc/pci-imx6.c
> > > > index 0fa716d1ed75..ff5a9565dbbf 100644
> > > > --- a/drivers/pci/controller/dwc/pci-imx6.c
> > > > +++ b/drivers/pci/controller/dwc/pci-imx6.c
> > > > @@ -1382,16 +1382,20 @@ static int imx_pcie_host_init(struct
> > > > dw_pcie_rp
> > > *pp)
> > > > }
> > > > }
> > > >
> > > > - ret = pci_pwrctrl_create_devices(dev);
> > > > - if (ret) {
> > > > - dev_err(dev, "failed to create pwrctrl devices\n");
> > > > - goto err_reg_disable;
> > > > + if (!pci->suspended) {
> > > > + ret = pci_pwrctrl_create_devices(dev);
> > >
> > > Is possible move pci_pwrctrl_create_devices() of
> > > pci_pwrctrl_create_devices
> > >
> > > and call it direct at probe() function, like other regulator_get function.
> > >
> >
> > Hi Frank,
> > That makes sense. However, if we move pci_pwrctrl_create_devices () to
> > probe(), we may need to add the following goto err_pwrctrl_destroy
> > path in imx_pcie_probe() to properly handle errors from
> > pci_pwrctrl_power_on_devices(), is that acceptable?
>
> Can you add a API devm_pci_pwrctrl_create_devices() ?
>
Hi Frank, we cannot unconditionally destroy the pwrctrl devices
when probing fails by using devm API.
Since we need to check the return value of
pci_pwrctrl_power_on_devices() for example EPROBE_DEFER to decide
whether to destroy the pwrctrl devices to avoid the deferred probe loop.
You can find more related discussion here.
https://lore.kernel.org/all/tutxwjciedqoje5wxvtin4h637auni5zzpvb7rtfg4uticxoux@yfl6xg7oht7t/
Best Regards
Sherry
>
> >
> > @@ -1960,11 +1949,15 @@ static int imx_pcie_probe(struct
> platform_device *pdev)
> > if (ret)
> > return ret;
> >
> > + ret = pci_pwrctrl_create_devices(dev);
> > + if (ret)
> > + return dev_err_probe(dev, ret, "failed to create
> > + pwrctrl devices\n");
> > +
> > pci->use_parent_dt_ranges = true;
> > if (imx_pcie->drvdata->mode == DW_PCIE_EP_TYPE) {
> > ret = imx_add_pcie_ep(imx_pcie, pdev);
> > if (ret < 0)
> > - return ret;
> > + goto err_pwrctrl_destroy;
> >
> > /*
> > * FIXME: Only single Device (EPF) is supported due to
> > the @@ -1979,7 +1972,7 @@ static int imx_pcie_probe(struct
> platform_device *pdev)
> > pci->pp.use_atu_msg = true;
> > ret = dw_pcie_host_init(&pci->pp);
> > if (ret < 0)
> > - return ret;
> > + goto err_pwrctrl_destroy;
> >
> > if (pci_msi_enabled()) {
> > u8 offset = dw_pcie_find_capability(pci,
> > PCI_CAP_ID_MSI); @@ -1991,6 +1984,11 @@ static int
> imx_pcie_probe(struct platform_device *pdev)
> > }
> >
> > return 0;
> > +
> > +err_pwrctrl_destroy:
> > + if (ret != -EPROBE_DEFER)
> > + pci_pwrctrl_destroy_devices(dev);
> > + return ret;
> > }
> >
> > Best Regards
> > Sherry
> >
> > >
> > > > + if (ret) {
> > > > + dev_err(dev, "failed to create pwrctrl devices\n");
> > > > + goto err_reg_disable;
> > > > + }
> > > > }
> > > >
> > > > - ret = pci_pwrctrl_power_on_devices(dev);
> > > > - if (ret) {
> > > > - dev_err(dev, "failed to power on pwrctrl devices\n");
> > > > - goto err_pwrctrl_destroy;
> > > > + if (!pp->skip_pwrctrl_off) {
> > > > + ret = pci_pwrctrl_power_on_devices(dev);
> > > > + if (ret) {
> > > > + dev_err(dev, "failed to power on pwrctrl devices\n");
> > > > + goto err_pwrctrl_destroy;
> > > > + }
> > > > }
> > > >
> > > > ret = imx_pcie_clk_enable(imx_pcie); @@ -1460,9 +1464,10 @@
> > > static
> > > > int imx_pcie_host_init(struct dw_pcie_rp *pp)
> > > > err_clk_disable:
> > > > imx_pcie_clk_disable(imx_pcie);
> > > > err_pwrctrl_power_off:
> > > > - pci_pwrctrl_power_off_devices(dev);
> > > > + if (!pp->skip_pwrctrl_off)
> > > > + pci_pwrctrl_power_off_devices(dev);
> > > > err_pwrctrl_destroy:
> > > > - if (ret != -EPROBE_DEFER)
> > > > + if (ret != -EPROBE_DEFER && !pci->suspended)
> > > > pci_pwrctrl_destroy_devices(dev);
> > > > err_reg_disable:
> > > > if (imx_pcie->vpcie)
> > > > @@ -1482,7 +1487,8 @@ static void imx_pcie_host_exit(struct
> > > > dw_pcie_rp
> > > *pp)
> > > > }
> > > > imx_pcie_clk_disable(imx_pcie);
> > > >
> > > > - pci_pwrctrl_power_off_devices(pci->dev);
> > > > + if (!pci->pp.skip_pwrctrl_off)
> > > > + pci_pwrctrl_power_off_devices(pci->dev);
> > > > if (imx_pcie->vpcie)
> > > > regulator_disable(imx_pcie->vpcie);
> > > > }
> > > > @@ -1990,12 +1996,16 @@ static int imx_pcie_probe(struct
> > > > platform_device *pdev) static void imx_pcie_shutdown(struct
> > > > platform_device *pdev) {
> > > > struct imx_pcie *imx_pcie = platform_get_drvdata(pdev);
> > > > + struct dw_pcie *pci = imx_pcie->pci;
> > > > + struct dw_pcie_rp *pp = &pci->pp;
> > > >
> > > > /* bring down link, so bootloader gets clean state in case of reboot */
> > > > imx_pcie_assert_core_reset(imx_pcie);
> > > > imx_pcie_assert_perst(imx_pcie, true);
> > > > - pci_pwrctrl_power_off_devices(&pdev->dev);
> > > > - pci_pwrctrl_destroy_devices(&pdev->dev);
> > > > + if (!pp->skip_pwrctrl_off)
> > > > + pci_pwrctrl_power_off_devices(&pdev->dev);
> > > > + if (!pci->suspended)
> > > > + pci_pwrctrl_destroy_devices(&pdev->dev);
> > > > }
> > > >
> > > > static const struct imx_pcie_drvdata drvdata[] = {
> > > > --
> > > > 2.50.1
> > > >
> > > >
^ permalink raw reply
* Re: [PATCH v5 8/8] PCI: vmd: Add workaround for bus number hardwired to fixed non-zero value
From: sashiko-bot @ 2026-06-25 6:30 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-9-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
New issues:
- [High] Modifying `bus->number` dynamically after bus registration breaks PCI device naming conventions and sysfs topology consistency.
Pre-existing issues:
- [High] Missing cleanup of VMD_BUS_0 when VMD_BUS_1 creation fails, leading to a PCI root bus leak and a potential Use-After-Free.
--
commit daf784b1aba7feceb67edffc6d0d8a59f8bc2fb9
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
PCI: vmd: Add workaround for bus number hardwired to fixed non-zero value
The VMD BUS1 root bus number is fixed in hardware to 0x80. This patch adds
a workaround to bypass bus number reconfiguration by updating the BUS1
number to be the same as the BUS1 primary number.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index c68ea7837620d..d0527be78ebd5 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -440,8 +440,23 @@ static void vmd_remove_irq_domain(struct vmd_dev *vmd)
> static void __iomem *vmd_cfg_addr(struct vmd_dev *vmd, struct pci_bus *bus,
> unsigned int devfn, int reg, int len)
> {
> - unsigned int busnr_ecam = bus->number - vmd->busn_start[VMD_BUS_0];
> - u32 offset = PCIE_ECAM_OFFSET(busnr_ecam, devfn, reg);
> + unsigned char bus_number;
> + unsigned int busnr_ecam;
> + u32 offset;
> +
> + /*
> + * VMD workaround: for VMD BUS1, bus->number is set to VMD_PRIMARY_BUS1
> + * (see comment in vmd_enable_domain() under vmd_create_bus() for BUS1)
> + * but original value is 0xE1 which is stored in
> + * vmd->busn_start[VMD_BUS_1].
> + */
> + if (vmd->bus1_rootbus && bus->number == VMD_PRIMARY_BUS1)
> + bus_number = vmd->busn_start[VMD_BUS_1];
[Severity: High]
Does this code undo the bus number mutation performed in vmd_enable_domain()
for config space accesses?
While this ensures config space accesses still work, it highlights the
mismatch that leaves the sysfs topology structurally inconsistent for
userspace.
> + else
> + bus_number = bus->number;
> +
> + busnr_ecam = bus_number - vmd->busn_start[VMD_BUS_0];
> + offset = PCIE_ECAM_OFFSET(busnr_ecam, devfn, reg);
>
> if (offset + len >= resource_size(&vmd->dev->resource[VMD_CFGBAR]))
> return NULL;
[ ... ]
> @@ -1177,6 +1192,19 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
> */
> vmd->bus[VMD_BUS_1]->primary = VMD_PRIMARY_BUS1;
[Severity: High]
This is a pre-existing issue, but does the error handling just above this
code leak VMD_BUS_0?
If vmd_create_bus() for VMD_BUS_1 fails:
vmd_enable_domain() {
...
ret = vmd_create_bus(vmd, VMD_BUS_0, sd, offset);
if (ret)
return ret; // Succeeds
...
if (vmd->bus1_rootbus) {
ret = vmd_create_bus(vmd, VMD_BUS_1, sd, offset);
if (ret) {
pci_err(vmd->dev, "Can't create BUS1: %d\n", ret);
return ret; // Leaks VMD_BUS_0
}
...
}
The error path returns directly without calling pci_stop_root_bus() and
pci_remove_root_bus() for VMD_BUS_0.
Could this lead to a use-after-free if the PCI core retains a reference to
the leaked VMD_BUS_0 after the vmd device allocation is freed?
> + /*
> + * This is a workaround for pci_scan_bridge_extend() code.
> + * It marks bridge invalid configuration when detecting a
> + * non-zero (0x80) the VMD BUS1 root bus number. Thus Primary Bus Number
> + * of Root Ports on BUS1 is deconfigured in the first pass of
> + * pci_scan_bridge() to be re-assigned to 0x0 in the second pass.
> + * As a result no subordinate bus number behind VMD BUS1 is found.
> + * Workaround: update vmd->bus[VMD_BUS_1]->number and
> + * vmd->bus[VMD_BUS_1]->primary to the same value, which
> + * bypasses bus number reconfiguration.
> + */
> + vmd->bus[VMD_BUS_1]->number = VMD_PRIMARY_BUS1;
[Severity: High]
Does modifying the bus number dynamically after bus registration break PCI
device naming conventions and sysfs topology consistency?
Because the bus number is forcibly mutated after the VMD_BUS_1 root bus
device has already been registered with the PCI subsystem, child devices
downstream will be enumerated and registered using names derived from
0x80 (e.g., 0000:80:1c.0).
Meanwhile, their parent bus directory in sysfs remains named using the
original bus number (e.g., 0000:e1), which can break userspace tools
expecting standard PCI naming.
> +
> WARN(sysfs_create_link(&vmd->dev->dev.kobj,
> &vmd->bus[VMD_BUS_1]->dev.kobj,
> "domain1"),
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=8
^ permalink raw reply
* Re: [PATCH v5 7/8] PCI: vmd: Add support for second rootbus under VMD
From: sashiko-bot @ 2026-06-25 6:20 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-8-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] Flawed arithmetic when configuring MEMBAR offsets guarantees overlapping or gapped PCI resource windows between BUS0 and BUS1.
- [High] Failure to clean up VMD_BUS_0 on VMD_BUS_1 creation failure leads to a severe Use-After-Free.
- [High] Iterating over an unvalidated sum of disjoint bus ranges in `vmd_domain_reset` causes missed bus resets and potential out-of-bounds MMIO accesses.
- [Medium] VMD_RES_BUS1_CFGBAR is assigned an IORESOURCE_MEM resource as its parent, corrupting the resource tree.
- [Medium] An uninitialized stack variable is assigned to resource bounds if PCI config space reads fail.
--
commit 46d85add5d9d3ed310b9c5a3c11026a4231b01a9
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
Subject: PCI: vmd: Add support for second rootbus under VMD
Starting from Intel Arrow Lake VMD enhancement introduces second root bus
support with fixed root bus number (0x80). It means that all 3 MMIO BARs
exposed by VMD are shared now between both buses (current BUS0 and
new BUS1).
Add new BUS1 enumeration and divide MMIO space to be shared between
both root buses.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 2bfc16875d912..c68ea7837620d 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
[ ... ]
> @@ -537,11 +568,23 @@ static inline void vmd_acpi_end(void) { }
>
> static void vmd_domain_reset(struct vmd_dev *vmd)
> {
> - u16 bus, max_buses = resource_size(&vmd->resources[VMD_RES_CFGBAR]);
> + u16 bus, bus_cnt = resource_size(&vmd->resources[VMD_RES_CFGBAR]);
> u8 dev, functions, fn, hdr_type;
> char __iomem *base;
>
> - for (bus = 0; bus < max_buses; bus++) {
> + /*
> + * Total bus-number count (bus_cnt) when BUS1 root bus is present:
> + * initialized with BUS0 CFGBAR range, extended by BUS1 CFGBAR range plus
> + * one bus number per root bus' own config space (bus_cnt is increased by 2)
> + * as the CFGBAR ranges do not include the root bus config space.
> + */
> +
> + if (vmd->bus1_rootbus) {
> + bus_cnt += resource_size(&vmd->resources[VMD_RES_BUS1_CFGBAR]);
> + bus_cnt += 2;
> + }
> +
> + for (bus = 0; bus < bus_cnt; bus++) {
> for (dev = 0; dev < 32; dev++) {
> base = vmd->cfgbar + PCIE_ECAM_OFFSET(bus,
> PCI_DEVFN(dev, 0), 0);
[Severity: High]
The bus_cnt variable combines the sizes of both the BUS0 and BUS1 CFGBAR
ranges. If the ranges for BUS0 and BUS1 are sparse, does this continuous
loop up to the sum of their sizes end up missing the higher bus numbers?
Could this also access out-of-bounds MMIO offsets if the sum exceeds the
mapped region for ECAM?
[ ... ]
> @@ -799,6 +867,30 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
> (resource_size(res) >> 20) - 1,
> .flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
> };
> +
> + if (vmd->bus1_rootbus) {
> + u16 bus0_range = 0;
> + u16 bus1_range = 0;
> +
> + pci_read_config_word(vmd->dev, VMD_BUSRANGE0, &bus0_range);
> + pci_read_config_word(vmd->dev, VMD_BUSRANGE1, &bus1_range);
> +
> + /*
> + * Resize BUS0 CFGBAR range to make space for BUS1
> + * owned devices by adjusting range end with value stored in
> + * VMD_BUSRANGE0 register.
> + */
> + vmd->resources[VMD_RES_CFGBAR].start = VMD_BUS_START(bus0_range);
> + vmd->resources[VMD_RES_CFGBAR].end = VMD_BUS_END(bus0_range);
> +
> + vmd->resources[VMD_RES_BUS1_CFGBAR] = (struct resource){
> + .name = "VMD CFGBAR BUS1",
> + .start = VMD_BUS_START(bus1_range),
> + .end = VMD_BUS_END(bus1_range),
> + .flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
> + .parent = res,
> + };
> + }
> }
[Severity: Medium]
The res pointer here refers to VMD_CFGBAR, which is an IORESOURCE_MEM
resource. Is it safe to assign a memory resource as the parent to
VMD_RES_BUS1_CFGBAR, which is an IORESOURCE_BUS resource?
Will this cross-namespace parenting confuse the generic PCI resource
management?
[ ... ]
> @@ -840,41 +932,83 @@ static void vmd_configure_membar(struct vmd_dev *vmd,
> static void vmd_configure_membar1_membar2(struct vmd_dev *vmd,
> resource_size_t mbar2_ofs)
> {
> - vmd_configure_membar(vmd, 1, VMD_MEMBAR1, 0, 0);
> - vmd_configure_membar(vmd, 2, VMD_MEMBAR2, mbar2_ofs, 0);
> + if (vmd->bus1_rootbus) {
> + u32 bus1_mbar1_ofs = 0;
> + u64 bus1_mbar2_ofs = 0;
> + u32 reg;
> +
> + pci_read_config_dword(vmd->dev, VMD_MEMBAR1_OFFSET,
> + &bus1_mbar1_ofs);
> +
> + pci_read_config_dword(vmd->dev, VMD_MEMBAR2_OFFSET1, ®);
> + bus1_mbar2_ofs = reg;
[Severity: Medium]
If pci_read_config_dword() fails due to a bus error or unsupported
operation, it leaves reg untouched. Could this result in uninitialized
stack memory being used to configure the physical MMIO allocation for
BUS1?
> +
> + pci_read_config_dword(vmd->dev, VMD_MEMBAR2_OFFSET2, ®);
> + bus1_mbar2_ofs |= (u64)reg << 32;
> +
> + /*
> + * Resize BUS MEMBAR1 and MEMBAR2 ranges to make space
> + * for BUS1 owned devices by adjusting range end with values
> + * stored in VMD_MEMBAR1_OFFSET and VMD_MEMBAR2_OFFSET registers
> + */
> + vmd_configure_membar(vmd, VMD_RES_MBAR_1, VMD_MEMBAR1, 0,
> + bus1_mbar1_ofs);
> + vmd_configure_membar(vmd, VMD_RES_MBAR_2, VMD_MEMBAR2,
> + mbar2_ofs, bus1_mbar2_ofs - mbar2_ofs);
> +
> + vmd_configure_membar(vmd, VMD_RES_BUS1_MBAR_1, VMD_MEMBAR1,
> + bus1_mbar1_ofs, 0);
> + vmd_configure_membar(vmd, VMD_RES_BUS1_MBAR_2, VMD_MEMBAR2,
> + mbar2_ofs + bus1_mbar2_ofs, 0);
[Severity: Critical]
The vmd_configure_membar() function subtracts the last argument from the
end of the resource. By passing bus1_mbar1_ofs as both the start offset
for BUS1 and the amount to subtract from the end of BUS0, won't the
resources overlap?
For example, if the offset is smaller than half the total BAR size, does
this cause identical physical MMIO ranges to be assigned to distinct
devices on BUS0 and BUS1?
[ ... ]
> @@ -1028,6 +1164,27 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>
> vmd_bus_enumeration(vmd->bus[VMD_BUS_0], features);
>
> + if (vmd->bus1_rootbus) {
> + ret = vmd_create_bus(vmd, VMD_BUS_1, sd, offset);
> + if (ret) {
> + pci_err(vmd->dev, "Can't create BUS1: %d\n", ret);
> + return ret;
> + }
[Severity: High]
If vmd_create_bus() fails for VMD_BUS_1, this path returns early. However,
VMD_BUS_0 has already been fully created and enumerated earlier in this
function.
Since vmd_remove() is not called on the probe error path, does this leave
VMD_BUS_0 active while the struct vmd_dev context is subsequently freed by
devres, resulting in a use-after-free?
> +
> + /*
> + * Primary bus is not set by pci_create_root_bus(), it is
> + * updated here
> + */
> + vmd->bus[VMD_BUS_1]->primary = VMD_PRIMARY_BUS1;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=7
^ permalink raw reply
* Re: [PATCH v5 5/8] PCI: vmd: Replace hardcoded values with enum and defines
From: sashiko-bot @ 2026-06-25 6:20 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-6-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
New issues:
- [Medium] Leak of dynamically allocated resource names on the vmd_probe error path.
Pre-existing issues:
- [High] Missing pci_free_irq_vectors() calls in error paths and during device removal, leading to vector leakage.
- [High] vmd_cleanup_srcu() destroys SRCU structs before devm-managed IRQ handlers are unregistered.
--
commit 415d4461cb0ec06aaac750c476d8975aa2ffae22
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
PCI: vmd: Replace hardcoded values with enum and defines
Add enum vmd_resource type to replace hardcoded values. Add defines for
vmd bus start number based on VMD restriction value. No functional
changes.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index c0a953df5f238..9ff82a24e859c 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
[ ... ]
> @@ -794,7 +806,8 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
> *
> * Function fills resource buffer inside the VMD structure.
> */
> -static void vmd_configure_membar(struct vmd_dev *vmd, u8 resource_number,
> +static void vmd_configure_membar(struct vmd_dev *vmd,
> + enum vmd_resource resource_number,
> u8 membar_number, resource_size_t start_offset,
> resource_size_t end_offset)
> {
[Severity: Medium]
Does this memory allocation lack proper cleanup in error paths?
In vmd_configure_membar(), resource names are allocated dynamically:
.name = kasprintf(GFP_KERNEL, "VMD MEMBAR%d %s", membar_number / 2, ...),
If vmd_enable_domain() fails after these names are successfully allocated,
the error propagates to vmd_probe(). The error handling block at the
out_release_instance label returns without explicitly freeing these allocated
strings, which leaks memory.
[ ... ]
> @@ -1104,8 +1119,8 @@ static void vmd_remove(struct pci_dev *dev)
> pci_remove_root_bus(vmd->bus);
>
> /* CFGBAR is static, does not require releasing memory */
> - kfree(vmd->resources[1].name);
> - kfree(vmd->resources[2].name);
> + kfree(vmd->resources[VMD_RES_MBAR_1].name);
> + kfree(vmd->resources[VMD_RES_MBAR_2].name);
[Severity: High]
This isn't a bug introduced by this patch, but does this function omit the
required call to pci_free_irq_vectors()?
The PCI subsystem guidelines require that after a successful call to
pci_alloc_irq_vectors(), every error path and device removal must explicitly
call pci_free_irq_vectors() to prevent leaking IRQ vectors.
If init_srcu_struct() or devm_request_irq() fails in vmd_alloc_irqs(), or if
any operation fails in vmd_enable_domain() after allocation, the function
returns without cleanup.
Similarly, this vmd_remove() function completes tear-down without freeing
the vectors. Exhausting system IRQ capacity could prevent future allocations.
>
> vmd_cleanup_srcu(vmd);
[Severity: High]
This isn't a bug introduced by this patch, but does cleaning up SRCU here
create a race condition with the devm-managed IRQ handler?
During device removal, vmd_remove() calls vmd_cleanup_srcu() to destroy
the SRCU structs. However, the driver's IRQ handlers were registered using
devm_request_irq(), which means they aren't deregistered until the devres
cleanup phase executes after vmd_remove() returns.
If a spurious MSI interrupt fires during this window, the handler will run
and invoke srcu_read_lock() on the already destroyed srcu_struct:
drivers/pci/controller/vmd.c:vmd_irq() {
...
idx = srcu_read_lock(&irqs->srcu);
...
}
Could this use-after-free on per-CPU memory lead to memory corruption?
> vmd_detach_resources(vmd);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=5
^ permalink raw reply
* Re: [PATCH v5 3/8] PCI: vmd: Add vmd_configure_membar() and vmd_configure_membar1_membar2()
From: sashiko-bot @ 2026-06-25 6:19 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-4-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
New issues:
- [Critical] Dynamically allocated resource names are freed before the resources are detached from the resource tree, causing a Use-After-Free window.
- [Medium] Dynamically allocated memory for VMD MEMBAR resource names is leaked on error paths.
- [Medium] The return value of kasprintf() is not checked, potentially leaving resource names uninitialized as NULL.
Pre-existing issues:
- [High] Error paths and device removal fail to call pci_free_irq_vectors(), causing IRQ resource leaks.
- [High] SRCU structures are destroyed before the IRQ handlers are unregistered, risking a UAF if an interrupt fires during teardown.
--
commit 295e9b4a624efdd5299bd60f26213363d5cf8c68
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
PCI: vmd: Add vmd_configure_membar() and vmd_configure_membar1_membar2()
This commit moves the MEMBAR1 and MEMBAR2 registry initialization code into
new helper functions. It introduces vmd_configure_membar and
vmd_configure_membar1_membar2 to handle the resource setup.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
[ ... ]
> @@ -783,6 +783,48 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
[ ... ]
> +static void vmd_configure_membar(struct vmd_dev *vmd, u8 resource_number,
> + u8 membar_number, resource_size_t start_offset,
> + resource_size_t end_offset)
> +{
[ ... ]
> + vmd->resources[resource_number] = (struct resource){
> + .name = kasprintf(GFP_KERNEL, "VMD MEMBAR%d",
> + membar_number / 2),
[Severity: Medium]
Does this kasprintf call need a check for allocation failure?
If kasprintf fails under memory pressure and returns a null pointer, the
resource structure will be initialized with a null name, which might violate
kernel API expectations for resource structures.
[Severity: Medium]
Could this dynamic allocation cause a memory leak if vmd_enable_domain fails
later on?
Looking at vmd_enable_domain, if functions like vmd_alloc_irqs or
vmd_create_bus return an error, the function exits directly. It doesn't
appear to free these allocated strings on those error paths, leaving the
memory leaked.
[ ... ]
> @@ -1074,6 +1089,11 @@ static void vmd_remove(struct pci_dev *dev)
> pci_stop_root_bus(vmd->bus);
> sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
> pci_remove_root_bus(vmd->bus);
> +
> + /* CFGBAR is static, does not require releasing memory */
> + kfree(vmd->resources[1].name);
> + kfree(vmd->resources[2].name);
[Severity: Critical]
Can this lead to a use-after-free if a userspace tool reads proc iomem
concurrently?
By calling kfree on the resource names here, the memory is freed before
vmd_detach_resources unlinks these resources from the parent PCI device
tree. A concurrent read of the global resource tree during this window
might dereference the freed pointer.
> +
> vmd_cleanup_srcu(vmd);
[Severity: High]
This is a pre-existing issue, but I noticed we don't seem to free the irq
vectors during teardown.
In vmd_alloc_irqs, pci_alloc_irq_vectors is called, but vmd_remove completes
without calling pci_free_irq_vectors. Does this leak IRQ resources,
potentially exhausting system IRQ capacity over multiple bind and unbind
cycles?
[Severity: High]
This isn't a bug introduced by this patch, but is it safe to destroy the SRCU
structure here before the interrupt handlers are unregistered?
The handlers are registered with devm_request_irq, meaning they remain active
until the devres cleanup phase, which happens after vmd_remove finishes. If a
hardware interrupt fires in this window, vmd_irq will call srcu_read_lock on
the already destroyed SRCU object.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=3
^ permalink raw reply
* Re: [PATCH v5 4/8] PCI: vmd: Add vmd_create_bus()
From: sashiko-bot @ 2026-06-25 6:18 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-5-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Memory allocated by kasprintf for resource names is leaked on vmd_enable_domain() error paths.
--
commit 1ab8bc19707a3b648b92906666ae290aa0884407
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
PCI: vmd: Add vmd_create_bus()
This commit moves the VMD bus initialization logic into a new helper function
vmd_create_bus(). It aims to simplify the domain enablement code without
introducing functional changes.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 443a0cefb7f74..c0a953df5f238 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
[ ... ]
> @@ -968,25 +994,12 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>
> sd->node = pcibus_to_node(vmd->dev->bus);
>
> - vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
> - &vmd_ops, sd, &resources);
> - if (!vmd->bus) {
> - pci_bus_release_emul_domain_nr(sd->domain);
> - pci_free_resource_list(&resources);
> - vmd_remove_irq_domain(vmd);
> - return -ENODEV;
> + ret = vmd_create_bus(vmd, sd, offset);
> + if (ret) {
> + pci_err(vmd->dev, "Can't create bus: %d\n", ret);
> + return ret;
[Severity: High]
Does this error path in vmd_enable_domain() leak the resource names allocated
by kasprintf() during the earlier memory bar configuration?
If vmd_create_bus() fails and returns an error, does this early return
lack the necessary rollback logic? Could this leave the previously
initialized bus active while its sysdata pointer (the vmd_dev struct)
is later freed by devres, resulting in a use-after-free?
> }
>
> - vmd_copy_host_bridge_flags(pci_find_host_bridge(vmd->dev->bus),
> - to_pci_host_bridge(vmd->bus->bridge));
> -
> - vmd_attach_resources(vmd);
> - if (vmd->irq_domain)
> - dev_set_msi_domain(&vmd->bus->dev, vmd->irq_domain);
> - else
> - dev_set_msi_domain(&vmd->bus->dev,
> - dev_get_msi_domain(&vmd->dev->dev));
> -
> WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
> "domain"), "Can't create symlink to domain\n");
>
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=4
^ permalink raw reply
* Re: [PATCH v5 6/8] PCI: vmd: Convert bus and busn_start to an array
From: sashiko-bot @ 2026-06-25 6:18 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-7-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues:
- [High] Failing to call pci_free_irq_vectors() in error paths and driver removal leaks IRQ resources.
- [High] SRCU structures are destroyed before IRQ vectors are explicitly disabled, risking a Use-After-Free on spurious interrupts.
--
commit 5baf3e06c174f82b0781954bbca2834e43441922
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
PCI: vmd: Convert bus and busn_start to an array
Convert bus and busn_start from scalar to an array to support multiple VMD
buses in the future. No functional changes.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 9ff82a24e859c..2bfc16875d912 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
[ ... ]
> @@ -1114,9 +1122,9 @@ static void vmd_remove(struct pci_dev *dev)
> {
> struct vmd_dev *vmd = pci_get_drvdata(dev);
>
> - pci_stop_root_bus(vmd->bus);
> + pci_stop_root_bus(vmd->bus[VMD_BUS_0]);
> sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
> - pci_remove_root_bus(vmd->bus);
> + pci_remove_root_bus(vmd->bus[VMD_BUS_0]);
[Severity: High]
This is a pre-existing issue, but does device removal or probe failure leak
IRQ resources?
In vmd_alloc_irqs(), vectors are successfully allocated:
vmd->msix_count = pci_alloc_irq_vectors(dev, vmd->first_vec + 1,
vmd->msix_count, PCI_IRQ_MSIX);
However, they do not appear to be explicitly freed in error paths or upon
driver removal in vmd_remove(). The PCI subsystem guidelines state that every
error path or removal after successful allocation must call
pci_free_irq_vectors() to prevent resource leaks and potential exhaustion of
system IRQ capacity.
[Severity: High]
This is also a pre-existing issue, but is there a use-after-free risk during
teardown if a spurious interrupt fires?
In vmd_remove(), the SRCU structures are destroyed via vmd_cleanup_srcu().
However, the IRQ handlers are managed by devres (via devm_request_irq) and
will only be disabled after vmd_remove() returns.
If a spurious or shared interrupt fires after the SRCU structures are
destroyed but before devres cleanup completes, the handler will execute
and access the destroyed struct:
vmd_irq() {
...
idx = srcu_read_lock(&irqs->srcu);
}
Could this race condition lead to memory corruption or a crash? Should the
interrupts be explicitly disabled before destroying the SRCU structures?
> /* CFGBAR is static, does not require releasing memory */
> kfree(vmd->resources[VMD_RES_MBAR_1].name);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=6
^ permalink raw reply
* Re: [PATCH v5 2/8] PCI: vmd: Add vmd_configure_cfgbar() helper function
From: sashiko-bot @ 2026-06-25 6:17 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-3-szymon.durawa@linux.intel.com>
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Destruction of SRCU struct before freeing IRQ handlers leads to potential Use-After-Free.
--
commit 277c077bdf8fa22a45d1262f68f3e1a7fc357e50
Author: Szymon Durawa <szymon.durawa@linux.intel.com>
PCI: vmd: Add vmd_configure_cfgbar() helper function
Move the VMD CFGBAR initialization code to a new helper
vmd_configure_cfgbar(). No functional changes.
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 070b247b3fbbe..2c824660cabd5 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
[ ... ]
> @@ -857,13 +869,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
> return ret;
> }
>
> - res = &vmd->dev->resource[VMD_CFGBAR];
> - vmd->resources[0] = (struct resource) {
> - .name = "VMD CFGBAR",
> - .start = vmd->busn_start,
> - .end = vmd->busn_start + (resource_size(res) >> 20) - 1,
> - .flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
> - };
> + vmd_configure_cfgbar(vmd);
[Severity: High]
This is a pre-existing issue, but does vmd_remove() have a use-after-free
race?
In vmd_alloc_irqs(), the interrupts are requested via devm_request_irq(),
which means they are freed automatically by devres after vmd_remove() returns.
However, vmd_remove() explicitly destroys the SRCU structure:
vmd_remove()
vmd_cleanup_srcu()
cleanup_srcu_struct(&vmd->irqs[i].srcu)
If an interrupt fires after cleanup_srcu_struct() but before the devres
cleanup unregisters the IRQ, vmd_irq() will unconditionally access the
destroyed SRCU structure:
vmd_irq()
idx = srcu_read_lock(&irqs->srcu);
Should the IRQs be freed before the SRCU structures are destroyed?
>
> /*
> * If the window is below 4GB, clear IORESOURCE_MEM_64 so we can
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=2
^ permalink raw reply
* Re: [PATCH v5 1/8] PCI: vmd: Add vmd_bus_enumeration() helper function
From: sashiko-bot @ 2026-06-25 6:15 UTC (permalink / raw)
To: Szymon Durawa; +Cc: linux-pci
In-Reply-To: <20260625082940.1290740-2-szymon.durawa@linux.intel.com>
> Move the vmd bus enumeration code to a new helper vmd_bus_enumeration().
> No functional changes.
>
> Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
> Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
Sashiko has reviewed this patch and found no issues. It looks great!
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260625082940.1290740-1-szymon.durawa@linux.intel.com?part=1
^ permalink raw reply
* [PATCH v5 8/8] PCI: vmd: Add workaround for bus number hardwired to fixed non-zero value
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
The VMD BUS1 root bus number is fixed in hardware to 0x80. It marks bridge
invalid configuration when detecting a non-zero (0x80) the VMD BUS1 root
bus number.
Thus root bus number is deconfigured in the first pass of pci_scan_bridge()
to be re-assigned to 0x0 in the second pass. As a result no subordinate bus
number behind VMD BUS1 is found.
To avoid bus number reconfiguration, BUS1 number has to be the same
as BUS1 primary number.
Log snippet without workaround:
[ 3.752507] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e1
[ 3.752510] pci_bus 10000:e1: busn_res: can not insert [bus e1-ff] under domain [bus 00-ff] (conflicts with (null) [bus e0-f0])
[ 3.752515] pci_bus 10000:e1: root bus resource [bus f1-ff]
[ 3.752517] pci_bus 10000:e1: root bus resource [mem 0x8c800000-0x8cffffff]
[ 3.752519] pci_bus 10000:e1: root bus resource [mem 0x701b802000-0x701bffffff 64bit]
[ 3.752523] pci_bus 10000:e1: scanning bus
[ 3.752732] pci (null): Looking for ACPI companion (address 0x80e0ffff)
[ 3.752745] pci 10000:e1:1c.0: [8086:7f38] type 01 class 0x060400 PCIe Root Port
[ 3.752779] pci 10000:e1:1c.0: PCI bridge to [bus f1]
[ 3.752861] pci 10000:e1:1c.0: PME# supported from D0 D3hot D3cold
[ 3.752864] pci 10000:e1:1c.0: PME# disabled
[ 3.752909] pci 10000:e1:1c.0: PTM enabled (root), 4ns granularity
[ 3.752981] pci 10000:e1:1c.0: vgaarb: pci_notify
[ 3.752987] pci_bus 10000:e1: fixups for bus
[ 3.752992] pci 10000:e1:1c.0: scanning [bus f1-f1] behind bridge, pass 0
[ 3.752993] pci 10000:e1:1c.0: primary 80, bus->number e1.
[ 3.752994] pci 10000:e1:1c.0: bridge configuration invalid ([bus f1-f1]), reconfiguring
[ 3.753003] pci 10000:e1:1c.0: scanning [bus 00-00] behind bridge, pass 1
[ 3.753004] pci 10000:e1:1c.0: primary 00, bus->number e1.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index c68ea7837620..d0527be78ebd 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -440,8 +440,23 @@ static void vmd_remove_irq_domain(struct vmd_dev *vmd)
static void __iomem *vmd_cfg_addr(struct vmd_dev *vmd, struct pci_bus *bus,
unsigned int devfn, int reg, int len)
{
- unsigned int busnr_ecam = bus->number - vmd->busn_start[VMD_BUS_0];
- u32 offset = PCIE_ECAM_OFFSET(busnr_ecam, devfn, reg);
+ unsigned char bus_number;
+ unsigned int busnr_ecam;
+ u32 offset;
+
+ /*
+ * VMD workaround: for VMD BUS1, bus->number is set to VMD_PRIMARY_BUS1
+ * (see comment in vmd_enable_domain() under vmd_create_bus() for BUS1)
+ * but original value is 0xE1 which is stored in
+ * vmd->busn_start[VMD_BUS_1].
+ */
+ if (vmd->bus1_rootbus && bus->number == VMD_PRIMARY_BUS1)
+ bus_number = vmd->busn_start[VMD_BUS_1];
+ else
+ bus_number = bus->number;
+
+ busnr_ecam = bus_number - vmd->busn_start[VMD_BUS_0];
+ offset = PCIE_ECAM_OFFSET(busnr_ecam, devfn, reg);
if (offset + len >= resource_size(&vmd->dev->resource[VMD_CFGBAR]))
return NULL;
@@ -1177,6 +1192,19 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
*/
vmd->bus[VMD_BUS_1]->primary = VMD_PRIMARY_BUS1;
+ /*
+ * This is a workaround for pci_scan_bridge_extend() code.
+ * It marks bridge invalid configuration when detecting a
+ * non-zero (0x80) the VMD BUS1 root bus number. Thus Primary Bus Number
+ * of Root Ports on BUS1 is deconfigured in the first pass of
+ * pci_scan_bridge() to be re-assigned to 0x0 in the second pass.
+ * As a result no subordinate bus number behind VMD BUS1 is found.
+ * Workaround: update vmd->bus[VMD_BUS_1]->number and
+ * vmd->bus[VMD_BUS_1]->primary to the same value, which
+ * bypasses bus number reconfiguration.
+ */
+ vmd->bus[VMD_BUS_1]->number = VMD_PRIMARY_BUS1;
+
WARN(sysfs_create_link(&vmd->dev->dev.kobj,
&vmd->bus[VMD_BUS_1]->dev.kobj,
"domain1"),
--
2.43.0
^ permalink raw reply related
* [PATCH v5 7/8] PCI: vmd: Add support for second rootbus under VMD
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Starting from Intel Arrow Lake VMD enhancement introduces second root bus
support with fixed root bus number (0x80). It means that all 3 MMIO BARs
exposed by VMD are shared now between both buses (current BUS0 and
new BUS1).
Add new BUS1 enumeration and divide MMIO space to be shared between
both root buses. Due to enumeration issues with root bus hardwired to a
fixed non-zero value, this patch will work with a workaround proposed
in next patch. Without workaround user won't see attached devices for BUS1
root bus.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 225 ++++++++++++++++++++++++++++++-----
1 file changed, 195 insertions(+), 30 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 2bfc16875d91..c68ea7837620 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -4,6 +4,7 @@
* Copyright (c) 2015, Intel Corporation.
*/
+#include <linux/bitfield.h>
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/irq.h>
@@ -30,6 +31,7 @@
#define VMD_RESTRICT_0_BUS_START 0x0
#define VMD_RESTRICT_1_BUS_START 0x80
#define VMD_RESTRICT_2_BUS_START 0xE0
+#define VMD_RESTRICT_3_BUS_START 0xE1
#define PCI_REG_VMCAP 0x40
#define BUS_RESTRICT_CAP(vmcap) (vmcap & 0x1)
@@ -42,15 +44,36 @@
#define MB2_SHADOW_OFFSET 0x2000
#define MB2_SHADOW_SIZE 16
+/* Primary Bus Number for VMD devices on root bus 0 */
+#define VMD_PRIMARY_BUS0 0x00
+/* Primary Bus Number for VMD devices on root bus 1 */
+#define VMD_PRIMARY_BUS1 0x80
+
+#define VMD_BUSRANGE0 0xc8
+#define VMD_BUSRANGE1 0xcc
+#define VMD_MEMBAR1_OFFSET 0xd0
+#define VMD_MEMBAR2_OFFSET1 0xd8
+#define VMD_MEMBAR2_OFFSET2 0xdc
+#define VMD_BUS_END(busr) FIELD_GET(GENMASK(15, 8), busr)
+#define VMD_BUS_START(busr) FIELD_GET(GENMASK(7, 0), busr)
+
+/*
+ * Add VMD resources for BUS1, it will share the same MMIO space with
+ * previous VMD resources.
+ */
enum vmd_resource {
- VMD_RES_CFGBAR = 0, /* VMD Bus0 Config BAR */
- VMD_RES_MBAR_1, /* VMD Bus0 Resource MemBAR 1 */
- VMD_RES_MBAR_2, /* VMD Bus0 Resource MemBAR 2 */
+ VMD_RES_CFGBAR = 0, /* VMD Bus0 Config BAR */
+ VMD_RES_MBAR_1, /* VMD Bus0 Resource MemBAR 1 */
+ VMD_RES_MBAR_2, /* VMD Bus0 Resource MemBAR 2 */
+ VMD_RES_BUS1_CFGBAR, /* VMD Bus1 Config BAR */
+ VMD_RES_BUS1_MBAR_1, /* VMD Bus1 Resource MemBAR 1 */
+ VMD_RES_BUS1_MBAR_2, /* VMD Bus1 Resource MemBAR 2 */
VMD_RES_COUNT
};
enum vmd_rootbus {
VMD_BUS_0 = 0,
+ VMD_BUS_1,
VMD_BUS_COUNT
};
@@ -94,6 +117,12 @@ enum vmd_features {
* proper power management of the SoC.
*/
VMD_FEAT_BIOS_PM_QUIRK = (1 << 5),
+
+ /*
+ * Starting from Intel Arrow Lake, VMD devices have their VMD root ports
+ * on the additional BUS1 root bus.
+ */
+ VMD_FEAT_HAS_BUS1_ROOTBUS = (1 << 6)
};
#define VMD_BIOS_PM_QUIRK_LTR 0x1003 /* 3145728 ns */
@@ -101,7 +130,8 @@ enum vmd_features {
#define VMD_FEATS_CLIENT (VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP | \
VMD_FEAT_HAS_BUS_RESTRICTIONS | \
VMD_FEAT_OFFSET_FIRST_VECTOR | \
- VMD_FEAT_BIOS_PM_QUIRK)
+ VMD_FEAT_BIOS_PM_QUIRK | \
+ VMD_FEAT_HAS_BUS1_ROOTBUS)
static DEFINE_IDA(vmd_instance_ida);
@@ -159,6 +189,7 @@ struct vmd_dev {
u8 first_vec;
char *name;
int instance;
+ bool bus1_rootbus;
};
static inline struct vmd_dev *vmd_from_bus(struct pci_bus *bus)
@@ -537,11 +568,23 @@ static inline void vmd_acpi_end(void) { }
static void vmd_domain_reset(struct vmd_dev *vmd)
{
- u16 bus, max_buses = resource_size(&vmd->resources[VMD_RES_CFGBAR]);
+ u16 bus, bus_cnt = resource_size(&vmd->resources[VMD_RES_CFGBAR]);
u8 dev, functions, fn, hdr_type;
char __iomem *base;
- for (bus = 0; bus < max_buses; bus++) {
+ /*
+ * Total bus-number count (bus_cnt) when BUS1 root bus is present:
+ * initialized with BUS0 CFGBAR range, extended by BUS1 CFGBAR range plus
+ * one bus number per root bus' own config space (bus_cnt is increased by 2)
+ * as the CFGBAR ranges do not include the root bus config space.
+ */
+
+ if (vmd->bus1_rootbus) {
+ bus_cnt += resource_size(&vmd->resources[VMD_RES_BUS1_CFGBAR]);
+ bus_cnt += 2;
+ }
+
+ for (bus = 0; bus < bus_cnt; bus++) {
for (dev = 0; dev < 32; dev++) {
base = vmd->cfgbar + PCIE_ECAM_OFFSET(bus,
PCI_DEVFN(dev, 0), 0);
@@ -587,12 +630,24 @@ static void vmd_attach_resources(struct vmd_dev *vmd)
{
vmd->dev->resource[VMD_MEMBAR1].child = &vmd->resources[VMD_RES_MBAR_1];
vmd->dev->resource[VMD_MEMBAR2].child = &vmd->resources[VMD_RES_MBAR_2];
+
+ if (vmd->bus1_rootbus) {
+ vmd->resources[VMD_RES_MBAR_1].sibling =
+ &vmd->resources[VMD_RES_BUS1_MBAR_1];
+ vmd->resources[VMD_RES_MBAR_2].sibling =
+ &vmd->resources[VMD_RES_BUS1_MBAR_2];
+ }
}
static void vmd_detach_resources(struct vmd_dev *vmd)
{
vmd->dev->resource[VMD_MEMBAR1].child = NULL;
vmd->dev->resource[VMD_MEMBAR2].child = NULL;
+
+ if (vmd->bus1_rootbus) {
+ vmd->resources[VMD_RES_MBAR_1].sibling = NULL;
+ vmd->resources[VMD_RES_MBAR_2].sibling = NULL;
+ }
}
static int vmd_get_phys_offsets(struct vmd_dev *vmd, bool native_hint,
@@ -649,7 +704,7 @@ static int vmd_get_phys_offsets(struct vmd_dev *vmd, bool native_hint,
return 0;
}
-static int vmd_get_bus_number_start(struct vmd_dev *vmd)
+static int vmd_get_bus_number_start(struct vmd_dev *vmd, unsigned long features)
{
struct pci_dev *dev = vmd->dev;
u16 reg;
@@ -668,6 +723,19 @@ static int vmd_get_bus_number_start(struct vmd_dev *vmd)
case 2:
vmd->busn_start[VMD_BUS_0] = VMD_RESTRICT_2_BUS_START;
break;
+ case 3:
+ if (!(features & VMD_FEAT_HAS_BUS1_ROOTBUS)) {
+ pci_err(dev, "VMD Bus Restriction detected type %d, but BUS1 Rootbus is not supported, aborting.\n",
+ BUS_RESTRICT_CFG(reg));
+ return -ENODEV;
+ }
+
+ /* VMD (on root bus 0) secondary bus start number */
+ vmd->busn_start[VMD_BUS_0] = VMD_RESTRICT_2_BUS_START;
+ /* VMD (on root bus 1) secondary bus start number */
+ vmd->busn_start[VMD_BUS_1] = VMD_RESTRICT_3_BUS_START;
+ vmd->bus1_rootbus = true;
+ break;
default:
pci_err(dev, "Unknown Bus Offset Setting (%d)\n",
BUS_RESTRICT_CFG(reg));
@@ -799,6 +867,30 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
(resource_size(res) >> 20) - 1,
.flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
};
+
+ if (vmd->bus1_rootbus) {
+ u16 bus0_range = 0;
+ u16 bus1_range = 0;
+
+ pci_read_config_word(vmd->dev, VMD_BUSRANGE0, &bus0_range);
+ pci_read_config_word(vmd->dev, VMD_BUSRANGE1, &bus1_range);
+
+ /*
+ * Resize BUS0 CFGBAR range to make space for BUS1
+ * owned devices by adjusting range end with value stored in
+ * VMD_BUSRANGE0 register.
+ */
+ vmd->resources[VMD_RES_CFGBAR].start = VMD_BUS_START(bus0_range);
+ vmd->resources[VMD_RES_CFGBAR].end = VMD_BUS_END(bus0_range);
+
+ vmd->resources[VMD_RES_BUS1_CFGBAR] = (struct resource){
+ .name = "VMD CFGBAR BUS1",
+ .start = VMD_BUS_START(bus1_range),
+ .end = VMD_BUS_END(bus1_range),
+ .flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
+ .parent = res,
+ };
+ }
}
/*
@@ -828,8 +920,8 @@ static void vmd_configure_membar(struct vmd_dev *vmd,
flags &= ~IORESOURCE_MEM_64;
vmd->resources[resource_number] = (struct resource){
- .name = kasprintf(GFP_KERNEL, "VMD MEMBAR%d",
- membar_number / 2),
+ .name = kasprintf(GFP_KERNEL, "VMD MEMBAR%d %s", membar_number / 2,
+ resource_number > VMD_RES_MBAR_2 ? "BUS1" : ""),
.start = res->start + start_offset,
.end = res->end - end_offset,
.flags = flags,
@@ -840,41 +932,83 @@ static void vmd_configure_membar(struct vmd_dev *vmd,
static void vmd_configure_membar1_membar2(struct vmd_dev *vmd,
resource_size_t mbar2_ofs)
{
- vmd_configure_membar(vmd, 1, VMD_MEMBAR1, 0, 0);
- vmd_configure_membar(vmd, 2, VMD_MEMBAR2, mbar2_ofs, 0);
+ if (vmd->bus1_rootbus) {
+ u32 bus1_mbar1_ofs = 0;
+ u64 bus1_mbar2_ofs = 0;
+ u32 reg;
+
+ pci_read_config_dword(vmd->dev, VMD_MEMBAR1_OFFSET,
+ &bus1_mbar1_ofs);
+
+ pci_read_config_dword(vmd->dev, VMD_MEMBAR2_OFFSET1, ®);
+ bus1_mbar2_ofs = reg;
+
+ pci_read_config_dword(vmd->dev, VMD_MEMBAR2_OFFSET2, ®);
+ bus1_mbar2_ofs |= (u64)reg << 32;
+
+ /*
+ * Resize BUS MEMBAR1 and MEMBAR2 ranges to make space
+ * for BUS1 owned devices by adjusting range end with values
+ * stored in VMD_MEMBAR1_OFFSET and VMD_MEMBAR2_OFFSET registers
+ */
+ vmd_configure_membar(vmd, VMD_RES_MBAR_1, VMD_MEMBAR1, 0,
+ bus1_mbar1_ofs);
+ vmd_configure_membar(vmd, VMD_RES_MBAR_2, VMD_MEMBAR2,
+ mbar2_ofs, bus1_mbar2_ofs - mbar2_ofs);
+
+ vmd_configure_membar(vmd, VMD_RES_BUS1_MBAR_1, VMD_MEMBAR1,
+ bus1_mbar1_ofs, 0);
+ vmd_configure_membar(vmd, VMD_RES_BUS1_MBAR_2, VMD_MEMBAR2,
+ mbar2_ofs + bus1_mbar2_ofs, 0);
+ } else {
+ vmd_configure_membar(vmd, VMD_RES_MBAR_1, VMD_MEMBAR1, 0, 0);
+ vmd_configure_membar(vmd, VMD_RES_MBAR_2, VMD_MEMBAR2,
+ mbar2_ofs, 0);
+ }
}
-static int vmd_create_bus(struct vmd_dev *vmd, struct pci_sysdata *sd,
- resource_size_t *offset)
+static int vmd_create_bus(struct vmd_dev *vmd, enum vmd_rootbus bus_number,
+ struct pci_sysdata *sd, resource_size_t *offset)
{
+ u8 cfgbar = bus_number * 3;
+ u8 membar1 = cfgbar + 1;
+ u8 membar2 = cfgbar + 2;
+ struct pci_bus *vmd_bus;
LIST_HEAD(resources);
- pci_add_resource(&resources, &vmd->resources[VMD_RES_CFGBAR]);
- pci_add_resource_offset(&resources, &vmd->resources[VMD_RES_MBAR_1],
+ pci_add_resource(&resources, &vmd->resources[cfgbar]);
+ pci_add_resource_offset(&resources, &vmd->resources[membar1],
offset[0]);
- pci_add_resource_offset(&resources, &vmd->resources[VMD_RES_MBAR_2],
+ pci_add_resource_offset(&resources, &vmd->resources[membar2],
offset[1]);
- vmd->bus[VMD_BUS_0] = pci_create_root_bus(&vmd->dev->dev,
- vmd->busn_start[VMD_BUS_0],
- &vmd_ops, sd, &resources);
- if (!vmd->bus[VMD_BUS_0]) {
- pci_bus_release_emul_domain_nr(sd->domain);
+ vmd_bus = pci_create_root_bus(&vmd->dev->dev,
+ vmd->busn_start[bus_number], &vmd_ops, sd,
+ &resources);
+
+ if (!vmd_bus) {
pci_free_resource_list(&resources);
- vmd_remove_irq_domain(vmd);
+
+ /* Clean only when BUS0 root bus cannot be created */
+ if (bus_number == VMD_PRIMARY_BUS0) {
+ pci_bus_release_emul_domain_nr(sd->domain);
+ vmd_remove_irq_domain(vmd);
+ }
return -ENODEV;
}
vmd_copy_host_bridge_flags(pci_find_host_bridge(vmd->dev->bus),
- to_pci_host_bridge(vmd->bus[VMD_BUS_0]->bridge));
+ to_pci_host_bridge(vmd_bus->bridge));
vmd_attach_resources(vmd);
if (vmd->irq_domain)
- dev_set_msi_domain(&vmd->bus[VMD_BUS_0]->dev, vmd->irq_domain);
+ dev_set_msi_domain(&vmd_bus->dev, vmd->irq_domain);
else
- dev_set_msi_domain(&vmd->bus[VMD_BUS_0]->dev,
+ dev_set_msi_domain(&vmd_bus->dev,
dev_get_msi_domain(&vmd->dev->dev));
+ vmd->bus[bus_number] = vmd_bus;
+
return 0;
}
@@ -887,7 +1021,9 @@ static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
vmd_acpi_begin();
pci_scan_child_bus(bus);
- vmd_domain_reset(vmd_from_bus(bus));
+
+ if (bus->primary == VMD_PRIMARY_BUS0)
+ vmd_domain_reset(vmd_from_bus(bus));
/*
* When Intel VMD is enabled, the OS does not discover the Root Ports
@@ -955,7 +1091,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
* limits the bus range to between 0-127, 128-255, or 224-255
*/
if (features & VMD_FEAT_HAS_BUS_RESTRICTIONS) {
- ret = vmd_get_bus_number_start(vmd);
+ ret = vmd_get_bus_number_start(vmd, features);
if (ret)
return ret;
}
@@ -1016,7 +1152,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
sd->node = pcibus_to_node(vmd->dev->bus);
- ret = vmd_create_bus(vmd, sd, offset);
+ ret = vmd_create_bus(vmd, VMD_BUS_0, sd, offset);
if (ret) {
pci_err(vmd->dev, "Can't create bus: %d\n", ret);
return ret;
@@ -1028,6 +1164,27 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
vmd_bus_enumeration(vmd->bus[VMD_BUS_0], features);
+ if (vmd->bus1_rootbus) {
+ ret = vmd_create_bus(vmd, VMD_BUS_1, sd, offset);
+ if (ret) {
+ pci_err(vmd->dev, "Can't create BUS1: %d\n", ret);
+ return ret;
+ }
+
+ /*
+ * Primary bus is not set by pci_create_root_bus(), it is
+ * updated here
+ */
+ vmd->bus[VMD_BUS_1]->primary = VMD_PRIMARY_BUS1;
+
+ WARN(sysfs_create_link(&vmd->dev->dev.kobj,
+ &vmd->bus[VMD_BUS_1]->dev.kobj,
+ "domain1"),
+ "Can't create symlink to domain1\n");
+
+ vmd_bus_enumeration(vmd->bus[VMD_BUS_1], features);
+ }
+
return 0;
}
@@ -1126,10 +1283,18 @@ static void vmd_remove(struct pci_dev *dev)
sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
pci_remove_root_bus(vmd->bus[VMD_BUS_0]);
- /* CFGBAR is static, does not require releasing memory */
+ /* CFGBARs are static, do not require releasing memory */
kfree(vmd->resources[VMD_RES_MBAR_1].name);
kfree(vmd->resources[VMD_RES_MBAR_2].name);
+ if (vmd->bus1_rootbus) {
+ pci_stop_root_bus(vmd->bus[VMD_BUS_1]);
+ sysfs_remove_link(&vmd->dev->dev.kobj, "domain1");
+ pci_remove_root_bus(vmd->bus[VMD_BUS_1]);
+ kfree(vmd->resources[VMD_RES_BUS1_MBAR_1].name);
+ kfree(vmd->resources[VMD_RES_BUS1_MBAR_2].name);
+ }
+
vmd_cleanup_srcu(vmd);
vmd_detach_resources(vmd);
vmd_remove_irq_domain(vmd);
@@ -1222,4 +1387,4 @@ module_pci_driver(vmd_drv);
MODULE_AUTHOR("Intel Corporation");
MODULE_DESCRIPTION("Volume Management Device driver");
MODULE_LICENSE("GPL v2");
-MODULE_VERSION("0.6");
+MODULE_VERSION("0.7");
--
2.43.0
^ permalink raw reply related
* [PATCH v5 6/8] PCI: vmd: Convert bus and busn_start to an array
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Convert bus and busn_start from scalar to an array to support
multiple VMD buses in the future. No functional changes.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 46 +++++++++++++++++++++---------------
1 file changed, 27 insertions(+), 19 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 9ff82a24e859..2bfc16875d91 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -49,6 +49,11 @@ enum vmd_resource {
VMD_RES_COUNT
};
+enum vmd_rootbus {
+ VMD_BUS_0 = 0,
+ VMD_BUS_COUNT
+};
+
enum vmd_features {
/*
* Device may contain registers which hint the physical location of the
@@ -149,8 +154,8 @@ struct vmd_dev {
struct pci_sysdata sysdata;
struct resource resources[VMD_RES_COUNT];
struct irq_domain *irq_domain;
- struct pci_bus *bus;
- u8 busn_start;
+ struct pci_bus *bus[VMD_BUS_COUNT];
+ u8 busn_start[VMD_BUS_COUNT];
u8 first_vec;
char *name;
int instance;
@@ -404,7 +409,7 @@ static void vmd_remove_irq_domain(struct vmd_dev *vmd)
static void __iomem *vmd_cfg_addr(struct vmd_dev *vmd, struct pci_bus *bus,
unsigned int devfn, int reg, int len)
{
- unsigned int busnr_ecam = bus->number - vmd->busn_start;
+ unsigned int busnr_ecam = bus->number - vmd->busn_start[VMD_BUS_0];
u32 offset = PCIE_ECAM_OFFSET(busnr_ecam, devfn, reg);
if (offset + len >= resource_size(&vmd->dev->resource[VMD_CFGBAR]))
@@ -655,13 +660,13 @@ static int vmd_get_bus_number_start(struct vmd_dev *vmd)
switch (BUS_RESTRICT_CFG(reg)) {
case 0:
- vmd->busn_start = VMD_RESTRICT_0_BUS_START;
+ vmd->busn_start[VMD_BUS_0] = VMD_RESTRICT_0_BUS_START;
break;
case 1:
- vmd->busn_start = VMD_RESTRICT_1_BUS_START;
+ vmd->busn_start[VMD_BUS_0] = VMD_RESTRICT_1_BUS_START;
break;
case 2:
- vmd->busn_start = VMD_RESTRICT_2_BUS_START;
+ vmd->busn_start[VMD_BUS_0] = VMD_RESTRICT_2_BUS_START;
break;
default:
pci_err(dev, "Unknown Bus Offset Setting (%d)\n",
@@ -789,8 +794,9 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
vmd->resources[VMD_RES_CFGBAR] = (struct resource){
.name = "VMD CFGBAR",
- .start = vmd->busn_start,
- .end = vmd->busn_start + (resource_size(res) >> 20) - 1,
+ .start = vmd->busn_start[VMD_BUS_0],
+ .end = vmd->busn_start[VMD_BUS_0] +
+ (resource_size(res) >> 20) - 1,
.flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
};
}
@@ -849,9 +855,10 @@ static int vmd_create_bus(struct vmd_dev *vmd, struct pci_sysdata *sd,
pci_add_resource_offset(&resources, &vmd->resources[VMD_RES_MBAR_2],
offset[1]);
- vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
- &vmd_ops, sd, &resources);
- if (!vmd->bus) {
+ vmd->bus[VMD_BUS_0] = pci_create_root_bus(&vmd->dev->dev,
+ vmd->busn_start[VMD_BUS_0],
+ &vmd_ops, sd, &resources);
+ if (!vmd->bus[VMD_BUS_0]) {
pci_bus_release_emul_domain_nr(sd->domain);
pci_free_resource_list(&resources);
vmd_remove_irq_domain(vmd);
@@ -859,13 +866,13 @@ static int vmd_create_bus(struct vmd_dev *vmd, struct pci_sysdata *sd,
}
vmd_copy_host_bridge_flags(pci_find_host_bridge(vmd->dev->bus),
- to_pci_host_bridge(vmd->bus->bridge));
+ to_pci_host_bridge(vmd->bus[VMD_BUS_0]->bridge));
vmd_attach_resources(vmd);
if (vmd->irq_domain)
- dev_set_msi_domain(&vmd->bus->dev, vmd->irq_domain);
+ dev_set_msi_domain(&vmd->bus[VMD_BUS_0]->dev, vmd->irq_domain);
else
- dev_set_msi_domain(&vmd->bus->dev,
+ dev_set_msi_domain(&vmd->bus[VMD_BUS_0]->dev,
dev_get_msi_domain(&vmd->dev->dev));
return 0;
@@ -1015,10 +1022,11 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
return ret;
}
- WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
- "domain"), "Can't create symlink to domain\n");
+ WARN(sysfs_create_link(&vmd->dev->dev.kobj,
+ &vmd->bus[VMD_BUS_0]->dev.kobj, "domain"),
+ "Can't create symlink to domain\n");
- vmd_bus_enumeration(vmd->bus, features);
+ vmd_bus_enumeration(vmd->bus[VMD_BUS_0], features);
return 0;
}
@@ -1114,9 +1122,9 @@ static void vmd_remove(struct pci_dev *dev)
{
struct vmd_dev *vmd = pci_get_drvdata(dev);
- pci_stop_root_bus(vmd->bus);
+ pci_stop_root_bus(vmd->bus[VMD_BUS_0]);
sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
- pci_remove_root_bus(vmd->bus);
+ pci_remove_root_bus(vmd->bus[VMD_BUS_0]);
/* CFGBAR is static, does not require releasing memory */
kfree(vmd->resources[VMD_RES_MBAR_1].name);
--
2.43.0
^ permalink raw reply related
* [PATCH v5 5/8] PCI: vmd: Replace hardcoded values with enum and defines
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Add enum vmd_resource type to replace hardcoded values. Add defines for
vmd bus start number based on VMD restriction value. No functional
changes.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 43 ++++++++++++++++++++++++------------
1 file changed, 29 insertions(+), 14 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index c0a953df5f23..9ff82a24e859 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -26,6 +26,11 @@
#define VMD_MEMBAR1 2
#define VMD_MEMBAR2 4
+/* VMD restriction value determines secondary start bus number */
+#define VMD_RESTRICT_0_BUS_START 0x0
+#define VMD_RESTRICT_1_BUS_START 0x80
+#define VMD_RESTRICT_2_BUS_START 0xE0
+
#define PCI_REG_VMCAP 0x40
#define BUS_RESTRICT_CAP(vmcap) (vmcap & 0x1)
#define PCI_REG_VMCONFIG 0x44
@@ -37,6 +42,13 @@
#define MB2_SHADOW_OFFSET 0x2000
#define MB2_SHADOW_SIZE 16
+enum vmd_resource {
+ VMD_RES_CFGBAR = 0, /* VMD Bus0 Config BAR */
+ VMD_RES_MBAR_1, /* VMD Bus0 Resource MemBAR 1 */
+ VMD_RES_MBAR_2, /* VMD Bus0 Resource MemBAR 2 */
+ VMD_RES_COUNT
+};
+
enum vmd_features {
/*
* Device may contain registers which hint the physical location of the
@@ -135,7 +147,7 @@ struct vmd_dev {
struct vmd_irq_list *irqs;
struct pci_sysdata sysdata;
- struct resource resources[3];
+ struct resource resources[VMD_RES_COUNT];
struct irq_domain *irq_domain;
struct pci_bus *bus;
u8 busn_start;
@@ -520,7 +532,7 @@ static inline void vmd_acpi_end(void) { }
static void vmd_domain_reset(struct vmd_dev *vmd)
{
- u16 bus, max_buses = resource_size(&vmd->resources[0]);
+ u16 bus, max_buses = resource_size(&vmd->resources[VMD_RES_CFGBAR]);
u8 dev, functions, fn, hdr_type;
char __iomem *base;
@@ -568,8 +580,8 @@ static void vmd_domain_reset(struct vmd_dev *vmd)
static void vmd_attach_resources(struct vmd_dev *vmd)
{
- vmd->dev->resource[VMD_MEMBAR1].child = &vmd->resources[1];
- vmd->dev->resource[VMD_MEMBAR2].child = &vmd->resources[2];
+ vmd->dev->resource[VMD_MEMBAR1].child = &vmd->resources[VMD_RES_MBAR_1];
+ vmd->dev->resource[VMD_MEMBAR2].child = &vmd->resources[VMD_RES_MBAR_2];
}
static void vmd_detach_resources(struct vmd_dev *vmd)
@@ -643,13 +655,13 @@ static int vmd_get_bus_number_start(struct vmd_dev *vmd)
switch (BUS_RESTRICT_CFG(reg)) {
case 0:
- vmd->busn_start = 0;
+ vmd->busn_start = VMD_RESTRICT_0_BUS_START;
break;
case 1:
- vmd->busn_start = 128;
+ vmd->busn_start = VMD_RESTRICT_1_BUS_START;
break;
case 2:
- vmd->busn_start = 224;
+ vmd->busn_start = VMD_RESTRICT_2_BUS_START;
break;
default:
pci_err(dev, "Unknown Bus Offset Setting (%d)\n",
@@ -775,7 +787,7 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
{
struct resource *res = &vmd->dev->resource[VMD_CFGBAR];
- vmd->resources[0] = (struct resource){
+ vmd->resources[VMD_RES_CFGBAR] = (struct resource){
.name = "VMD CFGBAR",
.start = vmd->busn_start,
.end = vmd->busn_start + (resource_size(res) >> 20) - 1,
@@ -794,7 +806,8 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
*
* Function fills resource buffer inside the VMD structure.
*/
-static void vmd_configure_membar(struct vmd_dev *vmd, u8 resource_number,
+static void vmd_configure_membar(struct vmd_dev *vmd,
+ enum vmd_resource resource_number,
u8 membar_number, resource_size_t start_offset,
resource_size_t end_offset)
{
@@ -830,9 +843,11 @@ static int vmd_create_bus(struct vmd_dev *vmd, struct pci_sysdata *sd,
{
LIST_HEAD(resources);
- pci_add_resource(&resources, &vmd->resources[0]);
- pci_add_resource_offset(&resources, &vmd->resources[1], offset[0]);
- pci_add_resource_offset(&resources, &vmd->resources[2], offset[1]);
+ pci_add_resource(&resources, &vmd->resources[VMD_RES_CFGBAR]);
+ pci_add_resource_offset(&resources, &vmd->resources[VMD_RES_MBAR_1],
+ offset[0]);
+ pci_add_resource_offset(&resources, &vmd->resources[VMD_RES_MBAR_2],
+ offset[1]);
vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
&vmd_ops, sd, &resources);
@@ -1104,8 +1119,8 @@ static void vmd_remove(struct pci_dev *dev)
pci_remove_root_bus(vmd->bus);
/* CFGBAR is static, does not require releasing memory */
- kfree(vmd->resources[1].name);
- kfree(vmd->resources[2].name);
+ kfree(vmd->resources[VMD_RES_MBAR_1].name);
+ kfree(vmd->resources[VMD_RES_MBAR_2].name);
vmd_cleanup_srcu(vmd);
vmd_detach_resources(vmd);
--
2.43.0
^ permalink raw reply related
* [PATCH v5 4/8] PCI: vmd: Add vmd_create_bus()
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Move the VMD bus initialization code to a new helper vmd_create_bus().
No functional changes.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 57 ++++++++++++++++++++++--------------
1 file changed, 35 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 443a0cefb7f7..c0a953df5f23 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -825,6 +825,37 @@ static void vmd_configure_membar1_membar2(struct vmd_dev *vmd,
vmd_configure_membar(vmd, 2, VMD_MEMBAR2, mbar2_ofs, 0);
}
+static int vmd_create_bus(struct vmd_dev *vmd, struct pci_sysdata *sd,
+ resource_size_t *offset)
+{
+ LIST_HEAD(resources);
+
+ pci_add_resource(&resources, &vmd->resources[0]);
+ pci_add_resource_offset(&resources, &vmd->resources[1], offset[0]);
+ pci_add_resource_offset(&resources, &vmd->resources[2], offset[1]);
+
+ vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
+ &vmd_ops, sd, &resources);
+ if (!vmd->bus) {
+ pci_bus_release_emul_domain_nr(sd->domain);
+ pci_free_resource_list(&resources);
+ vmd_remove_irq_domain(vmd);
+ return -ENODEV;
+ }
+
+ vmd_copy_host_bridge_flags(pci_find_host_bridge(vmd->dev->bus),
+ to_pci_host_bridge(vmd->bus->bridge));
+
+ vmd_attach_resources(vmd);
+ if (vmd->irq_domain)
+ dev_set_msi_domain(&vmd->bus->dev, vmd->irq_domain);
+ else
+ dev_set_msi_domain(&vmd->bus->dev,
+ dev_get_msi_domain(&vmd->dev->dev));
+
+ return 0;
+}
+
static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
{
struct pci_bus *child;
@@ -876,7 +907,6 @@ static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
{
struct pci_sysdata *sd = &vmd->sysdata;
- LIST_HEAD(resources);
resource_size_t offset[2] = {0};
resource_size_t membar2_offset = 0x2000;
int ret;
@@ -950,10 +980,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
vmd_set_msi_remapping(vmd, false);
}
- pci_add_resource(&resources, &vmd->resources[0]);
- pci_add_resource_offset(&resources, &vmd->resources[1], offset[0]);
- pci_add_resource_offset(&resources, &vmd->resources[2], offset[1]);
-
sd->vmd_dev = vmd->dev;
/*
@@ -968,25 +994,12 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
sd->node = pcibus_to_node(vmd->dev->bus);
- vmd->bus = pci_create_root_bus(&vmd->dev->dev, vmd->busn_start,
- &vmd_ops, sd, &resources);
- if (!vmd->bus) {
- pci_bus_release_emul_domain_nr(sd->domain);
- pci_free_resource_list(&resources);
- vmd_remove_irq_domain(vmd);
- return -ENODEV;
+ ret = vmd_create_bus(vmd, sd, offset);
+ if (ret) {
+ pci_err(vmd->dev, "Can't create bus: %d\n", ret);
+ return ret;
}
- vmd_copy_host_bridge_flags(pci_find_host_bridge(vmd->dev->bus),
- to_pci_host_bridge(vmd->bus->bridge));
-
- vmd_attach_resources(vmd);
- if (vmd->irq_domain)
- dev_set_msi_domain(&vmd->bus->dev, vmd->irq_domain);
- else
- dev_set_msi_domain(&vmd->bus->dev,
- dev_get_msi_domain(&vmd->dev->dev));
-
WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
"domain"), "Can't create symlink to domain\n");
--
2.43.0
^ permalink raw reply related
* [PATCH v5 3/8] PCI: vmd: Add vmd_configure_membar() and vmd_configure_membar1_membar2()
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Move the MEMBAR1 and MEMBAR2 registry initialization code to new helpers
vmd_configure_membar() and vmd_configure_membar1_membar2(). No functional
changes.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 80 ++++++++++++++++++++++--------------
1 file changed, 50 insertions(+), 30 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 2c824660cabd..443a0cefb7f7 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -783,6 +783,48 @@ static void vmd_configure_cfgbar(struct vmd_dev *vmd)
};
}
+/*
+ * vmd_configure_membar - Configure VMD MemBAR register, which points
+ * to MMIO address assigned by the OS or BIOS.
+ * @vmd: the VMD device
+ * @resource_number: resource buffer number to be filled in
+ * @membar_number: number of the MemBAR
+ * @start_offset: 4K aligned offset applied to start of VMD’s MEMBAR MMIO space
+ * @end_offset: 4K aligned offset applied to end of VMD’s MEMBAR MMIO space
+ *
+ * Function fills resource buffer inside the VMD structure.
+ */
+static void vmd_configure_membar(struct vmd_dev *vmd, u8 resource_number,
+ u8 membar_number, resource_size_t start_offset,
+ resource_size_t end_offset)
+{
+ u32 upper_bits;
+ unsigned long flags;
+
+ struct resource *res = &vmd->dev->resource[membar_number];
+
+ upper_bits = upper_32_bits(res->end);
+ flags = res->flags & ~IORESOURCE_SIZEALIGN;
+ if (!upper_bits)
+ flags &= ~IORESOURCE_MEM_64;
+
+ vmd->resources[resource_number] = (struct resource){
+ .name = kasprintf(GFP_KERNEL, "VMD MEMBAR%d",
+ membar_number / 2),
+ .start = res->start + start_offset,
+ .end = res->end - end_offset,
+ .flags = flags,
+ .parent = res,
+ };
+}
+
+static void vmd_configure_membar1_membar2(struct vmd_dev *vmd,
+ resource_size_t mbar2_ofs)
+{
+ vmd_configure_membar(vmd, 1, VMD_MEMBAR1, 0, 0);
+ vmd_configure_membar(vmd, 2, VMD_MEMBAR2, mbar2_ofs, 0);
+}
+
static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
{
struct pci_bus *child;
@@ -834,9 +876,6 @@ static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
{
struct pci_sysdata *sd = &vmd->sysdata;
- struct resource *res;
- u32 upper_bits;
- unsigned long flags;
LIST_HEAD(resources);
resource_size_t offset[2] = {0};
resource_size_t membar2_offset = 0x2000;
@@ -883,36 +922,12 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
*
* The only way we could use a 64-bit non-prefetchable MEMBAR is
* if its address is <4GB so that we can convert it to a 32-bit
- * resource. To be visible to the host OS, all VMD endpoints must
+ * resource. To be visible to the host OS, all VMD endpoints must
* be initially configured by platform BIOS, which includes setting
- * up these resources. We can assume the device is configured
+ * up these resources. We can assume the device is configured
* according to the platform needs.
*/
- res = &vmd->dev->resource[VMD_MEMBAR1];
- upper_bits = upper_32_bits(res->end);
- flags = res->flags & ~IORESOURCE_SIZEALIGN;
- if (!upper_bits)
- flags &= ~IORESOURCE_MEM_64;
- vmd->resources[1] = (struct resource) {
- .name = "VMD MEMBAR1",
- .start = res->start,
- .end = res->end,
- .flags = flags,
- .parent = res,
- };
-
- res = &vmd->dev->resource[VMD_MEMBAR2];
- upper_bits = upper_32_bits(res->end);
- flags = res->flags & ~IORESOURCE_SIZEALIGN;
- if (!upper_bits)
- flags &= ~IORESOURCE_MEM_64;
- vmd->resources[2] = (struct resource) {
- .name = "VMD MEMBAR2",
- .start = res->start + membar2_offset,
- .end = res->end,
- .flags = flags,
- .parent = res,
- };
+ vmd_configure_membar1_membar2(vmd, membar2_offset);
/*
* Currently MSI remapping must be enabled in guest passthrough mode
@@ -1074,6 +1089,11 @@ static void vmd_remove(struct pci_dev *dev)
pci_stop_root_bus(vmd->bus);
sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
pci_remove_root_bus(vmd->bus);
+
+ /* CFGBAR is static, does not require releasing memory */
+ kfree(vmd->resources[1].name);
+ kfree(vmd->resources[2].name);
+
vmd_cleanup_srcu(vmd);
vmd_detach_resources(vmd);
vmd_remove_irq_domain(vmd);
--
2.43.0
^ permalink raw reply related
* [PATCH v5 2/8] PCI: vmd: Add vmd_configure_cfgbar() helper function
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Move the VMD CFGBAR initialization code to a new helper
vmd_configure_cfgbar(). No functional changes.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 070b247b3fbb..2c824660cabd 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -771,6 +771,18 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
return 0;
}
+static void vmd_configure_cfgbar(struct vmd_dev *vmd)
+{
+ struct resource *res = &vmd->dev->resource[VMD_CFGBAR];
+
+ vmd->resources[0] = (struct resource){
+ .name = "VMD CFGBAR",
+ .start = vmd->busn_start,
+ .end = vmd->busn_start + (resource_size(res) >> 20) - 1,
+ .flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
+ };
+}
+
static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
{
struct pci_bus *child;
@@ -857,13 +869,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
return ret;
}
- res = &vmd->dev->resource[VMD_CFGBAR];
- vmd->resources[0] = (struct resource) {
- .name = "VMD CFGBAR",
- .start = vmd->busn_start,
- .end = vmd->busn_start + (resource_size(res) >> 20) - 1,
- .flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED,
- };
+ vmd_configure_cfgbar(vmd);
/*
* If the window is below 4GB, clear IORESOURCE_MEM_64 so we can
--
2.43.0
^ permalink raw reply related
* [PATCH v5 1/8] PCI: vmd: Add vmd_bus_enumeration() helper function
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
In-Reply-To: <20260625082940.1290740-1-szymon.durawa@linux.intel.com>
Move the vmd bus enumeration code to a new helper vmd_bus_enumeration().
No functional changes.
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
---
drivers/pci/controller/vmd.c | 89 ++++++++++++++++++++----------------
1 file changed, 49 insertions(+), 40 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index d4ae250d4bc6..070b247b3fbb 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -771,6 +771,54 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
return 0;
}
+static void vmd_bus_enumeration(struct pci_bus *bus, unsigned long features)
+{
+ struct pci_bus *child;
+ struct pci_dev *dev;
+ int ret;
+
+ vmd_acpi_begin();
+
+ pci_scan_child_bus(bus);
+ vmd_domain_reset(vmd_from_bus(bus));
+
+ /*
+ * When Intel VMD is enabled, the OS does not discover the Root Ports
+ * owned by Intel VMD within the MMCFG space. pci_reset_bus() applies
+ * a reset to the parent of the PCI device supplied as argument. This
+ * is why we pass a child device, so the reset can be triggered at
+ * the Intel bridge level and propagated to all the children in the
+ * hierarchy.
+ */
+ list_for_each_entry(child, &bus->children, node) {
+ if (!list_empty(&child->devices)) {
+ dev = list_first_entry(&child->devices, struct pci_dev,
+ bus_list);
+ ret = pci_reset_bus(dev);
+ if (ret)
+ pci_warn(dev, "can't reset device: %d\n", ret);
+
+ break;
+ }
+ }
+
+ pci_assign_unassigned_bus_resources(bus);
+
+ pci_walk_bus(bus, vmd_pm_enable_quirk, &features);
+
+ /*
+ * VMD root buses are virtual and don't return true on pci_is_pcie()
+ * and will fail pcie_bus_configure_settings() early. It can instead be
+ * run on each of the real root ports.
+ */
+ list_for_each_entry(child, &bus->children, node)
+ pcie_bus_configure_settings(child);
+
+ pci_bus_add_devices(bus);
+
+ vmd_acpi_end();
+}
+
static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
{
struct pci_sysdata *sd = &vmd->sysdata;
@@ -780,8 +828,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
LIST_HEAD(resources);
resource_size_t offset[2] = {0};
resource_size_t membar2_offset = 0x2000;
- struct pci_bus *child;
- struct pci_dev *dev;
int ret;
/*
@@ -923,45 +969,8 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
"domain"), "Can't create symlink to domain\n");
- vmd_acpi_begin();
-
- pci_scan_child_bus(vmd->bus);
- vmd_domain_reset(vmd);
+ vmd_bus_enumeration(vmd->bus, features);
- /* When Intel VMD is enabled, the OS does not discover the Root Ports
- * owned by Intel VMD within the MMCFG space. pci_reset_bus() applies
- * a reset to the parent of the PCI device supplied as argument. This
- * is why we pass a child device, so the reset can be triggered at
- * the Intel bridge level and propagated to all the children in the
- * hierarchy.
- */
- list_for_each_entry(child, &vmd->bus->children, node) {
- if (!list_empty(&child->devices)) {
- dev = list_first_entry(&child->devices,
- struct pci_dev, bus_list);
- ret = pci_reset_bus(dev);
- if (ret)
- pci_warn(dev, "can't reset device: %d\n", ret);
-
- break;
- }
- }
-
- pci_assign_unassigned_bus_resources(vmd->bus);
-
- pci_walk_bus(vmd->bus, vmd_pm_enable_quirk, &features);
-
- /*
- * VMD root buses are virtual and don't return true on pci_is_pcie()
- * and will fail pcie_bus_configure_settings() early. It can instead be
- * run on each of the real root ports.
- */
- list_for_each_entry(child, &vmd->bus->children, node)
- pcie_bus_configure_settings(child);
-
- pci_bus_add_devices(vmd->bus);
-
- vmd_acpi_end();
return 0;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v5 0/8] VMD add second root bus support
From: Szymon Durawa @ 2026-06-25 8:29 UTC (permalink / raw)
To: helgaas, nirmal.patel, szymon.durawa, djbw, linux-pci, lukas
Cc: Bjorn Helgaas
This patch series implements second rootbus support inside Intel VMD
module. Current implementation allows VMD to take ownership of devices
only on first bus (Rootbus0). Starting from Intel Arrow Lake, VMD exposes
second bus (Rootbus1) to allow VMD to own devices on this bus as well.
VMD MMIO BARs (CFGBAR. MEMBAR1 and MEMBAR2) are now shared between
Rootbus0 and Rootbus1. Reconfiguration of 3 MMIO BARs is required by
resizing current MMIO BARs ranges. It allows to find/register VMD Rootbus1
and discovers devices or root ports under it.
Patches 1 to 6 introduce code refactoring without functional changes.
Patch 7 implements VMD Rootbus1 support and patch 8 provides workaround
for rootbus number hardwired to fixed non-zero value. Patch 8 is necessary
for correct enumeration attached devices under VMD Rootbus1. Without it
user cannot access those devices as they are not visible in the system,
only drives under VMD Rootbus0 are available to the user.
Changes from v1:
- splitting series into more commits, requested by Bjorn
- adding helper functions, suggested by Bjorn
- minor typos and unclear wording updated, suggested by Bjorn
Changes from v2:
- wording update in commit logs, suggested by Bjorn
Changes from v3:
- using GENMASK() instead of manual bits shifting, suggested by Bjorn
- converting decimal number to hex representation, suggested by Bjorn
- wording update in commit logs, suggested by Bjorn
Changes from v4:
- Update Dan's email address
- Resending the whole series to the correct mailing list
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Dan Williams <djbw@kernel.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: linux-pci@vger.kernel.org
Suggested-by: Nirmal Patel <nirmal.patel@linux.intel.com>
Signed-off-by: Szymon Durawa <szymon.durawa@linux.intel.com>
Szymon Durawa (8):
PCI: vmd: Add vmd_bus_enumeration() helper function
PCI: vmd: Add vmd_configure_cfgbar() helper function
PCI: vmd: Add vmd_configure_membar() and
vmd_configure_membar1_membar2()
PCI: vmd: Add vmd_create_bus()
PCI: vmd: Replace hardcoded values with enum and defines
PCI: vmd: Convert bus and busn_start to an array
PCI: vmd: Add support for second rootbus under VMD
PCI: vmd: Add workaround for bus number hardwired to fixed non-zero
value
drivers/pci/controller/vmd.c | 488 +++++++++++++++++++++++++++--------
1 file changed, 376 insertions(+), 112 deletions(-)
--
2.43.0
^ permalink raw reply
* [RFC PATCH 0/0] PCI P2PDMA: Add observability support via tracepoints, debugfs, and sysfs.
From: xiaobing.li @ 2026-06-25 1:59 UTC (permalink / raw)
To: bhelgaas, logang, m.szyprowski, linux-pci, linux-kernel,
linux-trace-kernel
Cc: kun.dou, peiwei.li
In-Reply-To: <CGME20260625015930epcas5p33fa9d4833d45b53597e2994fb9ec2577@epcas5p3.samsung.com>
Hi all,
The Linux kernel's P2P DMA infrastructure is already very mature, but currently it is not user-friendly in terms of metric observability.
For example, without manually adding logs, there is no intuitive data to see how many P2P transfers, which paths are taken,
and how performance is. It is impossible to clearly observe P2PDMA activity from user space, making the following operations difficult:
- Diagnose the reasons why P2PDMA may not work (or perform poorly).
- Verify whether the P2PDMA mapping uses the expected type (BUS_ADDR or THRU_HOST_BRIDGE)
- Monitor the use of P2PDMA in production environments
- Detect potential memory leaks (unmapped allocations)
P2PDMA is a subtle feature. When P2PDMA mapping cannot use BUS_ADDR (Direct PCIe Switch Path), it silently falls back to the THRU_HOST_BRIDGE,
routing traffic to the host bridge. This significantly reduces performance (usually by 10 times or more), but it cannot be detected
from user space.
Therefore, I plan to export some metrics in the user space to better observe P2PDMA activity.
This series of solutions adds three layers of observability:
1. Tracepoints (5 events, optional, no overhead when disabled)
- p2p_dma_alloc: P2P memory allocation
- p2p_dma_free: P2P memory release
- p2p_dma_map: P2P DMA mapping (including client/provider, mapping type,
PCIe distance and process information)
- p2p_dma_unmap: P2P DMA removes mapping
- p2p_map_type_change: New mapping type calculations (xarray missed)
All tracking points include the calling process (comm pid), enabling P2PDMA activity tracking for each process.
Example:
$ cat /sys/kernel/debug/tracing/trace | grep p2p_dma_map
nvme[1234] map nvme0 -> p2p_mem type=BUS_ADDR dist=4
python[5678] map nvme1 -> p2p_mem type=THRU_HOST_BRIDGE dist=8
2. Debugfs (global cumulative counter, always available)
- /sys/kernel/debug/pci-p2pdma/
- 11 counters: total_mappings, bus_addr_mappings, host_bridge_mappings,
total_allocations, error_count, etc.
- Enable the calculation of the "BUS_ADDR ratio" to quantify the effectiveness of P2PDMA.
3. Sysfs (Statistical Information for Each Device, Production Environment Safety)
- /sys/bus/pci/devices/*/p2pmem/stats/
- 4 attributes: alloc_count, free_count, mapped_bytes, peak_mapped_bytes
Performance impact
- Tracking point: Static branch, zero overhead when disabled (by default).
- Debugfs/sysfs: atomic64_t counter, no locking, negligible overhead
- After disabling all observability, the P2PDMA thermal path remains unchanged
I would appreciate feedback on:
1. Is the overall solution worth implementing?
2. Is the set of tracepoints appropriate? Any events I'm missing?
3. Are the tracepoint fields sufficient for debugging?
4. Is the debugfs/sysfs interface design acceptable?
5. Any concerns about the implementation approach?
^ permalink raw reply
* Re: [PATCH v2 1/7] PCI/IOV: Return u16 from pci_sriov_get_totalvfs()
From: Alexandre Courbot @ 2026-06-25 0:45 UTC (permalink / raw)
To: Gary Guo
Cc: David Laight, Zhi Wang, dakr, airlied, simona, ojeda, alex.gaynor,
boqun.feng, bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross,
jhubbard, ecourtney, joelagnelf, apopple, cjia, smitra, kjaju,
alkumar, ankita, aniketa, kwankhede, targupta, nova-gpu,
linux-kernel, zhiwang, Bjorn Helgaas, linux-pci
In-Reply-To: <DJHNQRALREG5.37R3M4GQ2A29V@garyguo.net>
On Thu Jun 25, 2026 at 8:11 AM JST, Gary Guo wrote:
> On Wed Jun 24, 2026 at 8:38 PM BST, David Laight wrote:
>> On Wed, 24 Jun 2026 23:59:56 +0900
>> "Alexandre Courbot" <acourbot@nvidia.com> wrote:
>>
>>> On Wed Jun 24, 2026 at 10:39 PM JST, David Laight wrote:
>>> > On Wed, 24 Jun 2026 21:40:52 +0900
>>> > "Alexandre Courbot" <acourbot@nvidia.com> wrote:
>>> >
>>> >> On Tue Jun 23, 2026 at 4:43 AM JST, Zhi Wang wrote:
>>> >> > pci_sriov_get_totalvfs() reports a VF count, not an errno-style
>>> >> > status. It returns 0 when SR-IOV is unavailable or the device is not a
>>> >> > PF, and otherwise returns the PF's driver_max_VFs value.
>>> >> >
>>> >> > driver_max_VFs is stored as a u16 in struct pci_sriov. It is derived
>>> >> > from the SR-IOV TotalVFs field or from a driver-provided limit, so the
>>> >> > implementation cannot return a negative value.
>>> >> >
>>> >> > Change the declaration, CONFIG_PCI_IOV stub, and implementation to
>>> >> > return u16. Update callers to store the result in u16 variables, remove
>>> >> > obsolete negative-value checks, and use unsigned format specifiers where
>>> >> > needed.
>>> >> >
>>> >> > Cc: Bjorn Helgaas <bhelgaas@google.com>
>>> >> > Cc: linux-pci@vger.kernel.org
>>> >> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
>>> >>
>>> >> Suggested-by: Alexandre Courbot <acourbot@nvidia.com>
>>> >> Link: https://lore.kernel.org/all/DETDILPA1GFY.27WND0TEC5352@nvidia.com/
>>> >>
>>> >> > ---
>>> >> > drivers/crypto/hisilicon/qm.c | 8 +++++---
>>> >> > drivers/crypto/intel/qat/qat_common/adf_sriov.c | 6 +++---
>>> >> > drivers/gpu/drm/xe/xe_sriov_pf.c | 6 ++----
>>> >> > drivers/misc/genwqe/card_base.c | 6 ++----
>>> >> > drivers/net/ethernet/cavium/thunder/nic_main.c | 2 +-
>>> >> > drivers/net/ethernet/emulex/benet/be_main.c | 3 ++-
>>> >> > drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 3 ++-
>>> >> > drivers/net/ethernet/sfc/ef10_sriov.c | 2 +-
>>> >>
>>> >> I believe that you can avoid converting all these drivers in this patch.
>>> >> The implicit `u16 -> int` conversion done by C should result in the
>>> >> expected behavior, and it will be fewer Acked-by to collect.
>>> >
>>> > The generated code is also likely to be slightly better if the function
>>> > return value is a 32bit value.
>>> >
>>> > Similarly you don't really want to do any kind of maths on local variables
>>> > that aren't 32bit (or 64bit on 64bit builds).
>>> >
>>> > The fact that the domain of a value fits in 16 bits doesn't mean that
>>> > it is better to use u16 - it is usually worse.
>>> > Pretty much the only place u16 should be used is to reduce the size
>>> > of structures.
>>> >
>>> > So it is probably correct to change the return type to unsigned int and
>>> > remove the error return checks, but nothing else.
>>>
>>> For C, I agree that unsigned int is the safest type.
>>>
>>> Rust otoh does not do implicit integer promotion, and making it return a
>>> `u16` carries useful range information. I wonder if we could have a
>>> private `__pci_sriov_get_totalvfs` that returns a `u16`, make
>>> `pci_sriov_get_totalvfs` promote it to an `unsigned int` and return it,
>>> while the Rust bindings would invoke `__pci_sriov_get_totalvfs` so they
>>> can expose a `u16`?
>
> There're no value in having two versions with just slightly different promotion
> behaviour. Rust side can still use u16 by casting from unsigned int without
> having to expose both.
Indeed, that occurred to me only after I sent my reply. Since the
register this value originates from is 16 bits, we can justify that a
cast of the return to u16 won't lose any data.
Just making the C API return unsigned int is fine then.
^ permalink raw reply
* Re: [PATCH v2 1/7] PCI/IOV: Return u16 from pci_sriov_get_totalvfs()
From: Gary Guo @ 2026-06-24 23:11 UTC (permalink / raw)
To: David Laight, Alexandre Courbot
Cc: Zhi Wang, dakr, airlied, simona, ojeda, alex.gaynor, boqun.feng,
gary, bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross, jhubbard,
ecourtney, joelagnelf, apopple, cjia, smitra, kjaju, alkumar,
ankita, aniketa, kwankhede, targupta, nova-gpu, linux-kernel,
zhiwang, Bjorn Helgaas, linux-pci
In-Reply-To: <20260624203851.7f6c7be4@pumpkin>
On Wed Jun 24, 2026 at 8:38 PM BST, David Laight wrote:
> On Wed, 24 Jun 2026 23:59:56 +0900
> "Alexandre Courbot" <acourbot@nvidia.com> wrote:
>
>> On Wed Jun 24, 2026 at 10:39 PM JST, David Laight wrote:
>> > On Wed, 24 Jun 2026 21:40:52 +0900
>> > "Alexandre Courbot" <acourbot@nvidia.com> wrote:
>> >
>> >> On Tue Jun 23, 2026 at 4:43 AM JST, Zhi Wang wrote:
>> >> > pci_sriov_get_totalvfs() reports a VF count, not an errno-style
>> >> > status. It returns 0 when SR-IOV is unavailable or the device is not a
>> >> > PF, and otherwise returns the PF's driver_max_VFs value.
>> >> >
>> >> > driver_max_VFs is stored as a u16 in struct pci_sriov. It is derived
>> >> > from the SR-IOV TotalVFs field or from a driver-provided limit, so the
>> >> > implementation cannot return a negative value.
>> >> >
>> >> > Change the declaration, CONFIG_PCI_IOV stub, and implementation to
>> >> > return u16. Update callers to store the result in u16 variables, remove
>> >> > obsolete negative-value checks, and use unsigned format specifiers where
>> >> > needed.
>> >> >
>> >> > Cc: Bjorn Helgaas <bhelgaas@google.com>
>> >> > Cc: linux-pci@vger.kernel.org
>> >> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
>> >>
>> >> Suggested-by: Alexandre Courbot <acourbot@nvidia.com>
>> >> Link: https://lore.kernel.org/all/DETDILPA1GFY.27WND0TEC5352@nvidia.com/
>> >>
>> >> > ---
>> >> > drivers/crypto/hisilicon/qm.c | 8 +++++---
>> >> > drivers/crypto/intel/qat/qat_common/adf_sriov.c | 6 +++---
>> >> > drivers/gpu/drm/xe/xe_sriov_pf.c | 6 ++----
>> >> > drivers/misc/genwqe/card_base.c | 6 ++----
>> >> > drivers/net/ethernet/cavium/thunder/nic_main.c | 2 +-
>> >> > drivers/net/ethernet/emulex/benet/be_main.c | 3 ++-
>> >> > drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 3 ++-
>> >> > drivers/net/ethernet/sfc/ef10_sriov.c | 2 +-
>> >>
>> >> I believe that you can avoid converting all these drivers in this patch.
>> >> The implicit `u16 -> int` conversion done by C should result in the
>> >> expected behavior, and it will be fewer Acked-by to collect.
>> >
>> > The generated code is also likely to be slightly better if the function
>> > return value is a 32bit value.
>> >
>> > Similarly you don't really want to do any kind of maths on local variables
>> > that aren't 32bit (or 64bit on 64bit builds).
>> >
>> > The fact that the domain of a value fits in 16 bits doesn't mean that
>> > it is better to use u16 - it is usually worse.
>> > Pretty much the only place u16 should be used is to reduce the size
>> > of structures.
>> >
>> > So it is probably correct to change the return type to unsigned int and
>> > remove the error return checks, but nothing else.
>>
>> For C, I agree that unsigned int is the safest type.
>>
>> Rust otoh does not do implicit integer promotion, and making it return a
>> `u16` carries useful range information. I wonder if we could have a
>> private `__pci_sriov_get_totalvfs` that returns a `u16`, make
>> `pci_sriov_get_totalvfs` promote it to an `unsigned int` and return it,
>> while the Rust bindings would invoke `__pci_sriov_get_totalvfs` so they
>> can expose a `u16`?
There're no value in having two versions with just slightly different promotion
behaviour. Rust side can still use u16 by casting from unsigned int without
having to expose both.
> That is getting silly, something will end up doing the masking to police
> the 16bit value - it isn't just the C rules, the ABI pass/return u16 in
> in 32bit registers (even on x86) so one side (or both) has to mask the
> value.
>
> Your rust code has the same problem. Add 1 to a u16 variable and code has
> to be added to mask the result to 16 bits.
> You won't see it on x86 because it has 16bit alu operations, about the
> only other cpu eoth them is m68k (I'm not sure about s390 - it is based
> on a very old instructions set).
There's no implicit integer casting (extending or truncating) in Rust so usually
the setter and getters for the same property would be of the same type.
pci_sriov_set_totalvfs takes u16 so I expect the pci_sriov_get_totalvfs
API binding in Rust returning u16 too. That said, this can be handled by Rust
bindings as mentioned above.
This is not hot path and just have a handful of users anyway and I wouldn't
worry about a single masking/sign-extension instruction.
Best,
Gary
^ permalink raw reply
* Re: [GIT PULL] PCI changes for v7.2
From: pr-tracker-bot @ 2026-06-24 22:17 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Linus Torvalds, linux-pci, linux-kernel, Rob Herring,
Lorenzo Pieralisi, Manivannan Sadhasivam,
Krzysztof Wilczyński
In-Reply-To: <20260624110021.GA996382@bhelgaas>
The pull request you sent on Wed, 24 Jun 2026 06:00:21 -0500:
> git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git tags/pci-v7.2-changes
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/26ae421f7f49f8a6a32d15b1d21a782b46a1bad5
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [PATCH v2 1/7] PCI/IOV: Return u16 from pci_sriov_get_totalvfs()
From: David Laight @ 2026-06-24 19:38 UTC (permalink / raw)
To: Alexandre Courbot
Cc: Zhi Wang, dakr, airlied, simona, ojeda, alex.gaynor, boqun.feng,
gary, bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross, jhubbard,
ecourtney, joelagnelf, apopple, cjia, smitra, kjaju, alkumar,
ankita, aniketa, kwankhede, targupta, nova-gpu, linux-kernel,
zhiwang, Bjorn Helgaas, linux-pci
In-Reply-To: <DJHDAQZ1AASK.14ATT70HTI450@nvidia.com>
On Wed, 24 Jun 2026 23:59:56 +0900
"Alexandre Courbot" <acourbot@nvidia.com> wrote:
> On Wed Jun 24, 2026 at 10:39 PM JST, David Laight wrote:
> > On Wed, 24 Jun 2026 21:40:52 +0900
> > "Alexandre Courbot" <acourbot@nvidia.com> wrote:
> >
> >> On Tue Jun 23, 2026 at 4:43 AM JST, Zhi Wang wrote:
> >> > pci_sriov_get_totalvfs() reports a VF count, not an errno-style
> >> > status. It returns 0 when SR-IOV is unavailable or the device is not a
> >> > PF, and otherwise returns the PF's driver_max_VFs value.
> >> >
> >> > driver_max_VFs is stored as a u16 in struct pci_sriov. It is derived
> >> > from the SR-IOV TotalVFs field or from a driver-provided limit, so the
> >> > implementation cannot return a negative value.
> >> >
> >> > Change the declaration, CONFIG_PCI_IOV stub, and implementation to
> >> > return u16. Update callers to store the result in u16 variables, remove
> >> > obsolete negative-value checks, and use unsigned format specifiers where
> >> > needed.
> >> >
> >> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> > Cc: linux-pci@vger.kernel.org
> >> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> >>
> >> Suggested-by: Alexandre Courbot <acourbot@nvidia.com>
> >> Link: https://lore.kernel.org/all/DETDILPA1GFY.27WND0TEC5352@nvidia.com/
> >>
> >> > ---
> >> > drivers/crypto/hisilicon/qm.c | 8 +++++---
> >> > drivers/crypto/intel/qat/qat_common/adf_sriov.c | 6 +++---
> >> > drivers/gpu/drm/xe/xe_sriov_pf.c | 6 ++----
> >> > drivers/misc/genwqe/card_base.c | 6 ++----
> >> > drivers/net/ethernet/cavium/thunder/nic_main.c | 2 +-
> >> > drivers/net/ethernet/emulex/benet/be_main.c | 3 ++-
> >> > drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 3 ++-
> >> > drivers/net/ethernet/sfc/ef10_sriov.c | 2 +-
> >>
> >> I believe that you can avoid converting all these drivers in this patch.
> >> The implicit `u16 -> int` conversion done by C should result in the
> >> expected behavior, and it will be fewer Acked-by to collect.
> >
> > The generated code is also likely to be slightly better if the function
> > return value is a 32bit value.
> >
> > Similarly you don't really want to do any kind of maths on local variables
> > that aren't 32bit (or 64bit on 64bit builds).
> >
> > The fact that the domain of a value fits in 16 bits doesn't mean that
> > it is better to use u16 - it is usually worse.
> > Pretty much the only place u16 should be used is to reduce the size
> > of structures.
> >
> > So it is probably correct to change the return type to unsigned int and
> > remove the error return checks, but nothing else.
>
> For C, I agree that unsigned int is the safest type.
>
> Rust otoh does not do implicit integer promotion, and making it return a
> `u16` carries useful range information. I wonder if we could have a
> private `__pci_sriov_get_totalvfs` that returns a `u16`, make
> `pci_sriov_get_totalvfs` promote it to an `unsigned int` and return it,
> while the Rust bindings would invoke `__pci_sriov_get_totalvfs` so they
> can expose a `u16`?
That is getting silly, something will end up doing the masking to police
the 16bit value - it isn't just the C rules, the ABI pass/return u16 in
in 32bit registers (even on x86) so one side (or both) has to mask the
value.
Your rust code has the same problem. Add 1 to a u16 variable and code has
to be added to mask the result to 16 bits.
You won't see it on x86 because it has 16bit alu operations, about the
only other cpu eoth them is m68k (I'm not sure about s390 - it is based
on a very old instructions set).
David
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox