Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* Re: [PATCH v4 00/31] Introduce SCMI Telemetry FS support
From: Cristian Marussi @ 2026-06-18 23:28 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Cristian Marussi, Christian Brauner, linux-kernel,
	linux-arm-kernel, arm-scmi, linux-fsdevel, linux-doc,
	sudeep.holla, james.quinlan, f.fainelli, vincent.guittot,
	etienne.carriere, peng.fan, michal.simek, d-gole, jic23,
	elif.topuz, lukasz.luba, philip.radford, souvik.chakravarty,
	leitao, kas, puranjay, usama.arif, kernel-team
In-Reply-To: <29a304f0-1e62-418a-b84f-aabdc4c0de8d@kernel.org>

On Thu, Jun 18, 2026 at 07:22:29PM +0200, David Hildenbrand (Arm) wrote:
> Hi,

Hi David,

> 
> asking some clarifying questions that I assume also Christian might want to know.
> 

Thanks for having a look !

> >>> In a nutshell, the SCMI Telemetry protocol allows an agent to discover at
> >>> runtime the set of Telemetry Data Events (DEs) available on a specific
> >>> platform and provides the means to configure the set of DEs that a user is
> 
> Is the configuration aspect limited to enabling selected events, or is there
> more that can be configured?
>

The needed configuration is:

 - global Telemetry enable (tlm_enable)
 - global common update_interval (current_update_interval)
 - per-DE enable/disable (des/0x<NNNN>/enable)
 - per-DE timestamping enable/disable (des/0x<NNNN>/tstamp_enable)

 ... then there are a couple of handy catch-all entries:
	all_des_enable, all_des_tstamp_enable

Note that all the existent DEs are discovered at runtime dynamically via
SCMI in the background at init/probe and then never change: i.e.
the tree is statically created upon discovery, user cannot
create/destroy or symlink files at will, nor the backend platform FW
running the SCMI server can pop-up new DataEvents after the initial
enumeration.

All the above configs can also be pre-defined in the FW (at built time)
as being default boot-on with predefined values, like a specific
boot-on update interval, so that you could have a system in which really
you dont need to configure anything...everything is on and you just
read data. (unless you want to change config of course...)

There is more stuff that indeed is configurable per the SCMI spec
but these additional params are hidden into the SCMI Telemetry protocol
layer (the initial patches in this series) and NOT made available to
the driver/users of the protocol (like the SCMI FS driver that sits on
top)

IOW, this humonguos series (~8k lines) is only partially composed by
the Filesystem driver (~3k): the bulk of the Telemetry logic and SCMI
message exchanges are contained in the SCMI Protocol stack which has
been extended to support the Telemerty protocol at first
(the 'firmware: arm_scmi:' initial patches).

This latter common support is exposed by the SCMI stack for the SCMI
drivers to use via custom per-protocol operations (not an orginal name :P)
exposed in include/linux/scmi_protocol.h

So when you write into FS to configure smth, you end up calling an internal
tlm_proto_ops that in turn will cause an SCMI message to be sent
(in some cases say to enable a DE or set the update interval)

When you read something, you end up calling another Telemetry operation
that in turn returns you the DataEvent value you were looking for...how
this is retrieved via SCMI in the background is transparent to the
FS driver because, again, these details are buried into the protocol
layer. Talking about reads, you can:

 - read a single value from des/0x<NNNN>/value
 - read ALL the currently enabled DE in a bulk read via des_bulk_read

...most of the other entries in the tree are simply RO properties of the DEs
that have been discovered at enumeration time.

Given that walking a FS tree and issuing configuration as writes is NOT
performant really (nor handy if you are not a human), currently, even
in this FS-based series you can really perform all of the discovery AND
the configuration tasks WITHOUT walking the filesystem tree, but instead
issuing a bunch of IOCTLs issued on a special 'control' file that I
embedded in the FS. Such UAPI IOCTLs described at:

https://lore.kernel.org/arm-scmi/20260612223802.1337232-6-cristian.marussi@arm.com/T/#u

So my plan of action in order to get rid of the FS in-kenel implementation
would be to drop this Filesystem in favour of simple character devices
and move the existent IOCTLs interface (revisited where needed) on top of
these devices: that way you will be able to use IOCTLs to enumerate the
Telemetry sources and then configure them.

Read will then happen (probably) leveraging a number of chardev fops like:
IOCTLs, .read and .mmap...up to the tool decide what to use.

After this porting to chardev is done, I would start optionally exposing
again all of this in a human-readable alternative way by adding a layer
of FUSE on top of this chardev interface.

Basically my aim is to drop the FS implementation from the kernel, as
advised, while trying to optionally make it still available via a userspace
FUSE implementation...IOW the intention would be for the next V5 to expose
the same interfaces as V4 but with the help of a tool instead that builds,
if wanted, a FUSE mount built on top of the chardev interface.

So basically 'floating up' the current FS-like interface into userspace.

> >>> interested into, while reading them back using the collection method that
> >>> is deeemed more suitable for the usecase at hand. (...amongst the various
> >>> possible collection methods allowed by SCMI specification)
> >>>
> >>> Without delving into the gory details of the whole SCMI Telemetry protocol
> >>> let's just say that the SCMI platform/server firmware advertises a number
> >>> of Telemetry Data Events, each one identified by a 32bit unique ID, and an
> >>> SCMI agent/client, like Linux, can discover them and read back at will the
> >>> associated data value in a number of ways.
> >>> Data collection is mainly intended to happen on demand via shared memory
> >>> areas exposed by the platform firmware, discovered dynamically via SCMI
> >>> Telemetry and accessed by Linux on-demand, but some DE can also be reported
> >>> via SCMI Notifications asynchronous messages or via direct dedicated
> >>> FastChannels (another kind of SCMI memory based access): all of this
> >>> underlying mechanism is anyway hidden to the user since it is mediated by
> >>> the kernel driver which will return the proper data value when queried.
> >>>
> >>> Anyway, the set of well-known architected DE IDs defined by the spec is
> >>> limited to a dozen IDs, which means that the vast majority of DE IDs are
> >>> customizable per-platform: as a consequence, though, the same ID, say
> >>> '0x1234', could represent completely different things on different systems.
> >>>
> >>> Precise definitions and semantic of such custom Data Event IDs are out of
> >>> the scope of the SCMI Telemetry specification and of this implementation:
> >>> they are supposed to be provided using some kind of JSON-like description
> >>> file that will have to be consumed by a userspace tool which would be
> >>> finally in charge of making sense of the set of available DEs.
> 
> You mention json here ... but I assume the data we are getting fed by the
> protocol is not in some default format? (e.g., json)

The data format is defined by the SCMI spec and it is buried in the SCMI
layer, there are a number of collection method and a number of formats: this
is NOT exposed from the SCMI core BUT handled transparently.

The raw spec format basically defines how DE ID, Tstamps, values are represented
in memory and how their consistency can be assured despite the fact that
platform could update the same entries that a user is concurrently reading...

JSON definitions only assign a semantic to the DEs (in theory...): e.g. on this
specific platform...wth is 0x1234 ? ..also note that JSON defs are NOT part of
the spec....they do NOT really exist for the Kernel: they are parsed and
interpreted by more complex user space tools that are supposed to leverage some
of these interfaces to retrieve data and carry-on analysis.

> 
> >>>
> >>> IOW, in turn, this means that even though the DEs enumerated via SCMI come
> >>> with some sort of topological and qualitative description provided by the
> >>> protocol (like unit of measurements, name, topology info etc), kernel-wise
> >>> we CANNOT be completely sure of "what is what" without being fed-back some
> >>> sort of information about the DEs by the afore mentioned userspace tool.
> 
> Maybe you have it in some of the patches here, but what does the typical
> directory + file structure look like in the current implementation?
> 
> Do you have an example?
> 
> Also, is everything in that filesystem read-only, or are there some writable
> file (IOW, how is stuff configured?).

See above for config/write entry ... and I think you found the FS layout in the
doc already...

> 
> >>>
> >>> For these reasons, currently this series does NOT attempt to register any
> >>> of these DEs with any of the usual in-kernel subsystems (like HWMON, IIO,
> >>> PERF etc), simply because we cannot be sure which DE is suitable, or even
> >>> desirable, for a given subsystem. This also means there are NO in-kernel
> >>> users of these Telemetry data events as of now.
> 
> Okay, so you really only feed this data to user space, exposing all the data you
> have easily available as part of the protocol.

Yes, no interpetation nor filtering: I expose all that have enumerated and/
discovered by the protocol, allowing for configurations while hiding the inner
SCMI Telemetry mechanism...

> 
> >>>
> >>> So, while we do not exclude, for the future, to feed/register some of the
> >>> discovered DEs to/with some of the above mentioned Kernel subsystems, as
> >>> of now we have ONLY modeled a custom userspace API to make SCMI Telemetry
> >>> available to userspace tools.
> 
> It's a good question how that could be done, if you need more information about
> these events from user space.

I have NOT really delved into that, so as of know we do NOT fed any data
to existing Kernel subsystems, not there is any available in-kernel
interface to consume DE data (nobody asked), but, I can imagine 2 solution:

 - our beloved architects decide to 'architect' more DataEvents in the
   next version of the spec.. i.e. they reserve some specific DE IDs to
   represent some well defined entity (like it is done already in the spec
   for a dozen IDs)...this avoids the needs of any new interface all
   together

OR

- we open some sort of user-->kernel ABI channel 'somewhere' where the
  userspace tool, interpreting the JSON description, can communicate something
  like " on this platform ID 1,2,3,4 should be fed to the IIO sensors frmwk
  too, while ID 39,8,76 can be fed to HWMON..." etc

> 
> >>>
> >>> In deciding which kind of interface to expose SCMI Telemetry data to a
> >>> user, this new SCMI Telemetry driver aims at satisfying 2 main reqs:
> >>>
> >>>  - exposing an FS-based human-readable interface that can be used to
> >>>    discover, configure and access our Telemetry data directly also from
> >>>    the shell without special tools
> >>>
> >>>  - exposing alternative machine-friendly, more-performant, binary
> >>>    interfaces that can be used to avoid the overhead of multiple accesses
> >>>    to the VFS and that can be more suitable to access with custom tools
> [...]
> 
> >>>
> >>> Due to the above reasoning, since V1 we opted for a new approach with the
> >>> proposed interfaces now based on a full fledged, unified, virtual pseudo
> >>> filesystem implemented from scratch, so that we can:
> >>>
> >>>  - expose all the DEs property we like as before with SysFS, but without
> >>>    any of the constraint imposed by the usage of SysFs or kernfs.
> >>>
> >>>  - easily expose additional alternative views of the same set of DEs
> >>>    using symlinking capabilities (e.g. alternative topological view)
> 
> That sounds reasonable.
> 
> [...]
> 
> > ...I would not say that this was the kind of feedback I was hoping for,
> > but I am NOT gonna argue, given that you shot down already what I thought
> > were all my best selling points :P
> > 
> > At this point my understanding is that the way forward must be to use
> > a custom tool to configure/extract/translate the raw Telemetry data and
> > move up into userspace the whole human readable FS layer via FUSE, if
> > really needed.
> > 
> > I suppose that the new kernel/user interface has to be some dedicated char
> > device implementing proper fops. (like I did previously in early versions
> > of this series and then abandoned...)
> > 
> > Is this you have in mind ? Dedicated character device(s) with enough fops
> > to be able to configure/extract Telemetry data with a custom tool ?
> 
> I cannot speak for Christian, but I guess you could have some kind of libscmi in
> user space that can obtain the information (as you say, probably char device,
> not sure which alternatives we have), to expose the data through a nice ABI, to
> then either make tools build upon that directly, or have a fuse server in user
> space that mimics what you currently do with the file system.

My aim would be at first a simple tool that can exercise the chardev interface to
discover configure and read back data, and then a FUSE server on top of this to
optionally expose the human readable FS....I suppose our internal and external
customers can use the FS interface to validate/test/script on one side, OR
simply code their own tools/libs to use directly the bare chardev inteface...

...we do have a tools team already working on a library to ease all of this
SCMI Telemtry collection and analysis...it is just a matter to re-target the
kind of lower level interfaces that they are using in the near future
probably (they were already planning indeed AFAIK to use more performant
interface that FS...)

> 
> One thing that is not clear to me yet is how stuff would be configured, and how
> possibly multiple users of libscmi would possibly interact.
> 

Configuration/discovery will happen via IOCTls, while consuming the Data
can happen:

 - all together in bulk via a device read fops
 - a single DE via a targeted IOCTL
 - direct access to the raw SCMI data via dev/mmap of the underlying SCMI
   areas (that means the tool has to parse the SCMI format defined by the
   spec on its own, without the currently provided Kernel mediation...)

Regarding the user concurrency, I have already explicitly pushed back on
this, our own tools team: any concurrent read or configuration write is
allowed and properly handled in a consistent way, BUT on the configuration
side the last write/ioctl wins: there is NO in-kernel OR userspace
co-ordination provided out of the box: IOW if you use multiple tools
concurrently to apply conflicting configurations, it is none of our problem

...similarly as if you have an actively running network configuration daemon
and you try to set your IP manually...nobody will prevent you from doing this,
the same netlink will be used freely by you on the shell and the daemon (if you
have enough privilege), but you will gonna have unexpected result...

I dont either see the case to enforce exclusive access for Telemetry resources:
co-ordination is up to the user in my view...I mean if you have 2 tools
configuring concurrently SCMI telemetry in a conflicting way something has been
misconfigured somewhere

.....having said that, I understand that the concurrency co-ordination
issue can be particularly tricky to spot and solve in userspace, so I DO
expose a generation counter entry that is updated on any configuration
change, so that a userspace app using Telemetry can monitor (poll) this
counter to spot if someone else on the system is quietly suddenly applying
configuration changes...

> > 
> > Should/could such a tool live in the kernel tree (tools/) at least for
> > ease of development/deployment ?
> 
> I think OOT.
> 

Ok.

Sorry for the long email..I hope I have clarified the situation, anyway
I am already moving to get rid of the in-kernel interface as advised in
favour of a chardev kernel interface and an optional FUSE based FS...

Any further feedback is welcome...

Thanks,
Cristian

^ permalink raw reply

* Re: [PATCH] kselftest docs: remove reference to obsolete/archived wiki
From: Shuah Khan @ 2026-06-18 23:12 UTC (permalink / raw)
  To: Brett Sheffield
  Cc: Rafael Passos, shuah, corbet, linux-kselftest, workflows,
	linux-doc, linux-kernel, Shuah Khan
In-Reply-To: <ajQ1iGNw2lf_ZHqp@karahi.librecast.net>

On 6/18/26 12:14, Brett Sheffield wrote:
> On 2026-06-18 11:02, Shuah Khan wrote:
>> My apologies  for not taking your patch earlier. Considering the effort
>> you put in with a re-sending the patch and following up here, it is
>> only fair for me to take yours instead. Hope it will apply cleanly on
>> top of kselftest-next
>>
>> Rafael, I am going to take Brett;s patch instead of yours.
>>
>> Apologies to both of you for the mix up.
> 
> Thanks Shuah & no worries.
> 
> 

Your patch in now in linux-kselftest next for my next pr 7.2-rc1

thanks,
-- Shuah

^ permalink raw reply

* Re: [RFC PATCH] reserve_mem: add support for static memory
From: Randy Dunlap @ 2026-06-18 23:02 UTC (permalink / raw)
  To: Shyam Saini, linux-mm, linux-doc, linux-kernel
  Cc: rppt, akpm, kees, tony.luck, gpiccoli, bp, peterz, feng.tang,
	dapeng1.mi, elver, enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <20260618224018.117978-1-shyamsaini@linux.microsoft.com>



On 6/18/26 3:40 PM, Shyam Saini wrote:
> reserve_mem relies on dynamic memory allocation, this limits the
> usecase where memory and its address is required to be preserved
> across the boots. Eg: ramoops memory reservation on ACPI platforms
> 
> So add support to pass a pre-determined static address and reserve
> memory at this specified address. This enables use case like ramoops
> on ACPI platforms to reliably access ramoops region across the boots.
> 
> Also skip parsing of "align" parameter when static address is passed.
> 
> Example syntax for static address
>  reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
> 
> Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> ---
>  .../admin-guide/kernel-parameters.txt         | 15 ++++++
>  mm/memblock.c                                 | 48 +++++++++++++------
>  2 files changed, 49 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index b5493a7f8f228..7e0baca564b97 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6563,6 +6563,21 @@ Kernel parameters
>  
>  			reserve_mem=12M:4096:oops ramoops.mem_name=oops
>  
> +	reserve_mem=	[RAM]

According to this file (kernel-parameters.txt), "RAM" means that this
option applies when "RAM disk support is enabled."
Is that correct here?
My guess is that it's incorrect for the other (existing) "reserve_mem=" also.


> +			Format: nn[KMG]:<@offset>:<label>
> +			Reserve physical memory at pre-determined location and label it with

Preferably		                           predetermined

> +			a name that other subsystems can use to access it. This is typically
> +			used for systems that do not wipe the RAM, and this command
> +			line will try to reserve the same physical memory on
> +			soft reboots. Note, it is guaranteed to be the same
> +			location unless some other early allocation for eg: crashkernel=256M

			                                 allocation, e.g.: crashkernel=256M

> +                        (without static address) is reserved or overlapps this region.

			                                           overlaps

> +
> +			The format is size:offset:label for example, to request
> +			4 megabytes for ramoops at 0x1E0000000:
> +
> +			reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
> +
>  	reservetop=	[X86-32,EARLY]
>  			Format: nn[KMG]
>  			Reserves a hole at the top of the kernel virtual
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 6349c48154f4b..79f1f806ea364 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -2721,6 +2721,7 @@ static int __init reserve_mem(char *p)
>  	char *name;
>  	char *oldp;
>  	int len;
> +	bool addr_is_static = false;
>  
>  	if (!p)
>  		goto err_param;
> @@ -2739,16 +2740,27 @@ static int __init reserve_mem(char *p)
>  	if (*p != ':')
>  		goto err_param;
>  
> -	align = memparse(p+1, &p);
> +	/* parse the static memory address */
> +	if (*p == '@') {
> +		start = memparse(p+1, &p);
> +		addr_is_static = true;
> +	}
> +
>  	if (*p != ':')
>  		goto err_param;
>  
> -	/*
> -	 * memblock_phys_alloc() doesn't like a zero size align,
> -	 * but it is OK for this command to have it.
> -	 */
> -	if (align < SMP_CACHE_BYTES)
> -		align = SMP_CACHE_BYTES;
> +	if (!addr_is_static) {
> +		align = memparse(p+1, &p);
> +		if (*p != ':')
> +			goto err_param;
> +
> +		/*
> +		 * memblock_phys_alloc() doesn't like a zero size align,
> +		 * but it is OK for this command to have it.
> +		 */
> +		if (align < SMP_CACHE_BYTES)
> +			align = SMP_CACHE_BYTES;
> +	}
>  
>  	name = p + 1;
>  	len = strlen(name);
> @@ -2772,16 +2784,24 @@ static int __init reserve_mem(char *p)
>  	}
>  
>  	/* Pick previous allocations up from KHO if available */
> -	if (reserve_mem_kho_revive(name, size, align))
> +	if (!addr_is_static && reserve_mem_kho_revive(name, size, align))
>  		return 1;
>  
> -	/* TODO: Allocation must be outside of scratch region */
> -	start = memblock_phys_alloc(size, align);
> -	if (!start) {
> -		pr_err("reserve_mem: memblock allocation failed\n");
> -		return -ENOMEM;
> -	}
> +	if (addr_is_static) {
> +		if (memblock_reserve(start, size)) {
> +			pr_err("reserve_mem: memblock reservation failed\n");
> +			return -ENOMEM;
> +		}
> +
> +	} else {
> +		/* TODO: Allocation must be outside of scratch region */
> +		start = memblock_phys_alloc(size, align);
> +		if (!start) {
> +			pr_err("reserve_mem: memblock allocation failed\n");
> +			return -ENOMEM;
> +		}
>  
> +	}
>  	reserved_mem_add(start, size, name);
>  
>  	return 1;

__setup() functions return 1 for "yes, I handled this option" (even if there
were errors in it) and 0 for failure ("no idea what this string/option means").
These -ENOMEM, -EINVAL, -EBUSY (in this function) are non-zero so they look like
success and should not be here.

-- 
~Randy


^ permalink raw reply

* [RFC PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-18 22:40 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel
  Cc: rppt, akpm, kees, tony.luck, gpiccoli, bp, rdunlap, peterz,
	feng.tang, dapeng1.mi, elver, enelsonmoore, kuba, lirongqing,
	ebiggers

reserve_mem relies on dynamic memory allocation, this limits the
usecase where memory and its address is required to be preserved
across the boots. Eg: ramoops memory reservation on ACPI platforms

So add support to pass a pre-determined static address and reserve
memory at this specified address. This enables use case like ramoops
on ACPI platforms to reliably access ramoops region across the boots.

Also skip parsing of "align" parameter when static address is passed.

Example syntax for static address
 reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops

Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
---
 .../admin-guide/kernel-parameters.txt         | 15 ++++++
 mm/memblock.c                                 | 48 +++++++++++++------
 2 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b5493a7f8f228..7e0baca564b97 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6563,6 +6563,21 @@ Kernel parameters
 
 			reserve_mem=12M:4096:oops ramoops.mem_name=oops
 
+	reserve_mem=	[RAM]
+			Format: nn[KMG]:<@offset>:<label>
+			Reserve physical memory at pre-determined location and label it with
+			a name that other subsystems can use to access it. This is typically
+			used for systems that do not wipe the RAM, and this command
+			line will try to reserve the same physical memory on
+			soft reboots. Note, it is guaranteed to be the same
+			location unless some other early allocation for eg: crashkernel=256M
+                        (without static address) is reserved or overlapps this region.
+
+			The format is size:offset:label for example, to request
+			4 megabytes for ramoops at 0x1E0000000:
+
+			reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
+
 	reservetop=	[X86-32,EARLY]
 			Format: nn[KMG]
 			Reserves a hole at the top of the kernel virtual
diff --git a/mm/memblock.c b/mm/memblock.c
index 6349c48154f4b..79f1f806ea364 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2721,6 +2721,7 @@ static int __init reserve_mem(char *p)
 	char *name;
 	char *oldp;
 	int len;
+	bool addr_is_static = false;
 
 	if (!p)
 		goto err_param;
@@ -2739,16 +2740,27 @@ static int __init reserve_mem(char *p)
 	if (*p != ':')
 		goto err_param;
 
-	align = memparse(p+1, &p);
+	/* parse the static memory address */
+	if (*p == '@') {
+		start = memparse(p+1, &p);
+		addr_is_static = true;
+	}
+
 	if (*p != ':')
 		goto err_param;
 
-	/*
-	 * memblock_phys_alloc() doesn't like a zero size align,
-	 * but it is OK for this command to have it.
-	 */
-	if (align < SMP_CACHE_BYTES)
-		align = SMP_CACHE_BYTES;
+	if (!addr_is_static) {
+		align = memparse(p+1, &p);
+		if (*p != ':')
+			goto err_param;
+
+		/*
+		 * memblock_phys_alloc() doesn't like a zero size align,
+		 * but it is OK for this command to have it.
+		 */
+		if (align < SMP_CACHE_BYTES)
+			align = SMP_CACHE_BYTES;
+	}
 
 	name = p + 1;
 	len = strlen(name);
@@ -2772,16 +2784,24 @@ static int __init reserve_mem(char *p)
 	}
 
 	/* Pick previous allocations up from KHO if available */
-	if (reserve_mem_kho_revive(name, size, align))
+	if (!addr_is_static && reserve_mem_kho_revive(name, size, align))
 		return 1;
 
-	/* TODO: Allocation must be outside of scratch region */
-	start = memblock_phys_alloc(size, align);
-	if (!start) {
-		pr_err("reserve_mem: memblock allocation failed\n");
-		return -ENOMEM;
-	}
+	if (addr_is_static) {
+		if (memblock_reserve(start, size)) {
+			pr_err("reserve_mem: memblock reservation failed\n");
+			return -ENOMEM;
+		}
+
+	} else {
+		/* TODO: Allocation must be outside of scratch region */
+		start = memblock_phys_alloc(size, align);
+		if (!start) {
+			pr_err("reserve_mem: memblock allocation failed\n");
+			return -ENOMEM;
+		}
 
+	}
 	reserved_mem_add(start, size, name);
 
 	return 1;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH 2/3] irqchip/gic-v3: Add Renesas R-Car Gen4 erratum workaround
From: Marek Vasut @ 2026-06-18 21:54 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Marek Vasut, linux-pci, Yoshihiro Shimoda,
	Krzysztof Wilczyński, Bjorn Helgaas, Catalin Marinas,
	Conor Dooley, Geert Uytterhoeven, Krzysztof Kozlowski,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Rob Herring, devicetree,
	linux-arm-kernel, linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <86ldccs0oj.wl-maz@kernel.org>

On 6/18/26 10:38 AM, Marc Zyngier wrote:

Hello Marc,

>>>> Renesas R-Car S4/V4H/V4M GIC600 integration has address width for AXI
>>>> or APB interface configured to 32 bit, it can therefore access only
>>>> the first 4 GiB of physical address space. This information comes from
>>>> R-Car V4H Interface Specification sheet, there is currently no technical
>>>> update number assigned to this limitation. Further input from hardware
>>>> engineer indicates that this limitation also applies to R-Car S4 and V4M.
>>>> Name the limitation GEN4GICITS1, and add a driver quirk to mitigate this
>>>> limitation.
>>
>> My concern is this ^ , I do not have an erratum number, because there
>> isn't one. I am in touch with the hardware engineer and I did get a
>> glimpse at internal details of the three SoC, which confirm the
>> limitations. Is this sufficient ?
> 
> To be honest, this is between you and the SoC vendor. I'll take
> whatever symbol you come up with at face value, and will assume that
> the vendor agrees with it. After all, they are on Cc and have their
> SoB on the patch.

All right.

>>>> Note that the 0x0201743b GIC600 ID is not Renesas-specific, it is
>>>> common for many ARM GICv3 implementations. Therefore, add an extra
>>>
>>> Not quite. It designates GIC600 unambiguously.
>>
>> What I am trying to communicate is, that the 0x0201743b ID is not ID
>> of the Renesas GIC implementation, but it is a generic ARM GIC600
>> ID. That is why we cannot match the quirk on the ID (it is generic ARM
>> GIC600 ID), and instead we have to match the quirk on the [ ID
>> combined with of_machine_is_compatible("renesas,...") ].
> 
> This is understood, and is no different from the other broken
> platforms in the tree.
> 
>>
>>> It is just that GIC600
>>> is integrated in zillions of SoCs, most of which don't have this
>>> problem (the machine I'm typing this from has a GIC600 *and* 96GB of
>>> RAM).
>>
>> Right.
>>
>> Shall I reword this paragraph somehow to make it clearer ?
> 
> I'd simply say that the workaround is keyed on the combination of the
> GIC implementation and the platform identification in the device tree.

OK

>>>> of_machine_is_compatible() check.
>>>>
>>>> The GIC600 implementation in R-Car S4/V4H/V4M is r1p6.
>>>
>>> Is this relevant?
>>
>> I included it for the sake of completeness and to provide all relevant
>> information, based on previous discussions about similar limitations
>> that I could find on lore.k.o
> 
> This information is already contained in the ID you quote (bits
> [19:12]), and can be decoded using the public TRM [1].
I'll drop it then.

Thanks

^ permalink raw reply

* Re: [PATCH 1/3] PCI: rcar-gen4: Configure AXIINTC if iMSI-RX not used
From: Marek Vasut @ 2026-06-18 21:53 UTC (permalink / raw)
  To: Marc Zyngier, Marek Vasut
  Cc: linux-pci, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Rob Herring, devicetree, linux-arm-kernel, linux-doc,
	linux-kernel, linux-renesas-soc
In-Reply-To: <8633yltylq.wl-maz@kernel.org>

On 6/17/26 9:28 AM, Marc Zyngier wrote:

[...]

>> +/* INTC address */
>> +#define AXIINTCADDR		0x0a00
>> +/* GITS GIC ITS translation register */
>> +#define AXIINTCADDR_VAL		0xf1050000
> 
> Wouldn't it be preferable to source the address from the device tree,
> rather than hardcoding this?
It would, I will do so in V2.

^ permalink raw reply

* [PATCH v2 4/4] arm64: dts: renesas: r8a779g0: Add GICv3 ITS and update PCIe nodes
From: Marek Vasut @ 2026-06-18 22:02 UTC (permalink / raw)
  To: linux-pci
  Cc: Marek Vasut, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Marc Zyngier, Rob Herring, devicetree, linux-arm-kernel,
	linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <20260618220427.14325-1-marek.vasut+renesas@mailbox.org>

This SoC implements GIC600 with GICv3 ITS and PCIe host mode on this
SoC can use it. Add GIC ITS node into GIC node, update interrupt-map
and add msi-map into PCIe controller node.

The GIC ITS does have master interface to issue transactions to RAM.
The interface does support cacheable transactions, however, it does
not support shareable attribute, because the AXI port signals are tied
to inactive in this implementation. Therefore, add "dma-noncoherent"
DT property into the GIC ITS subnode.

The GIC redistributor does not have cacheable/shareable, therefore
add "dma-noncoherent" DT property into the GIC node.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
---
NOTE: This would not be possible without prior work from Shimoda-san
      https://lore.kernel.org/all/20240214052144.1966569-1-yoshihiro.shimoda.uh@renesas.com/
---
Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-renesas-soc@vger.kernel.org
---
V2: No change
---
 arch/arm64/boot/dts/renesas/r8a779g0.dtsi | 31 ++++++++++++++++-------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/boot/dts/renesas/r8a779g0.dtsi b/arch/arm64/boot/dts/renesas/r8a779g0.dtsi
index 3a8af825bb253..82e864acf2601 100644
--- a/arch/arm64/boot/dts/renesas/r8a779g0.dtsi
+++ b/arch/arm64/boot/dts/renesas/r8a779g0.dtsi
@@ -792,6 +792,7 @@ pciec0: pcie@e65d0000 {
 			resets = <&cpg 624>;
 			reset-names = "pwr";
 			max-link-speed = <4>;
+			msi-parent = <&its>;
 			num-lanes = <2>;
 			#address-cells = <3>;
 			#size-cells = <2>;
@@ -802,10 +803,10 @@ pciec0: pcie@e65d0000 {
 			dma-ranges = <0x42000000 0 0x00000000 0 0x00000000 1 0x00000000>;
 			#interrupt-cells = <1>;
 			interrupt-map-mask = <0 0 0 7>;
-			interrupt-map = <0 0 0 1 &gic GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>,
-					<0 0 0 2 &gic GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>,
-					<0 0 0 3 &gic GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>,
-					<0 0 0 4 &gic GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>;
+			interrupt-map = <0 0 0 1 &gic 0 0 GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>,
+					<0 0 0 2 &gic 0 0 GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>,
+					<0 0 0 3 &gic 0 0 GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>,
+					<0 0 0 4 &gic 0 0 GIC_SPI 449 IRQ_TYPE_LEVEL_HIGH>;
 			snps,enable-cdm-check;
 			status = "disabled";
 
@@ -839,6 +840,7 @@ pciec1: pcie@e65d8000 {
 			resets = <&cpg 625>;
 			reset-names = "pwr";
 			max-link-speed = <4>;
+			msi-parent = <&its>;
 			num-lanes = <2>;
 			#address-cells = <3>;
 			#size-cells = <2>;
@@ -849,10 +851,10 @@ pciec1: pcie@e65d8000 {
 			dma-ranges = <0x42000000 0 0x00000000 0 0x00000000 1 0x00000000>;
 			#interrupt-cells = <1>;
 			interrupt-map-mask = <0 0 0 7>;
-			interrupt-map = <0 0 0 1 &gic GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>,
-					<0 0 0 2 &gic GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>,
-					<0 0 0 3 &gic GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>,
-					<0 0 0 4 &gic GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>;
+			interrupt-map = <0 0 0 1 &gic 0 0 GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>,
+					<0 0 0 2 &gic 0 0 GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>,
+					<0 0 0 3 &gic 0 0 GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>,
+					<0 0 0 4 &gic 0 0 GIC_SPI 456 IRQ_TYPE_LEVEL_HIGH>;
 			snps,enable-cdm-check;
 			status = "disabled";
 
@@ -2131,11 +2133,22 @@ ipmmu_mm: iommu@eefc0000 {
 		gic: interrupt-controller@f1000000 {
 			compatible = "arm,gic-v3";
 			#interrupt-cells = <3>;
-			#address-cells = <0>;
+			#address-cells = <2>;
+			#size-cells = <2>;
 			interrupt-controller;
 			reg = <0x0 0xf1000000 0 0x20000>,
 			      <0x0 0xf1060000 0 0x110000>;
 			interrupts = <GIC_PPI 9 IRQ_TYPE_LEVEL_HIGH>;
+			dma-noncoherent;
+
+			ranges = <0x0 0x0 0x0 0xf1000000 0x0 0x200000>;
+
+			its: msi-controller@40000 {
+				compatible = "arm,gic-v3-its";
+				reg = <0x0 0x40000 0x0 0x20000>;
+				dma-noncoherent;
+				msi-controller;
+			};
 		};
 
 		gpu: gpu@fd000000 {
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 3/4] irqchip/gic-v3: Add Renesas R-Car Gen4 erratum workaround
From: Marek Vasut @ 2026-06-18 22:02 UTC (permalink / raw)
  To: linux-pci
  Cc: Marek Vasut, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Marc Zyngier, Rob Herring, devicetree, linux-arm-kernel,
	linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <20260618220427.14325-1-marek.vasut+renesas@mailbox.org>

Renesas R-Car S4/V4H/V4M GIC600 integration has address width for AXI
or APB interface configured to 32 bit, it can therefore access only
the first 4 GiB of physical address space. This information comes from
R-Car V4H Interface Specification sheet, there is currently no technical
update number assigned to this limitation. Further input from hardware
engineer indicates that this limitation also applies to R-Car S4 and V4M.
Name the limitation GEN4GICITS1, and add a driver quirk to mitigate this
limitation.

The quirk is keyed on the combination of the GIC implementation
and the platform identification in the device tree.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
---
NOTE: This would not be possible without prior work from Shimoda-san
      https://lore.kernel.org/all/20240214052050.1966439-1-yoshihiro.shimoda.uh@renesas.com/
---
Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-renesas-soc@vger.kernel.org
---
V2: Minimize the patch based on newly added patch in the series and
    only add entries into dma_32bit_impaired_platforms array. Update
    second paragraph of commit message slightly.
---
 Documentation/arch/arm64/silicon-errata.rst | 1 +
 arch/arm64/Kconfig                          | 9 +++++++++
 drivers/irqchip/irq-gic-v3-its.c            | 5 +++++
 3 files changed, 15 insertions(+)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index 014aa1c215a16..b0c68b64f5ac2 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -352,6 +352,7 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | Qualcomm Tech. | Kryo4xx Gold    | N/A             | ARM64_ERRATUM_1286807       |
 +----------------+-----------------+-----------------+-----------------------------+
+| Renesas        | S4/V4H/V4M      | N/A             | RENESAS_ERRATUM_GEN4GICITS1 |
 +----------------+-----------------+-----------------+-----------------------------+
 | Rockchip       | RK3588          | #3588001        | ROCKCHIP_ERRATUM_3588001    |
 +----------------+-----------------+-----------------+-----------------------------+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b3afe0688919b..b9e17ce475e61 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1382,6 +1382,15 @@ config NVIDIA_CARMEL_CNP_ERRATUM
 
 	  If unsure, say Y.
 
+config RENESAS_ERRATUM_GEN4GICITS1
+	bool "Renesas R-Car Gen4: GIC600 can not access physical addresses above 4 GiB"
+	default y
+	help
+	  The Renesas R-Car Gen4 S4/V4H/V4M GIC600 SoC integrations have AXI
+	  addressing limited to the first 32-bit of physical address space.
+
+	  If unsure, say Y.
+
 config ROCKCHIP_ERRATUM_3568002
 	bool "Rockchip 3568002: GIC600 can not access physical addresses higher than 4GB"
 	default y
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 8ec2175b78288..96b26822e5bd2 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -4891,6 +4891,11 @@ static bool __maybe_unused its_enable_quirk_hip09_162100801(void *data)
 }
 
 static const char * const dma_32bit_impaired_platforms[] = {
+#ifdef CONFIG_RENESAS_ERRATUM_GEN4GICITS1
+	"renesas,r8a779f0",
+	"renesas,r8a779g0",
+	"renesas,r8a779h0",
+#endif
 #ifdef CONFIG_ROCKCHIP_ERRATUM_3568002
 	"rockchip,rk3566",
 	"rockchip,rk3568",
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 1/4] PCI: rcar-gen4: Configure AXIINTC if iMSI-RX not used
From: Marek Vasut @ 2026-06-18 22:01 UTC (permalink / raw)
  To: linux-pci
  Cc: Marek Vasut, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Marc Zyngier, Rob Herring, devicetree, linux-arm-kernel,
	linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <20260618220427.14325-1-marek.vasut+renesas@mailbox.org>

In case MSI are enabled, but DWC built-in iMSI-RX is not in use, the
MSI are handled via GIC ITS. Configure all controller MSI registers
fully.

Set or clear MSI capability register MSICAP0 MSI enable MSIE bit and
PCIe Interrupt Status 0 Enable register PCIEINTSTS0EN MSI interrupt
enable MSI_CTRL_INT bit according to MSI enable state, set both bits
if MSI are enabled, clear both bits if MSI are disabled.

If MSI are disabled, or MSI are enabled and iMSI-RX is used, then
deconfigure AXIINTCADDR and AXIINTCCONT to 0, which disables any
pass through of MSI TLPs onto the AXI bus and then further into
GIC ITS translation registers.

If MSI are enabled and iMSI-RX is not used, the configure AXIINTCADDR
with target address of GIC ITS translation registers, and configure
AXIINTCCONT to enable MSI TLP pass through onto AXI bus and into the
GIC ITS. This specific configuration allows handling of MSI via the
GIC ITS instead of integrated iMSI-RX.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
---
NOTE: This would not be possible without prior work from Shimoda-san
---
Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-renesas-soc@vger.kernel.org
---
V2: Pull GITS_TRANSLATER address from DT, which also fixes missing +0x40
    offset of the GITS_TRANSLATER register
---
 drivers/pci/controller/dwc/pcie-rcar-gen4.c | 118 +++++++++++++++++++-
 1 file changed, 113 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/controller/dwc/pcie-rcar-gen4.c b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
index 8b03c42f8c84c..6300ab4dc38b3 100644
--- a/drivers/pci/controller/dwc/pcie-rcar-gen4.c
+++ b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
@@ -13,8 +13,11 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/iopoll.h>
+#include <linux/irqchip/arm-gic-v3.h>
 #include <linux/module.h>
 #include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
 #include <linux/pci.h>
 #include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
@@ -31,6 +34,10 @@
 #define DEVICE_TYPE_RC		BIT(4)
 #define BIFUR_MOD_SET_ON	BIT(0)
 
+/* MSI Capability */
+#define MSICAP0			0x0050
+#define MSICAP0_MSIE		BIT(16)
+
 /* PCIe Interrupt Status 0 */
 #define PCIEINTSTS0		0x0084
 
@@ -55,6 +62,14 @@
 #define APP_HOLD_PHY_RST	BIT(16)
 #define APP_LTSSM_ENABLE	BIT(0)
 
+/* INTC address */
+#define AXIINTCADDR		0x0a00
+
+/* INTC control & mask */
+#define AXIINTCCONT		0x0a04
+#define INTC_EN			BIT(31)
+#define INTC_MASK		GENMASK(11, 2)
+
 /* PCIe Power Management Control */
 #define PCIEPWRMNGCTRL		0x0070
 #define APP_CLK_REQ_N		BIT(11)
@@ -305,13 +320,103 @@ static struct rcar_gen4_pcie *rcar_gen4_pcie_alloc(struct platform_device *pdev)
 	return rcar;
 }
 
+static int rcar_gen4_pcie_host_msi_addr(struct dw_pcie_rp *pp, u32 *msi_addr)
+{
+	struct dw_pcie *dw = to_dw_pcie_from_pp(pp);
+	struct device_node *msi_node = NULL;
+	struct device *dev = dw->dev;
+	struct resource res;
+	u64 addr;
+	int ret;
+
+	/*
+	 * Either the "msi-parent" or the "msi-map" phandle needs to exist
+	 * to obtain the MSI node.
+	 */
+	of_msi_xlate(dev, &msi_node, 0);
+	if (!msi_node)
+		return -ENODEV;
+
+	/* Check if "msi-parent" or the "msi-map" points to ARM GICv3 ITS. */
+	if (!of_device_is_compatible(msi_node, "arm,gic-v3-its"))
+		return dev_err_probe(dev, -ENODEV, "Compatible MSI controller not found\n");
+
+	/* Derive GITS_TRANSLATER address from GICv3 */
+	ret = of_address_to_resource(msi_node, 0, &res);
+	if (ret < 0)
+		return dev_err_probe(dev, ret, "MSI controller resources not obtained\n");
+
+	addr = res.start + GITS_TRANSLATER;
+	if (addr >= SZ_4G)
+		return dev_err_probe(dev, -EINVAL, "MSI controller address above 32bit range\n");
+
+	*msi_addr = addr;
+	return 0;
+}
+
+static int rcar_gen4_pcie_host_msi_init(struct dw_pcie_rp *pp)
+{
+	struct dw_pcie *dw = to_dw_pcie_from_pp(pp);
+	struct rcar_gen4_pcie *rcar = to_rcar_gen4_pcie(dw);
+	u32 val;
+	int ret;
+
+	/* Make sure MSICAP0 MSIE is configured. */
+	val = dw_pcie_readl_dbi(dw, MSICAP0);
+	if (pci_msi_enabled())
+		val |= MSICAP0_MSIE;
+	else
+		val &= ~MSICAP0_MSIE;
+	dw_pcie_writel_dbi(dw, MSICAP0, val);
+
+	if (!pci_msi_enabled() || pp->use_imsi_rx) {
+		/* Clear AXIINTC mapping. */
+		writel(0, rcar->base + AXIINTCADDR);
+		writel(0, rcar->base + AXIINTCCONT);
+	} else {
+		ret = rcar_gen4_pcie_host_msi_addr(pp, &val);
+		if (ret)
+			goto err;
+
+		/* Point AXIINTC to GIC ITS and enable. */
+		writel(val, rcar->base + AXIINTCADDR);
+		writel(INTC_EN | INTC_MASK, rcar->base + AXIINTCCONT);
+	}
+
+	/* Configure MSI interrupt signal */
+	val = readl(rcar->base + PCIEINTSTS0EN);
+	if (pci_msi_enabled())
+		val |= MSI_CTRL_INT;
+	else
+		val &= ~MSI_CTRL_INT;
+	writel(val, rcar->base + PCIEINTSTS0EN);
+
+	return 0;
+
+err:
+	/* Deconfigure MSICAP0 MSIE. */
+	val = dw_pcie_readl_dbi(dw, MSICAP0);
+	val &= ~MSICAP0_MSIE;
+	dw_pcie_writel_dbi(dw, MSICAP0, val);
+
+	/* Clear AXIINTC mapping. */
+	writel(0, rcar->base + AXIINTCADDR);
+	writel(0, rcar->base + AXIINTCCONT);
+
+	/* Deconfigure MSI interrupt signal */
+	val = readl(rcar->base + PCIEINTSTS0EN);
+	val &= ~MSI_CTRL_INT;
+	writel(val, rcar->base + PCIEINTSTS0EN);
+
+	return ret;
+}
+
 /* Host mode */
 static int rcar_gen4_pcie_host_init(struct dw_pcie_rp *pp)
 {
 	struct dw_pcie *dw = to_dw_pcie_from_pp(pp);
 	struct rcar_gen4_pcie *rcar = to_rcar_gen4_pcie(dw);
 	int ret;
-	u32 val;
 
 	gpiod_set_value_cansleep(dw->pe_rst, 1);
 
@@ -328,16 +433,19 @@ static int rcar_gen4_pcie_host_init(struct dw_pcie_rp *pp)
 	dw_pcie_writel_dbi2(dw, PCI_BASE_ADDRESS_0, 0x0);
 	dw_pcie_writel_dbi2(dw, PCI_BASE_ADDRESS_1, 0x0);
 
-	/* Enable MSI interrupt signal */
-	val = readl(rcar->base + PCIEINTSTS0EN);
-	val |= MSI_CTRL_INT;
-	writel(val, rcar->base + PCIEINTSTS0EN);
+	ret = rcar_gen4_pcie_host_msi_init(pp);
+	if (ret)
+		goto err;
 
 	msleep(PCIE_T_PVPERL_MS);	/* pe_rst requires 100msec delay */
 
 	gpiod_set_value_cansleep(dw->pe_rst, 0);
 
 	return 0;
+
+err:
+	rcar_gen4_pcie_common_deinit(rcar);
+	return ret;
 }
 
 static void rcar_gen4_pcie_host_deinit(struct dw_pcie_rp *pp)
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 2/4] irqchip/gic-v3: Refactor GIC600 limited to 32bit PA erratum handling
From: Marek Vasut @ 2026-06-18 22:02 UTC (permalink / raw)
  To: linux-pci
  Cc: Marek Vasut, Marc Zyngier, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Rob Herring, Yoshihiro Shimoda, devicetree, linux-arm-kernel,
	linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <20260618220427.14325-1-marek.vasut+renesas@mailbox.org>

The GIC600 implementation is now known to be used on multiple 64-bit
SoCs, where it has address width for AXI or APB interface configured
to 32 bit, and it can access only the first 4GiB of physical address
space.

Rework the handling of the quirk to work around this limitation such
that new entries can be added purely as new compatible strings, with
no need to add additional functions or new its_quirk array entries.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
---
V2: New patch
---
Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-renesas-soc@vger.kernel.org
---
 drivers/irqchip/irq-gic-v3-its.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index b57d81ad33a0a..8ec2175b78288 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -4890,10 +4890,17 @@ static bool __maybe_unused its_enable_quirk_hip09_162100801(void *data)
 	return true;
 }
 
-static bool __maybe_unused its_enable_rk3568002(void *data)
+static const char * const dma_32bit_impaired_platforms[] = {
+#ifdef CONFIG_ROCKCHIP_ERRATUM_3568002
+	"rockchip,rk3566",
+	"rockchip,rk3568",
+#endif
+	NULL,
+};
+
+static bool __maybe_unused its_enable_dma32(void *data)
 {
-	if (!of_machine_is_compatible("rockchip,rk3566") &&
-	    !of_machine_is_compatible("rockchip,rk3568"))
+	if (!of_machine_compatible_match(dma_32bit_impaired_platforms))
 		return false;
 
 	gfp_flags_quirk |= GFP_DMA32;
@@ -4968,14 +4975,12 @@ static const struct gic_quirk its_quirks[] = {
 		.property = "dma-noncoherent",
 		.init   = its_set_non_coherent,
 	},
-#ifdef CONFIG_ROCKCHIP_ERRATUM_3568002
 	{
-		.desc   = "ITS: Rockchip erratum RK3568002",
+		.desc   = "ITS: Broken GIC600 integration limited to 32bit PA",
 		.iidr   = 0x0201743b,
 		.mask   = 0xffffffff,
-		.init   = its_enable_rk3568002,
+		.init   = its_enable_dma32,
 	},
-#endif
 	{
 	}
 };
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 0/4] PCI: rcar-gen4: irqchip/gic-v3: Handle GIC ITS
From: Marek Vasut @ 2026-06-18 22:01 UTC (permalink / raw)
  To: linux-pci
  Cc: Marek Vasut, Krzysztof Wilczyński, Bjorn Helgaas,
	Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Marc Zyngier, Rob Herring, Yoshihiro Shimoda, devicetree,
	linux-arm-kernel, linux-doc, linux-kernel, linux-renesas-soc

Configure all R-Car Gen4 PCIe controller MSI registers fully, both in
case MSI are enabled and disabled.

Patch GIC ITS driver and add quirks for R-Car Gen4 GIC ITS, which is
configured to 32-bit address width for AXI or APB interface.

Switch R-Car V4H to use GIC ITS in its DT and describe the GIC ITS
implementation cacheable and shareable limitations.

Marek Vasut (4):
  PCI: rcar-gen4: Configure AXIINTC if iMSI-RX not used
  irqchip/gic-v3: Refactor GIC600 limited to 32bit PA erratum handling
  irqchip/gic-v3: Add Renesas R-Car Gen4 erratum workaround
  arm64: dts: renesas: r8a779g0: Add GICv3 ITS and update PCIe nodes

 Documentation/arch/arm64/silicon-errata.rst |   1 +
 arch/arm64/Kconfig                          |   9 ++
 arch/arm64/boot/dts/renesas/r8a779g0.dtsi   |  31 +++--
 drivers/irqchip/irq-gic-v3-its.c            |  24 ++--
 drivers/pci/controller/dwc/pcie-rcar-gen4.c | 118 +++++++++++++++++++-
 5 files changed, 162 insertions(+), 21 deletions(-)

---
Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-renesas-soc@vger.kernel.org

-- 
2.53.0


^ permalink raw reply

* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Thomas Gleixner @ 2026-06-18 21:11 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Waiman Long
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <87cxxnegqa.ffs@fw13>

On Thu, Jun 18 2026 at 22:27, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> +		 */
>> +		if (irqd_affinity_is_managed(&desc->irq_data)) {
>
> So you set the affinity even on an interrupt which is shutdown?
>
>> +			const struct cpumask *mask;
>> +			struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);

How is this correct? You cannot get the per cpu pointer in preemptible
context. The task might be migrated and then fiddle with the wrong
per CPU data. But that's moot as this code is broken anyway.



^ permalink raw reply

* Re: [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
From: Thomas Gleixner @ 2026-06-18 21:01 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <871pe3de9b.ffs@fw13>

On Thu, Jun 18 2026 at 18:06, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> Add CPUHP_AP_NO_HZ_FULL_DYING and CPUHP_AP_IRQ_AFFINITY_DYING to the
>> cpuhp_state enum.  These dying callbacks are invoked during CPU offline
>> before the tick is stopped, enabling clean tick handover and managed
>> IRQ migration when a CPU transitions between isolated and housekeeping
>> states.
>>
>> The existing CPUHP_AP_IRQ_AFFINITY_ONLINE already handles managed IRQ
>> restoration on CPU online.  The new dying callback completes the pair,
>> migrating managed interrupts away from the CPU before it goes down.
>
> What? They are migrated away today already when the CPU goes down unless
> the CPU is the last one in the affinity set of the interrupt. So why do
> you need a new step for something which already exists?

Aside of that these hotplug states are not used at all. So what is this
patch for?


^ permalink raw reply

* Re: [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation
From: Thomas Gleixner @ 2026-06-18 20:55 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-11-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>  
>  	if (update_housekeeping) {
> +		static const unsigned long noise_types =
> +			BIT(HK_TYPE_KERNEL_NOISE) | BIT(HK_TYPE_MANAGED_IRQ);
> +
>  		update_housekeeping = false;
>  		cpumask_copy(isolated_hk_cpus, isolated_cpus);
>  
> -		/*
> -		 * housekeeping_update() is now called without holding
> -		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
> -		 * is still being held for mutual exclusion.
> -		 */

Why are you randomly removing useful comments?

^ permalink raw reply

* Re: [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu
From: Thomas Gleixner @ 2026-06-18 20:50 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-10-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> sched_tick_start() and sched_tick_stop() are called during CPU hotplug
> for CPUs not in the HK_TYPE_KERNEL_NOISE set.  They dereference
> tick_work_cpu, which is allocated by sched_tick_offload_init() and only
> called from housekeeping_init() when nohz_full= is present at boot.
>
> When the DHM subsystem first-enables HK_TYPE_KERNEL_NOISE at runtime via
> housekeeping_update_types(), tick_work_cpu remains NULL because
> sched_tick_offload_init() is __init-only and cannot be re-invoked.  A
> subsequent CPU offline/online cycle for an isolated CPU triggers
> WARN_ON_ONCE(!tick_work_cpu) followed by a NULL-pointer dereference in
> per_cpu_ptr(tick_work_cpu, cpu), crashing the kernel.
>
> Since nohz_full= was not active at boot, tick_nohz_full_running remains
> false and the tick-offload infrastructure is never activated; isolated
> CPUs continue to receive their own ticks.  Guard both helpers with an
> additional !tick_work_cpu check so they become no-ops in this case.

This is the same fake functionality as with the tick itself. Seriously?

> -	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
>  		return;
>  
>  	WARN_ON_ONCE(!tick_work_cpu);
> @@ -5799,7 +5799,7 @@ static void sched_tick_stop(int cpu)
>  	struct tick_work *twork;
>  	int os;
>  
> -	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
>  		return;
>  
>  	WARN_ON_ONCE(!tick_work_cpu);

Brilliant stuff that. Guard against tick_work_cpu == NULL and then keep
the WARN_ON() there, which became completely pointless.

But that's all just mindless tinkering and fixing the symptoms.

If all of this is runtime managed, then all the initialization needs to
be made unconditional. Yes, that wastes a few bytes of memory per CPU if
it's not used, but avoids these completely inconsistent hacks all over
the place and provides a coherent user interface.

Stop trying to duct tape this in. This needs more thoughts than just
sprinkling works a few works for me hacks all over the place.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Thomas Gleixner @ 2026-06-18 20:27 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Waiman Long
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-8-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> +
> +/*
> + * Managed IRQ housekeeping callback: iterate all managed IRQs and ask

S/IRQ/interrupt/ 

> + * the chip to move them off CPUs newly removed from HK_TYPE_MANAGED_IRQ.

Also this doesn't ask the chip to move it.

> + */
> +static void irq_hk_apply(enum hk_type type)
> +{
> +	cpumask_var_t hk_mask;
> +	struct irq_desc *desc;
> +	unsigned int irq;
> +
> +	if (!alloc_cpumask_var(&hk_mask, GFP_KERNEL))
> +		return;
> +
> +	/*
> +	 * Snapshot the new HK_TYPE_MANAGED_IRQ mask under an RCU read lock
> +	 * before iterating IRQ descriptors.  The lockdep annotation in
> +	 * housekeeping_cpumask() requires an RCU read-side critical section
> +	 * for runtime-mutable types.
> +	 */
> +	rcu_read_lock();
> +	cpumask_copy(hk_mask, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
> +	rcu_read_unlock();

Same comments as in the nohz patch.

> +
> +	irq_lock_sparse();
> +
> +	for_each_active_irq(irq) {
> +		desc = irq_to_desc(irq);
> +		if (!desc || !desc->action)
> +			continue;
> +

	for (unsigned int irq = 0; irq < total_nr_irqs; irq++) {
                struct irq_desc *desc;

                 scoped_guard(rcu)
                 	desc = irq_find_desc_at_or_after(irq);
                 ....

> +		/*
> +		 * Only managed interrupts are selected: they have
> +		 * IRQF_AFFINITY_MANAGED set, meaning the kernel owns their
> +		 * affinity.  User-controlled IRQs are intentionally skipped.
> +		 *
> +		 * When the intersection of the current affinity mask and the
> +		 * new housekeeping mask is non-empty, re-apply the restricted
> +		 * affinity to migrate the IRQ away from newly isolated CPUs.
> +		 * If the intersection is empty (all serving CPUs are now
> +		 * isolated), the IRQ is left on its current CPU temporarily;
> +		 * handling that case (IRQ shutdown / re-startup) is left for
> +		 * a follow-up.

Oh well...

> +		 */
> +		if (irqd_affinity_is_managed(&desc->irq_data)) {

So you set the affinity even on an interrupt which is shutdown?

> +			const struct cpumask *mask;
> +			struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);
> +
> +			raw_spin_lock_irq(&desc->lock);

                        guard()

> +			mask = irq_data_get_affinity_mask(&desc->irq_data);
> +			cpumask_and(tmp, mask, hk_mask);
> +			if (cpumask_intersects(tmp, cpu_online_mask))
> +				irq_do_set_affinity(&desc->irq_data, tmp, false);

That's completely broken. You _cannot_ change the affinity mask of a
managed interrupt. The mask itself is immutable.

The effective affinity can be changed by invoking the affinity setter
with the original unmodified mask. irq_do_set_affinity() already deals
with the housekeeping mask.

Also invoking irq_do_set_affinity() directly here is just wrong. It
breaks interrupts which cannot be moved in process context.

But even if that is fixed, then there is zero coordination with the
affected drivers/subsystems. Managed interrupts are related to device
and block queues and you cannot change one without the other. Neither
can you stop managed interrupts without quiescing the related device
queue. Starting them up requires also to reenable the device queue.

This problem needs to be fixed no matter what. See below.

> +static int irq_hk_validate(enum hk_type type,
> +			   const struct cpumask *cur_mask,
> +			   const struct cpumask *new_mask)
> +{
> +	if (!IS_ENABLED(CONFIG_SMP))
> +		return -EOPNOTSUPP;
> +	return 0;

Seriously? Why is this stuff even built when CONFIG_SMP=n?

So these validate callback seem to be just another voodoo container for
no value.

While this series might work for you by some definition of "works", it's
broken beyond repair and it's really annoying that I explained all of it
to the other people who try to solve that very same problem. Of course
you did not read any of that otherwise you would have CC'ed them.

     https://lore.kernel.org/lkml/87o6jcb84w.ffs@tglx

Trying to do that without taking the CPUs mostly offline and bringing
them online again is not going to work and there is zero benefit trying
to avoid that. First of all changing the isolation is not a hotpath
operation. Doing it one by one without bringing the CPU completely down
as I outlined in the above linked mail is not much more disruptive than
trying to do all of this on the fly. If you isolate a CPU then the tasks
on that CPU which do not belong to the isolation set need to get off the
CPU anyway. If you unisolate a CPU then it's really not a problem
whether the non-isolated tasks can move on it 10 milliseconds earlier or
later.

If you want to solve all the problems related to NOHZ, managed
interrupts, RCU etc. without the hotplug machinery then you end up
replicating half of it. Don't even try to think about it, that's a
complete waste of time and won't go anywhere.

Fix the few issues which are related to hotplug that I described in the
above linked mail and use the fully correct and tested common code for
your isolation muck. Please coordinate with Waiman or whoever is working
on it at RH right now.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
From: Thomas Gleixner @ 2026-06-18 19:49 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <87ik7fep2j.ffs@fw13>

On Thu, Jun 18 2026 at 19:27, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> Remove __init from ct_cpu_track_user() and __initdata from the
>> initialized flag so context tracking can be activated on CPUs that
>> join nohz_full at runtime.  Drop the __ro_after_init attribute from
>> the context_tracking_key static key, allowing static_branch_dec()
>> when a CPU leaves nohz_full.
>>
>> Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
>> the static key and clearing the per-CPU tracking state.
>
> Please do not enumerate WHAT the patch is doing. Explain the context and
> the WHY
>
>   https://docs.kernel.org/process/maintainer-tip.html#changelog

Just for the record. I told your colleague the same thing already....

^ permalink raw reply

* Re: htmldocs: Documentation/scheduler/sched-arch.rst:108: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
From: Randy Dunlap @ 2026-06-18 19:33 UTC (permalink / raw)
  To: Shrikanth Hegde, kernel test robot; +Cc: oe-kbuild-all, linux-doc
In-Reply-To: <f1a4c4c7-9ad8-40f5-b1a9-ba631977dac6@linux.ibm.com>



On 6/17/26 10:19 PM, Shrikanth Hegde wrote:
> 
> 
> On 6/18/26 10:40 AM, kernel test robot wrote:
>> tree:   https://github.com/intel-lab-lkp/linux/commits/Shrikanth-Hegde/sched-debug-Remove-unused-schedstats/20260618-031604
>> head:   bcb0c494e4af36dd6306a5a1839a0c03046053af
>> commit: 4c29e4f3ba22adc04fc456620f2c6abf539d76df sched/docs: Document cpu_preferred_mask and Preferred CPU concept
>> date:   10 hours ago
>> compiler: clang version 22.1.8 (https://github.com/llvm/llvm-project ca7933e47d3a3451d81e72ac174dcb5aa28b59d1)
>> docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
>> reproduce: (https://download.01.org/0day-ci/archive/20260618/202606180717.yNM0yb41-lkp@intel.com/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <lkp@intel.com>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
>>
>> All warnings (new ones prefixed by >>):
>>
>>     Checksumming on output with GSO
>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [docutils]
>>     MAINTAINERS:40: WARNING: Inline strong start-string without end-string. [docutils]

>>     Documentation/scheduler/sched-arch.rst:107: ERROR: Unexpected indentation. [docutils]
>>>> Documentation/scheduler/sched-arch.rst:108: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
>>     Documentation/userspace-api/landlock:504: ./security/landlock/errata/abi-4.h:5: ERROR: Unexpected section title.
>>
>>
>> vim +108 Documentation/scheduler/sched-arch.rst
>>
>>     102   
>>     103    Notes:
>>     104    1. This feature is available under CONFIG_PREFERRED_CPU
>>     105    2. This feature works for FAIR class only.
>>     106    3. A task pinned, which can't be moved to preferred CPUs will continue
>>     107       to run based on its affinity. But no load balancing happens
> 
> is it flagging here due to missing . ?

No, but you could add that anyway.

>>   > 108    4. If needed, steal time based governors/arch dependent method
>>     109       could be used to cater to different types of cpu numbers.
>>     110       Arch can do so by implementing its own hooks.
>>     111    5. Decision to use/not use is driven by kernel. Hence it shouldn't
>>     112       break user affinities. One of the main reason why CPU hotplug
>>     113       or Isolated cpuset partitions was not a solution.
>>     114   
It wants a blank line between each list item (if the list items are multi-line).
For the list above this one (3 items, all single line), blank lines aren't needed.
[These comments come from testing, not reading specs.]

I made these changes and a couple of others to make the rendered html look
reasonable.

Use (or not).
---
From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: linux-kernel@vger.kernel.org, mingo@kernel.org, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, yury.norov@gmail.com, kprateek.nayak@amd.com, iii@linux.ibm.com
Cc: sshegde@linux.ibm.com, tglx@kernel.org, gregkh@linuxfoundation.org, pbonzini@redhat.com, seanjc@google.com, vschneid@redhat.com, huschle@linux.ibm.com, rostedt@goodmis.org, dietmar.eggemann@arm.com, mgorman@suse.de, bsegall@google.com, maddy@linux.ibm.com, srikar@linux.ibm.com, hdanton@sina.com, chleroy@kernel.org, vineeth@bitbyteword.org, frederic@kernel.org, arighi@nvidia.com, pauld@redhat.com, christian.loehle@arm.com, tj@kernel.org, tommaso.cucinotta@gmail.com, maz@kernel.org, rafael@kernel.org
Subject: [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
Date: Wed, 17 Jun 2026 23:11:21 +0530
Message-ID: <20260617174139.155540-3-sshegde@linux.ibm.com>


Add documentation for new cpumask called cpu_preferred_mask. This could
help users in understanding what this mask is and the concept behind it.

Document how to enable it and implementation aspects of it.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v3->v4:
- update docs to reflect preferred is subset of active.

 Documentation/scheduler/sched-arch.rst |   61 ++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 2 deletions(-)

--- linux-next.orig/Documentation/scheduler/sched-arch.rst
+++ linux-next/Documentation/scheduler/sched-arch.rst
@@ -6,7 +6,8 @@ CPU Scheduler implementation hints for a
 
 Context switch
 ==============
-1. Runqueue locking
+Runqueue locking
+
 By default, the switch_to arch function is called with the runqueue
 locked. This is usually not a problem unless switch_to may need to
 take the runqueue lock. This is usually due to a wake up operation in
@@ -62,11 +63,67 @@ Your cpu_idle routines need to obey the
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources.
+i.e sum of virtual CPU(vCPU) of all VM's is greater than number of physical
+CPUs(pCPU). Under such conditions when all or many VM's have high utilization,
+hypervisor won't be able to satisfy the CPU requirement and has to context
+switch within or across VM. i.e hypervisor need to preempt one vCPU to run
+another. This is called vCPU preemption. This is more expensive compared to
+task context switch within a vCPU.
+
+In such cases it is better that combined vCPU ask from all VM is reduced
+by not using some of the vCPUs. vCPUs where workload can be safely
+scheduled which won't increase any contention for pCPU are called as
+"Preferred CPUs".
+
+In most cases preferred CPUs will be same as active CPUs, when there is pCPU
+contention, Preferred CPUs will reduce based on the amount of steal time.
+When the pCPU contention goes away as indicated by steal time, Preferred CPUs
+will become same as active CPUs again. One has to enable the feature by
+writing 1 to /sys/kernel/debug/sched/steal_monitor/enable
+
+One of the design construct is preferred CPUs is always subset of active CPUs.
+With CONFIG_PREFERRED_CPU=n, it is same as active CPUs.
+
+For scheduling decisions such as wakeup, pushing the task etc, needs this
+CPU state info. This is maintained in cpu_preferred_mask.
+
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment provided it doesn't break user affinity.
+This is achieved by:
+
+1. Selecting a preferred CPU at wakeup.
+2. Push the task away from non-preferred CPU at tick.
+3. Only select preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+
+1. This feature is available under CONFIG_PREFERRED_CPU
+
+2. This feature works for FAIR class only.
+
+3. A task pinned, which can't be moved to preferred CPUs will continue
+   to run based on its affinity. But no load balancing happens
+
+4. If needed, steal time based governors/arch dependent method
+   could be used to cater to different types of cpu numbers.
+   Arch can do so by implementing its own hooks.
+
+5. Decision to use/not use is driven by kernel. Hence it shouldn't
+   break user affinities. One of the main reason why CPU hotplug
+   or Isolated cpuset partitions was not a solution.
 
 Possible arch/ problems
 =======================
 
 Possible arch problems I found (and either tried to fix or didn't):
 
-sparc - IRQs on at this point(?), change local_irq_save to _disable.
+sparc:
+      - IRQs on at this point(?), change local_irq_save to _disable.
       - TODO: needs secondary CPUs to disable preempt (See #1)


^ permalink raw reply

* Re: [PATCH v3 07/12] fs/resctrl: Add info/kernel_mode for kernel-mode policy introspection
From: Babu Moger @ 2026-06-18 19:16 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tony.luck, Dave.Martin, james.morse,
	tglx, bp, dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman
In-Reply-To: <2429a51a-92ad-4810-bee9-44bd6fba3443@intel.com>

Hi Reinette,

On 6/16/26 18:38, Reinette Chatre wrote:
> Hi Babu,
> 
> How should "introspection" as used in subject be interpreted? This just
> displays the supported and active kernel modes to user space, no?

Yes. Will change it.

> 
> On 4/30/26 4:24 PM, Babu Moger wrote:
>> There is no user-visible way today to see which kernel-mode CLOSID/RMID
>> policies the running kernel supports, which one is active, or which
>> resctrl group currently owns the kernel CLOSID/RMID.
> 
> Why should there be? This is a new feature being added in this series.
> No need to write this as a bugfix.
> 

Sure.

>>
>> Add a read-only top-level sysfs file, info/kernel_mode.  It emits one
>> line per mode advertised in resctrl_kcfg.kmode, in stable lowercase
>> spelling derived from enum resctrl_kernel_modes, e.g.:
> 
> All these changelogs feel so strange ... as though they are written by
> somebody who simultaneously has no and full knowledge of resctrl.
> These verbatim descriptions of what the code does is not necessary. Please
> start with why the patch is needed.

Sure. My bad. Will re-write it.

> 
>>
>>    [inherit_ctrl_and_mon:group=//]
> 
> This is unexpected. There should be no group associated with this default mode.
> This is how I interpreted our previous discussion ending:
> https://lore.kernel.org/lkml/6709398b-269d-47b5-9b41-084f410bb1a6@amd.com/

Ack.

> 
>>    global_assign_ctrl_inherit_mon_per_cpu:group=none
>>    global_assign_ctrl_assign_mon_per_cpu:group=none
>>
>> The effective policy (resctrl_kcfg.kmode_cur) is wrapped in square
> 
> (needs imperative - please check all changelogs)

Sure.

> 
>> brackets and its :group= suffix names the resctrl group currently
>> bound to the kernel CLOSID/RMID (resctrl_kcfg.k_rdtgrp), formatted as
>> <ctrl>/<mon>/ with empty components left blank.  Inactive modes are
>> reported as :group=none.
>>
>> rdtgroup_mutex is held while printing, matching other info/ show paths.
> 
> No need to describe details that can be seen from patch.

ok.

> 
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v3: New patch to handle the changed interface file info/kernel_mode.
>>      Changed the group name to "none" if kmode binding is not done.
>>      Reinette suggested "uninitialized". "none" seemed more relevent.
>> ---
>>   fs/resctrl/rdtgroup.c | 74 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 74 insertions(+)
>>
>> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
>> index a7bfc74897cc..9cdcfa64c4a2 100644
>> --- a/fs/resctrl/rdtgroup.c
>> +++ b/fs/resctrl/rdtgroup.c
>> @@ -988,6 +988,73 @@ static int rdt_last_cmd_status_show(struct kernfs_open_file *of,
>>   	return 0;
>>   }
>>   
>> +/* Sysfs lines for info/kernel_mode; indexed by &enum resctrl_kernel_modes */
>> +static const char * const resctrl_mode_str[] = {
>> +	[INHERIT_CTRL_AND_MON]			= "inherit_ctrl_and_mon",
>> +	[GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU] = "global_assign_ctrl_inherit_mon_per_cpu",
>> +	[GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU]	= "global_assign_ctrl_assign_mon_per_cpu",
> 
> Please make alignment consistent.
> 

Sure.

>> +};
>> +
>> +static_assert(ARRAY_SIZE(resctrl_mode_str) == RESCTRL_NUM_KERNEL_MODES);
>> +
>> +/**
>> + * resctrl_kernel_mode_show() - Enumerate supported and effective kernel-mode policies
> 
> "Enumerate" -> "Display"?

sure.

> 
>> + * @of: kernfs open file
>> + * @seq: output seq_file
>> + * @v: unused
>> + *
>> + * Emits one line per mode advertised in resctrl_kcfg.kmode (each mode is one
>> + * BIT(index) per &enum resctrl_kernel_modes).  Every line carries a
> 
> Above is clear from the code. Please instead describe what this means.

Sure.
> 
>> + * ":group=<name>" suffix:
>> + *
>> + *   - The effective policy (whose BIT matches resctrl_kcfg.kmode_cur) is
>> + *     wrapped in square brackets and <name> is the resctrl group that
>> + *     currently owns the kernel CLOSID/RMID (resctrl_kcfg.k_rdtgrp),
>> + *     formatted as "<ctrl>/<mon>/".  A component is left empty when it
>> + *     does not apply: an RDTCTRL_GROUP emits "<ctrl>//", an RDTMON_GROUP
>> + *     under the default control group emits "/<mon>/", and an RDTMON_GROUP
>> + *     under a named control group emits "<ctrl>/<mon>/".
>> + *
>> + *   - Other supported but inactive modes are emitted without brackets and
>> + *     <name> is reported as "none".
>> + *
>> + * Context: Called under rdtgroup_mutex like other resctrl sysfs show paths.
> 
> This does not look accurate since it is not called with mutex held but instead
> takes the mutex itself. Also no need to refer to what other code does.

ok.

> 
>> + */
>> +static int resctrl_kernel_mode_show(struct kernfs_open_file *of,
>> +				    struct seq_file *seq, void *v)
>> +{
>> +	struct rdtgroup *rdtgrp;
>> +	const char *ctrl, *mon;
>> +	int i;
>> +
>> +	mutex_lock(&rdtgroup_mutex);
>> +	for (i = 0; i < RESCTRL_NUM_KERNEL_MODES; i++) {
>> +		if (!(resctrl_kcfg.kmode & BIT(i)))
>> +			continue;
>> +
>> +		if (resctrl_kcfg.kmode_cur != BIT(i)) {
>> +			seq_printf(seq, "%s:group=none\n",
>> +				   resctrl_mode_str[i]);
>> +			continue;
>> +		}
>> +
>> +		rdtgrp = resctrl_kcfg.k_rdtgrp;
>> +		ctrl = "";
>> +		mon = "";
>> +		if (rdtgrp->type == RDTMON_GROUP) {
>> +			if (rdtgrp->mon.parent != &rdtgroup_default)
>> +				ctrl = rdtgrp->mon.parent->kn->name;
> 
> Isn't default group's kn->name is initialized correctly via
> rdtgroup_setup_root()->kernfs_create_root()->__kernfs_new_node(root, NULL, "", ...) ?

Yes. that is correct. I will remove the check.

> 
>> +			mon = rdtgrp->kn->name;
>> +		} else {
>> +			ctrl = rdtgrp->kn->name;
>> +		}
> 
> Can the names not just be initialized directly from kn->name?

Yes. I think so. But I need to know if this is a control group or mon 
group to make it generic. Let me see if I can optimize this section.

> 
> 
>> +		seq_printf(seq, "[%s:group=%s/%s/]\n",
>> +			   resctrl_mode_str[i], ctrl, mon);
> 
> This is not where I understood our discussion landed. I expected that the display will
> reflect what can/should be assigned in a mode. For example, mode "inherit_ctrl_and_mon"
> does not have an associated resource group and should thus not display one,

Correct.

> "global_assign_ctrl_inherit_mon_per_cpu" can only be assigned a control group and
> should thus not display a monitor group also.

Yes. True. In that case "mon" is empty. It will print correctly.  Let me 
see if I can optimize this section.

Thanks
Babu

^ permalink raw reply

* Re: [PATCH v3] kconfig: add optional warnings for changed input values
From: Nathan Chancellor @ 2026-06-18 18:48 UTC (permalink / raw)
  To: Masahiro Yamada, Nicolas Schier, Pengpeng Hou
  Cc: Jonathan Corbet, linux-kbuild, linux-doc, linux-kernel
In-Reply-To: <20260611060000.23858-1-pengpeng@iscas.ac.cn>

On Thu, 11 Jun 2026 14:00:00 +0800, Pengpeng Hou wrote:
> kconfig: add optional warnings for changed input values

Since v2 was nearly there over a month ago and it is a genuine quality
of life improvement hidden behind an off-by-default option, I am going
to take this for my late 7.2 Kbuild pull request.

Applied to

  https://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux.git kbuild-next-unstable

Thanks!

[1/1] kconfig: add optional warnings for changed input values
      https://git.kernel.org/kbuild/c/645323a7f4e55

Please look out for regression or issue reports or other follow up
comments, as they may result in the patch/series getting dropped or
reverted. Patches applied to an "unstable" branch are accepted pending
wider testing in -next and any post-commit review; they will generally
be moved to the main branch in a week if no issues are found.

Best regards,
-- 
Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH] kselftest docs: remove reference to obsolete/archived wiki
From: Brett Sheffield @ 2026-06-18 18:14 UTC (permalink / raw)
  To: Shuah Khan
  Cc: Rafael Passos, shuah, corbet, linux-kselftest, workflows,
	linux-doc, linux-kernel
In-Reply-To: <1306d609-3375-4f52-8239-b4c2fffb7bec@linuxfoundation.org>

On 2026-06-18 11:02, Shuah Khan wrote:
> My apologies  for not taking your patch earlier. Considering the effort
> you put in with a re-sending the patch and following up here, it is
> only fair for me to take yours instead. Hope it will apply cleanly on
> top of kselftest-next
> 
> Rafael, I am going to take Brett;s patch instead of yours.
> 
> Apologies to both of you for the mix up.

Thanks Shuah & no worries.


-- 
Brett Sheffield (he/him)

^ permalink raw reply

* Re: [PATCH v6 06/16] iio: core: create local __iio_chan_prefix_emit() for reuse
From: Andy Shevchenko @ 2026-06-18 18:14 UTC (permalink / raw)
  To: Rodrigo Alencar
  Cc: Nuno Sá, rodrigo.alencar, linux-iio, devicetree,
	linux-kernel, linux-doc, linux-hardening, Lars-Peter Clausen,
	Michael Hennerich, Jonathan Cameron, David Lechner,
	Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
	Gustavo A. R. Silva
In-Reply-To: <x3aijvc4buo7aqbchikuoyyrgiq3afidtkla37h2rg4tvfdbc3@h42qp3estg2s>

On Thu, Jun 18, 2026 at 05:14:19PM +0100, Rodrigo Alencar wrote:
> On 18/06/26 16:06, Nuno Sá wrote:
> > On Thu, Jun 18, 2026 at 02:27:22PM +0100, Rodrigo Alencar via B4 Relay wrote:

...

> > > +	dev_attr->attr.name = kasprintf(GFP_KERNEL, "%s%s", prefix, postfix);
> > > +	if (!dev_attr->attr.name)
> > >  		return -ENOMEM;
> > 
> > I don't oppose the change. Looks like a nice cleanup.

May I oppose it? I found use scnprintf() is harder to follow in comparison to
nice kasprintf() that takes care for the dynamically allocated buffer.

Also there is a chance to get a name silently cut due to insufficient space.
Besides that this function can't be used (again due to 'c') in kasprintf()-like
wrapper. I do not consider this as a good approach. Have you looked at seq_buf
instead?

> > But bear in mind this very sensible as any subtle mistake means ABI breakage.

Which immediately raises a question of test coverage. Do we have one? If not,
this code must be accompanied with one.

> Yes! I tried to be careful... this is dangerous stuff!

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* [PATCH v6 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
From: Abhishek Bapat @ 2026-06-18 17:36 UTC (permalink / raw)
  To: Suren Baghdasaryan, Andrew Morton, Kent Overstreet, Hao Ge
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Abhishek Bapat
In-Reply-To: <cover.1781803482.git.abhishekbapat@google.com>

Add the following 2 scenarios to the allocinfo ioctl kselftest:
1. Validate size based filtering
2. Validate lineno based filtering

The first test uses "do_init_module" as the candidate function for the
test. This is because the associated site will only allocate memory when
a kernel module is loaded. The return value of get_content_id() changes
every time modules are loaded or unloaded. Hence, as long as
get_content_id() values at the start and the end of the test are the
same, the memory allocated by the do_init_module call site should also
remain the same. Consequently, the test can assume consistency between
the value returned by the ioctl and the procfs resulting in less
flakiness.

Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
---
 .../alloc_tag/allocinfo_ioctl_test.c          | 198 +++++++++++++++++-
 1 file changed, 197 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
index 1ae0291f2245..50755a45d3fe 100644
--- a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
+++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2026 Google, Inc.
  */
 
+#include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -313,11 +314,194 @@ static int test_function_filter(void)
 	return run_filter_test(&filter);
 }
 
+static int test_size_filter(void)
+{
+	int fd;
+	struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
+	struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
+	struct allocinfo_filter filter;
+	int ret = KSFT_PASS;
+	__u64 target_size, i, pos;
+	bool found;
+	const char *target_function = "do_init_module";
+	struct allocinfo_content_id start_cont_id, end_cont_id;
+	int retry = 0;
+	const int max_retries = 10;
+
+	if (!tags || !procfs_entries) {
+		ksft_print_msg("Memory allocation failed.\n");
+		ret = KSFT_FAIL;
+		goto freemem;
+	}
+
+	fd = open(ALLOCINFO_PROC, O_RDONLY);
+	if (fd < 0) {
+		ksft_print_msg("Failed to open " ALLOCINFO_PROC ": %s\n", strerror(errno));
+		ret = KSFT_FAIL;
+		goto freemem;
+	}
+
+	do {
+		found = false;
+		pos = 0;
+
+		if (__allocinfo_get_content_id(fd, &start_cont_id)) {
+			ksft_print_msg("allocinfo_get_content_id failed\n");
+			ret = KSFT_FAIL;
+			goto exit;
+		}
+
+		memset(&filter, 0, sizeof(filter));
+		filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
+		strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
+
+		if (get_filtered_procfs_entries(procfs_entries, &filter)) {
+			ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
+			ret = KSFT_FAIL;
+			goto exit;
+		}
+
+		if (procfs_entries->count == 0) {
+			ksft_print_msg("Function %s not found in procfs\n", target_function);
+			ret = KSFT_SKIP;
+			goto exit;
+		}
+
+		target_size = procfs_entries->tag[0].counter.bytes;
+
+		memset(&filter, 0, sizeof(filter));
+		filter.mask |= ALLOCINFO_FILTER_MASK_MIN_SIZE | ALLOCINFO_FILTER_MASK_MAX_SIZE;
+		filter.min_size = target_size;
+		filter.max_size = target_size;
+
+		while (1) {
+			struct allocinfo_get_at get_at_params;
+
+			memset(&get_at_params, 0, sizeof(get_at_params));
+			memcpy(&get_at_params.filter, &filter, sizeof(filter));
+			get_at_params.pos = pos;
+
+			if (__allocinfo_get_at(fd, &get_at_params))
+				break;
+
+			tags->count = 0;
+			memcpy(&tags->tag[tags->count++], &get_at_params.data,
+			       sizeof(get_at_params.data));
+
+			while (tags->count < VEC_MAX_ENTRIES &&
+			       __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
+				tags->count++;
+
+			for (i = 0; i < tags->count; i++) {
+				if (strcmp(tags->tag[i].tag.function, target_function) == 0) {
+					found = true;
+					break;
+				}
+			}
+
+			if (found || tags->count < VEC_MAX_ENTRIES)
+				break;
+
+			pos += tags->count;
+		}
+
+		if (__allocinfo_get_content_id(fd, &end_cont_id)) {
+			ksft_print_msg("allocinfo_get_content_id failed\n");
+			ret = KSFT_FAIL;
+			goto exit;
+		}
+
+		if (start_cont_id.id == end_cont_id.id)
+			break;
+
+		ksft_print_msg("Module load detected during size verification, retrying...\n");
+	} while (retry++ < max_retries);
+
+	if (start_cont_id.id == end_cont_id.id && !found) {
+		ksft_print_msg("Entry with function %s not found in IOCTL results\n",
+			       target_function);
+		ret = KSFT_FAIL;
+	} else if (start_cont_id.id != end_cont_id.id) {
+		ksft_print_msg("Failed to match content_ids for procfs and IOCTL, skipping...\n");
+		ret = KSFT_SKIP;
+	}
+
+exit:
+	close(fd);
+freemem:
+	free(tags);
+	free(procfs_entries);
+	return ret;
+}
+
+static int test_lineno_filter(void)
+{
+	struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
+	struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
+	struct allocinfo_filter filter;
+	enum ioctl_ret ioctl_status;
+	int ret = KSFT_PASS;
+	__u64 target_lineno, i;
+
+	if (!tags || !procfs_entries) {
+		ksft_print_msg("Memory allocation failed.\n");
+		ret = KSFT_FAIL;
+		goto exit;
+	}
+
+	memset(&filter, 0, sizeof(filter));
+
+	if (get_filtered_procfs_entries(procfs_entries, &filter)) {
+		ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
+		ret = KSFT_FAIL;
+		goto exit;
+	}
+	if (procfs_entries->count == 0) {
+		ksft_print_msg("Could not retrieve procfs entries\n");
+		ret = KSFT_SKIP;
+		goto exit;
+	}
+	/*
+	 * We depend on the result of procfs entries to create the ioctl_filter. Hence we
+	 * cannot recycle the run_filter_test function here.
+	 */
+	target_lineno = procfs_entries->tag[0].tag.lineno;
+
+	filter.mask |= ALLOCINFO_FILTER_MASK_LINENO;
+	filter.fields.lineno = target_lineno;
+
+	ioctl_status = get_filtered_ioctl_entries(tags, &filter, 0);
+	if (ioctl_status == IOCTL_INVALID_DATA) {
+		ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
+		ret = KSFT_SKIP;
+		goto exit;
+	}
+	if (ioctl_status == IOCTL_FAILURE) {
+		ksft_print_msg("Error retrieving IOCTL entries.\n");
+		ret = KSFT_FAIL;
+		goto exit;
+	}
+
+	for (i = 0; i < tags->count; i++) {
+		if (tags->tag[i].tag.lineno != target_lineno) {
+			ksft_print_msg("IOCTL entry %llu has incorrect lineno %llu.\n",
+				       i, tags->tag[i].tag.lineno);
+			ret = KSFT_FAIL;
+			goto exit;
+		}
+	}
+
+exit:
+	free(tags);
+	free(procfs_entries);
+	return ret;
+}
+
 int main(int argc, char *argv[])
 {
 	int ret;
 
-	ksft_set_plan(2);
+	ksft_set_plan(4);
 
 	ret = test_filename_filter();
 	if (ret == KSFT_SKIP)
@@ -331,5 +515,17 @@ int main(int argc, char *argv[])
 	else
 		ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
 
+	ret = test_size_filter();
+	if (ret == KSFT_SKIP)
+		ksft_test_result_skip("Skipping test_size_filter\n");
+	else
+		ksft_test_result(ret == KSFT_PASS, "test_size_filter\n");
+
+	ret = test_lineno_filter();
+	if (ret == KSFT_SKIP)
+		ksft_test_result_skip("Skipping test_lineno_filter\n");
+	else
+		ksft_test_result(ret == KSFT_PASS, "test_lineno_filter\n");
+
 	ksft_finished();
 }
-- 
2.55.0.rc0.786.g65d90a0328-goog


^ permalink raw reply related

* [PATCH v6 5/6] kselftest: alloc_tag: add kselftest for ioctl interface
From: Abhishek Bapat @ 2026-06-18 17:36 UTC (permalink / raw)
  To: Suren Baghdasaryan, Andrew Morton, Kent Overstreet, Hao Ge
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Abhishek Bapat
In-Reply-To: <cover.1781803482.git.abhishekbapat@google.com>

Introduce a kselftest to verify the new IOCTL-based interface for
/proc/allocinfo. The test covers:

1. Validation of the filename filter.
2. Validation of the function filter.

The first test validates the functionality of the filename filter. Using
"mm/memory.c" as the candidate filename filter, it retrieves filtered
entries from both procfs and ioctl and matches the first VEC_MAX_ENTRIES
entries.

The second test validates the functionality of the function filter.
It uses "dup_mm" as the candidate function as we do not expect this
function name to change frequently and hence won't be needing to modify
this test often.

Note that both the tests match line no, function name and file name
fields. Bytes allocated and calls are not matched as those values may
change in the time when the data is being read from procfs and ioctl and
hence can lead to false negatives.

Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
---
 MAINTAINERS                                   |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/alloc_tag/Makefile    |   9 +
 .../alloc_tag/allocinfo_ioctl_test.c          | 335 ++++++++++++++++++
 4 files changed, 346 insertions(+)
 create mode 100644 tools/testing/selftests/alloc_tag/Makefile
 create mode 100644 tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 019cc4c285a3..6610dd42e484 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16715,6 +16715,7 @@ F:	include/linux/alloc_tag.h
 F:	include/linux/pgalloc_tag.h
 F:	include/uapi/linux/alloc_tag.h
 F:	lib/alloc_tag.c
+F:	tools/testing/selftests/alloc_tag/
 
 MEMORY CONTROLLER DRIVERS
 M:	Krzysztof Kozlowski <krzk@kernel.org>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..276a78c64736 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 TARGETS += acct
+TARGETS += alloc_tag
 TARGETS += alsa
 TARGETS += amd-pstate
 TARGETS += arm64
diff --git a/tools/testing/selftests/alloc_tag/Makefile b/tools/testing/selftests/alloc_tag/Makefile
new file mode 100644
index 000000000000..f2b8fc022c3b
--- /dev/null
+++ b/tools/testing/selftests/alloc_tag/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_GEN_PROGS := allocinfo_ioctl_test
+
+CFLAGS += -Wall
+CFLAGS += -I../../../../usr/include
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
new file mode 100644
index 000000000000..1ae0291f2245
--- /dev/null
+++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* kselftest for allocinfo ioctl
+ * allocinfo ioctl retrives allocinfo data through ioctl
+ * Copyright (C) 2026 Google, Inc.
+ */
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <linux/types.h>
+#include <linux/alloc_tag.h>
+#include "../kselftest.h"
+
+#define MAX_LINE_LEN		512
+#define ALLOCINFO_PROC		"/proc/allocinfo"
+
+enum ioctl_ret {
+	IOCTL_SUCCESS = 0,
+	IOCTL_FAILURE = 1,
+	IOCTL_INVALID_DATA = 2,
+};
+
+#define VEC_MAX_ENTRIES 32
+
+struct allocinfo_tag_data_vec {
+	struct allocinfo_tag_data tag[VEC_MAX_ENTRIES];
+	__u64 count;
+};
+
+static inline int __allocinfo_get_content_id(int dev_fd, struct allocinfo_content_id *params)
+{
+	return ioctl(dev_fd, ALLOCINFO_IOC_CONTENT_ID, params);
+}
+
+static inline int __allocinfo_get_at(int dev_fd, struct allocinfo_get_at *params)
+{
+	return ioctl(dev_fd, ALLOCINFO_IOC_GET_AT, params);
+}
+
+static inline int __allocinfo_get_next(int dev_fd, struct allocinfo_tag_data *params)
+{
+	return ioctl(dev_fd, ALLOCINFO_IOC_GET_NEXT, params);
+}
+
+static bool match_entry(const struct allocinfo_tag_data *procfs_entry,
+			const struct allocinfo_tag_data *tag_data,
+			bool match_bytes, bool match_calls, bool match_lineno,
+			bool match_function, bool match_filename)
+{
+	if (match_bytes && tag_data->counter.bytes != procfs_entry->counter.bytes) {
+		ksft_print_msg("size retrieved through ioctl does not match procfs\n");
+		return false;
+	}
+
+	if (match_calls && tag_data->counter.calls != procfs_entry->counter.calls) {
+		ksft_print_msg("call count retrieved through ioctl does not match procfs\n");
+		return false;
+	}
+
+	if (match_lineno && tag_data->tag.lineno != procfs_entry->tag.lineno) {
+		ksft_print_msg("lineno retrieved through ioctl does not match procfs\n");
+		return false;
+	}
+
+	if (match_function &&
+	    strncmp(tag_data->tag.function, procfs_entry->tag.function, ALLOCINFO_STR_SIZE)) {
+		ksft_print_msg("function retrieved through ioctl does not match procfs\n");
+		return false;
+	}
+
+	if (match_filename &&
+	    strncmp(tag_data->tag.filename, procfs_entry->tag.filename, ALLOCINFO_STR_SIZE)) {
+		ksft_print_msg("filename retrieved through ioctl does not match procfs\n");
+		return false;
+	}
+	return true;
+}
+
+static bool match_entries(const struct allocinfo_tag_data_vec *procfs_entries,
+			  const struct allocinfo_tag_data_vec *tags,
+			  bool match_bytes, bool match_calls, bool match_lineno,
+			  bool match_function, bool match_filename)
+{
+	__u64 i;
+
+	if (procfs_entries->count != tags->count) {
+		ksft_print_msg("Entry count mismatch. ioctl entries: %llu, proc entries: %llu\n",
+			       tags->count, procfs_entries->count);
+		return false;
+	}
+	for (i = 0; i < procfs_entries->count; i++) {
+		if (!match_entry(&procfs_entries->tag[i], &tags->tag[i],
+				 match_bytes, match_calls, match_lineno,
+				 match_function, match_filename)) {
+			ksft_print_msg("%lluth entry does not match.\n", i);
+			return false;
+		}
+	}
+	return true;
+}
+
+static const char *allocinfo_str(const char *str)
+{
+	size_t len = strlen(str);
+
+	if (len >= ALLOCINFO_STR_SIZE)
+		str += (len - ALLOCINFO_STR_SIZE) + 1;
+	return str;
+}
+
+static void allocinfo_copy_str(char *dest, const char *src)
+{
+	strncpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE - 1);
+	dest[ALLOCINFO_STR_SIZE - 1] = '\0';
+}
+
+static int get_filtered_procfs_entries(struct allocinfo_tag_data_vec *procfs_entries,
+				       const struct allocinfo_filter *filter)
+{
+	FILE *fp = fopen(ALLOCINFO_PROC, "r");
+	char line[MAX_LINE_LEN];
+	int matches;
+	struct allocinfo_tag_data procfs_entry;
+
+	if (!fp) {
+		ksft_print_msg("Failed to open " ALLOCINFO_PROC " for reading\n");
+		return 1;
+	}
+	memset(procfs_entries, 0, sizeof(*procfs_entries));
+	while (fgets(line, sizeof(line), fp) && procfs_entries->count < VEC_MAX_ENTRIES) {
+		char filename[MAX_LINE_LEN];
+		char function[MAX_LINE_LEN];
+
+		memset(&procfs_entry, 0, sizeof(procfs_entry));
+		matches = sscanf(line, "%llu %llu %[^:]:%llu func:%s",
+				 &procfs_entry.counter.bytes,
+				 &procfs_entry.counter.calls,
+				 filename,
+				 &procfs_entry.tag.lineno,
+				 function);
+
+		if (matches != 5)
+			continue;
+
+		allocinfo_copy_str(procfs_entry.tag.filename, filename);
+		allocinfo_copy_str(procfs_entry.tag.function, function);
+
+		if (filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) {
+			if (strncmp(procfs_entry.tag.filename,
+				    filter->fields.filename, ALLOCINFO_STR_SIZE))
+				continue;
+		}
+		if (filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) {
+			if (strncmp(procfs_entry.tag.function,
+				    filter->fields.function, ALLOCINFO_STR_SIZE))
+				continue;
+		}
+		if (filter->mask & ALLOCINFO_FILTER_MASK_LINENO) {
+			if (procfs_entry.tag.lineno != filter->fields.lineno)
+				continue;
+		}
+		if (filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) {
+			if (procfs_entry.counter.bytes < filter->min_size)
+				continue;
+		}
+		if (filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) {
+			if (procfs_entry.counter.bytes > filter->max_size)
+				continue;
+		}
+
+		memcpy(&procfs_entries->tag[procfs_entries->count++], &procfs_entry,
+		       sizeof(procfs_entry));
+	}
+	fclose(fp);
+	return 0;
+}
+
+static enum ioctl_ret get_filtered_ioctl_entries(struct allocinfo_tag_data_vec *tags,
+						 const struct allocinfo_filter *filter,
+						 __u64 start_pos)
+{
+	int fd = open(ALLOCINFO_PROC, O_RDONLY);
+
+	if (fd < 0) {
+		ksft_print_msg("Failed to open " ALLOCINFO_PROC " for IOCTL\n");
+		return IOCTL_FAILURE;
+	}
+
+	struct allocinfo_content_id start_cont_id, end_cont_id;
+	struct allocinfo_get_at get_at_params;
+	const int max_retries = 10;
+	int retry_count = 0;
+	int status;
+
+	/*
+	 * __allocinfo_get_content_id may return different values if a kernel module was loaded
+	 * between the two calls. If that happens, the data gathered cannot be considered consistent
+	 * and hence needs to be fetched again to avoid flakiness.
+	 */
+	do {
+		if (__allocinfo_get_content_id(fd, &start_cont_id)) {
+			ksft_print_msg("allocinfo_get_content_id failed\n");
+			status = IOCTL_FAILURE;
+			goto exit;
+		}
+
+		memset(tags, 0, sizeof(*tags));
+		memset(&get_at_params, 0, sizeof(get_at_params));
+		memcpy(&get_at_params.filter, filter, sizeof(*filter));
+		get_at_params.pos = start_pos;
+		if (__allocinfo_get_at(fd, &get_at_params)) {
+			ksft_print_msg("allocinfo_get_at failed\n");
+			status = IOCTL_FAILURE;
+			goto exit;
+		}
+		memcpy(&tags->tag[tags->count++], &get_at_params.data, sizeof(get_at_params.data));
+
+		while (tags->count < VEC_MAX_ENTRIES &&
+		       __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
+			tags->count++;
+
+		if (__allocinfo_get_content_id(fd, &end_cont_id)) {
+			ksft_print_msg("allocinfo_get_content_id failed\n");
+			status = IOCTL_FAILURE;
+			goto exit;
+		}
+
+		if (start_cont_id.id == end_cont_id.id) {
+			status = IOCTL_SUCCESS;
+		} else {
+			ksft_print_msg("allocinfo_get_content_id mismatch, retrying...\n");
+			status = IOCTL_INVALID_DATA;
+		}
+	} while (status == IOCTL_INVALID_DATA && retry_count++ < max_retries);
+
+exit:
+	close(fd);
+	return status;
+}
+
+static int run_filter_test(const struct allocinfo_filter *filter)
+{
+	struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
+	struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
+	int ioctl_status;
+	int ret = KSFT_PASS;
+
+	if (!tags || !procfs_entries) {
+		ksft_print_msg("Memory allocation failed.\n");
+		ret = KSFT_FAIL;
+		goto exit;
+	}
+
+	if (get_filtered_procfs_entries(procfs_entries, filter)) {
+		ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
+		ret = KSFT_SKIP;
+		goto exit;
+	}
+
+	if (procfs_entries->count == 0) {
+		ksft_print_msg("No entries found in " ALLOCINFO_PROC ", skipping test\n");
+		ret = KSFT_SKIP;
+		goto exit;
+	}
+
+	ioctl_status = get_filtered_ioctl_entries(tags, filter, 0);
+	if (ioctl_status == IOCTL_INVALID_DATA) {
+		ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
+		ret = KSFT_SKIP;
+		goto exit;
+	}
+	if (ioctl_status == IOCTL_FAILURE) {
+		ksft_print_msg("Error retrieving IOCTL entries.\n");
+		ret = KSFT_FAIL;
+		goto exit;
+	}
+
+	if (!match_entries(procfs_entries, tags, false, false, true, true, true))
+		ret = KSFT_FAIL;
+
+exit:
+	free(tags);
+	free(procfs_entries);
+	return ret;
+}
+
+static int test_filename_filter(void)
+{
+	struct allocinfo_filter filter;
+	const char *target_filename = "mm/memory.c";
+
+	memset(&filter, 0, sizeof(filter));
+	filter.mask |= ALLOCINFO_FILTER_MASK_FILENAME;
+	strncpy(filter.fields.filename, target_filename, ALLOCINFO_STR_SIZE);
+
+	return run_filter_test(&filter);
+}
+
+static int test_function_filter(void)
+{
+	struct allocinfo_filter filter;
+	const char *target_function = "dup_mm";
+
+	memset(&filter, 0, sizeof(filter));
+	filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
+	strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
+
+	return run_filter_test(&filter);
+}
+
+int main(int argc, char *argv[])
+{
+	int ret;
+
+	ksft_set_plan(2);
+
+	ret = test_filename_filter();
+	if (ret == KSFT_SKIP)
+		ksft_test_result_skip("Skipping test_filename_filter\n");
+	else
+		ksft_test_result(ret == KSFT_PASS, "test_filename_filter\n");
+
+	ret = test_function_filter();
+	if (ret == KSFT_SKIP)
+		ksft_test_result_skip("Skipping test_function_filter\n");
+	else
+		ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
+
+	ksft_finished();
+}
-- 
2.55.0.rc0.786.g65d90a0328-goog


^ permalink raw reply related

* [PATCH v6 4/6] alloc_tag: add accuracy based filtering to ioctl
From: Abhishek Bapat @ 2026-06-18 17:36 UTC (permalink / raw)
  To: Suren Baghdasaryan, Andrew Morton, Kent Overstreet, Hao Ge
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Abhishek Bapat
In-Reply-To: <cover.1781803482.git.abhishekbapat@google.com>

Extend the allocinfo filtering mechanism to allow users to filter tags
based on their accuracy.

Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
Acked-by: Hao Ge <hao.ge@linux.dev>
Acked-by: Suren Baghdasaryan <surenb@google.com>
---
 include/uapi/linux/alloc_tag.h | 4 ++++
 lib/alloc_tag.c                | 8 ++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
index 0de5fc180790..270f693b1822 100644
--- a/include/uapi/linux/alloc_tag.h
+++ b/include/uapi/linux/alloc_tag.h
@@ -31,6 +31,8 @@ struct allocinfo_tag {
 	char function[ALLOCINFO_STR_SIZE];
 	char filename[ALLOCINFO_STR_SIZE];
 	__u64 lineno;
+	/* filter criteria only; see allocinfo_counter.accurate for actual accuracy */
+	__u64 inaccurate;
 };
 
 /* The alignment ensures 32-bit compatible interfaces are not broken */
@@ -50,6 +52,7 @@ enum {
 	ALLOCINFO_FILTER_FUNCTION,
 	ALLOCINFO_FILTER_FILENAME,
 	ALLOCINFO_FILTER_LINENO,
+	ALLOCINFO_FILTER_INACCURATE,
 	ALLOCINFO_FILTER_MIN_SIZE,
 	ALLOCINFO_FILTER_MAX_SIZE,
 	__ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
@@ -59,6 +62,7 @@ enum {
 #define ALLOCINFO_FILTER_MASK_FUNCTION		(1 << ALLOCINFO_FILTER_FUNCTION)
 #define ALLOCINFO_FILTER_MASK_FILENAME		(1 << ALLOCINFO_FILTER_FILENAME)
 #define ALLOCINFO_FILTER_MASK_LINENO		(1 << ALLOCINFO_FILTER_LINENO)
+#define ALLOCINFO_FILTER_MASK_INACCURATE	(1 << ALLOCINFO_FILTER_INACCURATE)
 #define ALLOCINFO_FILTER_MASK_MIN_SIZE		(1 << ALLOCINFO_FILTER_MIN_SIZE)
 #define ALLOCINFO_FILTER_MASK_MAX_SIZE		(1 << ALLOCINFO_FILTER_MAX_SIZE)
 
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index ad33d63ef7b4..32ac0674d8bf 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -249,6 +249,8 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter,
 			   struct alloc_tag_counters *counters,
 			   bool *fetched_counters)
 {
+	bool inaccurate;
+
 	if (!filter || !filter->mask)
 		return true;
 
@@ -274,6 +276,12 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter,
 	    ct->lineno != filter->fields.lineno)
 		return false;
 
+	if (filter->mask & ALLOCINFO_FILTER_MASK_INACCURATE) {
+		inaccurate = !!(ct->flags & CODETAG_FLAG_INACCURATE);
+		if (inaccurate != !!(filter->fields.inaccurate))
+			return false;
+	}
+
 	if (filter->mask & (ALLOCINFO_FILTER_MASK_MIN_SIZE | ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
 		if (!*fetched_counters) {
 			*counters = allocinfo_prefetch_counters(ct);
-- 
2.55.0.rc0.786.g65d90a0328-goog


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox