Linux Documentation
* Re: [PATCH v4 00/31] Introduce SCMI Telemetry FS support
From: Christian Brauner @ 2026-06-17 12:58 UTC (permalink / raw)
  To: Cristian Marussi
  Cc: linux-kernel, linux-arm-kernel, arm-scmi, linux-fsdevel,
	linux-doc, sudeep.holla, james.quinlan, f.fainelli,
	vincent.guittot, etienne.carriere, peng.fan, michal.simek, d-gole,
	jic23, elif.topuz, lukasz.luba, philip.radford,
	souvik.chakravarty, leitao, kas, puranjay, usama.arif,
	kernel-team
In-Reply-To: <20260612223802.1337232-1-cristian.marussi@arm.com>

On Fri, Jun 12, 2026 at 11:37:30PM +0100, Cristian Marussi wrote:
> Hi all,
> 
> --------------------------------------------------------------------------------
> [TLDR Summary]
> This series introduces a new SCMI driver which uses a new Telemetry FS to expose
> and configure SCMI Telemetry Data Events retrieved from the platform SCMI FW
> at runtime. The patches carrying the new STLMFS Filesystem support are tagged
> with 'stlmfs'.
> --------------------------------------------------------------------------------
> 
> the upcoming SCMI v4.0 specification [0] introduces a new SCMI protocol
> dedicated to System Telemetry.
> 
> In a nutshell, the SCMI Telemetry protocol allows an agent to discover at
> runtime the set of Telemetry Data Events (DEs) available on a specific
> platform and provides the means to configure the set of DEs that a user is
> interested in, while reading them back using the collection method that
> is deemed most suitable for the use case at hand (among the various
> possible collection methods allowed by the SCMI specification).
> 
> Without delving into the gory details of the whole SCMI Telemetry protocol
> let's just say that the SCMI platform/server firmware advertises a number
> of Telemetry Data Events, each one identified by a 32bit unique ID, and an
> SCMI agent/client, like Linux, can discover them and read back at will the
> associated data value in a number of ways.
> Data collection is mainly intended to happen on demand via shared memory
> areas exposed by the platform firmware, discovered dynamically via SCMI
> Telemetry and accessed by Linux on-demand, but some DE can also be reported
> via SCMI Notifications asynchronous messages or via direct dedicated
> FastChannels (another kind of SCMI memory based access): all of this
> underlying mechanism is anyway hidden to the user since it is mediated by
> the kernel driver which will return the proper data value when queried.
> 
> Anyway, the set of well-known architected DE IDs defined by the spec is
> limited to a dozen IDs, which means that the vast majority of DE IDs are
> customizable per-platform: as a consequence, though, the same ID, say
> '0x1234', could represent completely different things on different systems.
> 
> Precise definitions and semantics of such custom Data Event IDs are outside
> the scope of the SCMI Telemetry specification and of this implementation:
> they are supposed to be provided using some kind of JSON-like description
> file that will have to be consumed by a userspace tool which would be
> finally in charge of making sense of the set of available DEs.
> 
> IOW, in turn, this means that even though the DEs enumerated via SCMI come
> with some sort of topological and qualitative description provided by the
> protocol (like unit of measurements, name, topology info etc), kernel-wise
> we CANNOT be completely sure of "what is what" without being fed back some
> sort of information about the DEs by the aforementioned userspace tool.
> 
> For these reasons, currently this series does NOT attempt to register any
> of these DEs with any of the usual in-kernel subsystems (like HWMON, IIO,
> PERF etc), simply because we cannot be sure which DE is suitable, or even
> desirable, for a given subsystem. This also means there are NO in-kernel
> users of these Telemetry data events as of now.
> 
> So, while we do not rule out feeding/registering some of the discovered
> DEs to/with some of the above-mentioned kernel subsystems in the future, as
> of now we have ONLY modeled a custom userspace API to make SCMI Telemetry
> available to userspace tools.
> 
> In deciding which kind of interface to expose SCMI Telemetry data to a
> user, this new SCMI Telemetry driver aims at satisfying 2 main reqs:
> 
>  - exposing an FS-based human-readable interface that can be used to
>    discover, configure and access our Telemetry data directly also from
>    the shell without special tools
> 
>  - exposing alternative machine-friendly, more-performant, binary
>    interfaces that can be used to avoid the overhead of multiple accesses
>    to the VFS and that can be more suitable to access with custom tools
> 
> In the initial RFC posted a few months ago [1], the above was achieved
> with a combination of a SysFS interface, for the human-readable side of
> the story, and a classic chardev/ioctl for the plain binary access.
> 
> Since V1, instead, we moved away from this combined approach, especially
> away from SysFS, for the following reasons:
> 
>  1. "Abusing SysFS": SysFS is a handy way to expose device related
>       properties in a common way, using a few common helpers built on
>       kernfs; this means, though, that unfortunately in our scenario I had
>       to generate a dummy simple device for EACH SCMI Telemetry DataEvent
>       that I got to discover at runtime and attach to them, all of the
>       properties I need.
>       This by itself seemed to me abusing the SysFS framework, but, even
>       ignoring this, the impact on the system when we have to deal with
>       hundreds or tens of thousands of DEs is significant.
>       In some test scenarios I ended up with 50k DE devices and half a million
>       related property files ... O_o
> 
>  2. "SysFS constraints": SysFS usage itself has its well-known constraints
>       and best practices, like the one-file/one-value rule, and due to the
>       fact that any virtual file with a complex structure or handling logic
>       is frowned upon, you can forget about IOCTLs and mmap'ing to provide
>       a more performant interface within SysFs, which is the reason why,
>       in the previous RFC, there was an additional alternative chardev
>       interface.
>       These latter limitations around the implementation of files with a
>       more complex semantic (i.e. with a broader set of file_operations)
>       derive from the underlying KernFS support, so KernFS is equally not
>       suitable as a building block for our implementation.
> 
>  3. "Chardev limitations": Given the nature of the protocol, the hybrid
>       approach employing character devices was itself problematic: first
>       of all because there is an upper limit on the number of chardevs we
>       can create, dictated by the range of available minor numbers, and
>       then because having to maintain 2 completely different interfaces
>       (FS + chardev) is painful.
> 
> As a final remark, please NOTE THAT all of this is supposed to be available
> in production systems across a number of heterogeneous platforms: for these
> reasons the easy choice, debugFS, is NOT an option here.
> 
> Due to the above reasoning, since V1 we opted for a new approach with the
> proposed interfaces now based on a full fledged, unified, virtual pseudo
> filesystem implemented from scratch, so that we can:
> 
>  - expose all the DE properties we like, as before with SysFS, but without
>    any of the constraints imposed by the usage of SysFS or kernfs.
> 
>  - easily expose additional alternative views of the same set of DEs
>    using symlinking capabilities (e.g. alternative topological view)
> 
>  - additionally expose a few alternative and more performant interfaces
>    by embedding in that same FS, a few special virtual files:
> 
>    + 'control': to issue IOCTLs for quicker discovery and on-demand access
>    		to data
>    + 'pipe' [TBD]: to provide a stream of events using a virtual
>    		   infinite-style file
>    + 'raw_<N>' [TBD]: to provide direct memory mapped access to the raw
>    		      SCMI Telemetry data from userspace

A filesystem driver for telemetry like this is really misguided. I think
shell access is really not an argument for adding a filesystem into the
kernel like this. That's just not appropriate justification to push
thousands and thousands of lines of code into the kernel.

You're building completely new infrastructure. The format is whatever it
is. If you stream it somehow, just add a binary that userspace can use to
consume or translate it. If you need a filesystem interface for
convenience, build it via FUSE on top of whatever streams that data and
get it out of the kernel's way.

You also buy into all kinds of really wonky properties. If you split the
data over multiple files you can never get a consistent snapshot across
those files.

Telemetry over a filesystem is just not a great idea. If you did it via
sysfs I really wouldn't care because the infrastructure already exists
and I couldn't be bothered if this grew yet another wart, but as a
separate massive hand-rolled pseudofs, no, I'm not seeing it.


* Re: [PATCH 00/19] init: discoverable root partitions, a.k.a. an omittable "root=" cmdline option
From: Christian Brauner @ 2026-06-17 12:41 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Jens Axboe, Davidlohr Bueso, Alexander Viro, Jan Kara,
	linux-kernel, linux-block, linux-efi, linux-fsdevel,
	Richard Henderson, Matt Turner, Magnus Lindholm, linux-alpha,
	Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Huacai Chen, WANG Xuerui, loongarch,
	Thomas Bogendoerfer, linux-mips, James E.J. Bottomley,
	Helge Deller, linux-parisc, Madhavan Srinivasan, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	linux-s390, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <20260615-discoverable-root_partitions-v1-0-39c78fac42e2@kernel.org>

On Mon, Jun 15, 2026 at 06:08:56PM +0200, Vincent Mailhol wrote:
> DPS [1] defines GPT partition type UUIDs for OS partitions and
> attributes that control whether such partitions should be
> automatically discovered. The specification states that:
> 
>   The OS can discover and mount the necessary file systems with a
>   non-existent or incomplete /etc/fstab file and without the root=
>   kernel command line option.
> 
> DPS is already implemented in systemd-gpt-auto-generator [2], which,
> when embedded in an initrd, indeed allows automatic detection of the
> root filesystem through its partition type UUID.
> 
> This series adds this discovery feature directly into the kernel so
> that people who are not using systemd or not using an initrd can still
> benefit from it. The implementation follows the same model as
> systemd-gpt-auto-generator:

I happen to co-maintain the DPS. It is userspace policy, and complex
userspace policy at that, and does not belong in the kernel.

This series also implements only a tiny portion of the spec, which deals
with far more complex concepts such as automatic partitioning during
installation, verity, LUKS, and containers. This is really not intended for
the kernel at all. I mean, it's great that this spec is being used but I
do not want this in the kernel just for the sake of auto-discovery.

The DPS is completely generic and can be implemented by tooling other
than systemd (util-linux implements it and so does refind iirc). I think
not wanting to use or build alternative userspace tooling for this is a
really weak argument for pushing this into the kernel.


* [PATCH 2/2] docs: nvme-multipath: document service-time I/O policy
From: Guixin Liu @ 2026-06-17 11:45 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jonathan Corbet, Shuah Khan
  Cc: linux-nvme, linux-doc
In-Reply-To: <20260617114602.2224074-1-kanie@linux.alibaba.com>

Add documentation for the service-time path selection policy, including
its algorithm overview, sysfs attributes (in_flight_bytes and
relative_throughput), and guidance on when to use it over queue-depth.

Document that setting relative_throughput to 0 makes the path a standby
that is only used when no path with a positive value is available.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 Documentation/admin-guide/nvme-multipath.rst | 31 ++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst
index 97ca1ccef459..2acfceaf3d65 100644
--- a/Documentation/admin-guide/nvme-multipath.rst
+++ b/Documentation/admin-guide/nvme-multipath.rst
@@ -24,8 +24,8 @@ Policies
 
 All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning
 that when an optimized path is available, it will be chosen over a non-optimized
-one. Current the NVMe multipath policies include numa(default), round-robin and
-queue-depth.
+one. Currently, the NVMe multipath policies include numa (default), round-robin,
+queue-depth and service-time.
 
 To set the desired policy (e.g., round-robin), use one of the following methods:
    1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy
@@ -70,3 +70,30 @@ When to use the queue-depth policy:
   1. High load with small I/Os: Effectively balances load across paths when
      the load is high, and I/O operations consist of small, relatively
      fixed-sized requests.
+
+
+Service-Time
+------------
+
+The service-time policy selects the path with the lowest estimated service time.
+It calculates service time as ``in_flight_bytes / relative_throughput`` for each
+path, preferring the path that would complete I/O fastest. Unlike queue-depth
+which counts requests regardless of size, service-time tracks actual bytes in
+flight, making it aware of I/O sizes.
+
+Each path exposes two sysfs attributes under
+``/sys/class/nvme/nvmeX/nvmeXcYnZ/``:
+
+  - ``in_flight_bytes`` (read-only): Current bytes in flight on this path.
+  - ``relative_throughput`` (read-write): Relative throughput weight for this
+    path, default 1. The valid range is 0-100. Set higher values for faster
+    paths. If set to 0, the path is not selected while other paths with
+    positive values are available.
+
+When to use the service-time policy:
+  1. Asymmetric Link Speeds: When paths have different bandwidths, set
+     ``relative_throughput`` proportionally (e.g., 2 for a link twice as fast)
+     to steer more traffic to faster paths.
+  2. Mixed I/O Sizes: When workloads mix small and large I/Os (e.g., 4K and
+     128K), service-time distributes load more accurately than queue-depth
+     because it accounts for actual bytes rather than request count.
-- 
2.43.7



* [PATCH 0/2] nvme: Introduce service-time iopolicy
From: Guixin Liu @ 2026-06-17 11:45 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jonathan Corbet, Shuah Khan
  Cc: linux-nvme, linux-doc

Hi all,

I developed a service-time iopolicy for NVMe native multipath. Please
review; all comments are welcome.

Guixin Liu (2):
  nvme-multipath: add service-time I/O policy
  docs: nvme-multipath: document service-time I/O policy

 Documentation/admin-guide/nvme-multipath.rst |  31 +++-
 drivers/nvme/host/multipath.c                | 165 ++++++++++++++++++-
 drivers/nvme/host/nvme.h                     |   6 +
 drivers/nvme/host/sysfs.c                    |   5 +-
 4 files changed, 202 insertions(+), 5 deletions(-)

-- 
2.43.7



* [PATCH 1/2] nvme-multipath: add service-time I/O policy
From: Guixin Liu @ 2026-06-17 11:45 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jonathan Corbet, Shuah Khan
  Cc: linux-nvme, linux-doc
In-Reply-To: <20260617114602.2224074-1-kanie@linux.alibaba.com>

Add a new "service-time" I/O policy for NVMe native multipath, adapted
from the DM multipath service-time path selector (dm-ps-service-time).

Unlike the existing "queue-depth" policy which only counts the number of
in-flight I/Os, service-time estimates each path's service time by
dividing the total in-flight I/O size by a configurable relative
throughput weight:

    service_time = in_flight_size / relative_throughput

This provides more accurate load balancing when I/O sizes vary
significantly across paths, and allows users to assign throughput
weights to paths with different performance characteristics.
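
To make the difference between counting requests and counting bytes
concrete, here is a minimal standalone C sketch. It is not the code from
this patch: struct path_stats, path_submit(), pick_by_queue_depth() and
pick_by_bytes() are hypothetical names used only for illustration, and
all paths are assumed to carry the same throughput weight.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-path counters; not the kernel's struct nvme_ctrl. */
struct path_stats {
	unsigned int nr_active;		/* in-flight requests (queue-depth view) */
	uint64_t in_flight_bytes;	/* in-flight bytes (service-time view) */
};

/* Account for one newly submitted request of the given size. */
static void path_submit(struct path_stats *p, uint32_t bytes)
{
	p->nr_active++;
	p->in_flight_bytes += bytes;
}

/* Index of the path with the fewest in-flight requests (n must be >= 1). */
static size_t pick_by_queue_depth(const struct path_stats *paths, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (paths[i].nr_active < paths[best].nr_active)
			best = i;
	return best;
}

/* Index of the path with the fewest in-flight bytes (n must be >= 1). */
static size_t pick_by_bytes(const struct path_stats *paths, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (paths[i].in_flight_bytes < paths[best].in_flight_bytes)
			best = i;
	return best;
}
```

With two 128 KiB requests in flight on path 0 and eight 4 KiB requests on
path 1, pick_by_queue_depth() selects path 0 (2 requests versus 8), while
pick_by_bytes() selects path 1 (32 KiB versus 256 KiB).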

The comparison algorithm is directly adapted from dm-ps-service-time,
using cross-multiplication to avoid division and includes overflow
protection by shifting down when in-flight sizes are large.

Per-controller state:
  - in_flight_size: total bytes of in-flight I/Os (atomic64_t)
  - relative_throughput: configurable weight 0-100, default 1

Sysfs interface (per path device):
  - in_flight_bytes (ro): current in-flight byte count
  - relative_throughput (rw): path throughput weight

Paths with relative_throughput set to 0 are not selected while other
paths with positive values are available, matching DM service-time
semantics.

Usage:
  echo service-time > /sys/module/nvme_core/parameters/iopolicy
  echo 4 > /sys/block/nvmeXcYnZ/relative_throughput

Test environment:
  Two QEMU/KVM VMs connected via NVMe-oF TCP, 32 vCPU / 32G RAM each.
  Target exports a 1G RAM disk (/dev/ram0) via nvmet-tcp subsystem
  "nqn.test.nvme-st" on two ports (192.168.1.20:4420 and
  192.168.2.20:4420). Initiator connects both paths to form a native
  NVMe multipath device. VM interconnects use QEMU socket networking
  (point-to-point virtual Ethernet).

  Asymmetric bandwidth is created using tc-tbf on the target side:
    Path 1 (enp0s3): no traffic shaping (measured ~237 MiB/s)
    Path 2 (enp0s4): tc qdisc add dev enp0s4 root tbf rate 500mbit
                     burst 128kb latency 5ms (measured ~56 MiB/s)
  Bandwidth ratio is approximately 4:1.

Test 1 - Uniform 4k random read (iodepth=64, numjobs=4, 15s):

  iopolicy              IOPS   BW(MiB/s)  avg_lat(us)  lat_stddev(us)
  --------------------  -----  ---------  -----------  --------------
  round-robin           59.1k  231        4316         3109
  queue-depth           59.0k  230        4323         3046
  service-time 1:1      59.3k  232        4302         3057
  service-time 4:1      61.0k  238        4179         1294

  With uniform I/O size, service-time 1:1 performs similarly to
  queue-depth since per-byte and per-request tracking are equivalent
  for fixed-size I/Os. service-time 4:1 with correct throughput
  weights improves IOPS by 3.4% and reduces latency stddev by 57%.

Test 2 - Mixed random read 4k/128k (bssplit=4k/50:128k/50,
         iodepth=64, numjobs=4, 15s):

  iopolicy              IOPS  BW(MiB/s)  avg_lat(us)  lat_stddev(us)
  --------------------  ----  ---------  -----------  --------------
  round-robin           7689  276        33209        46817
  queue-depth           7642  276        33429        45868
  service-time 1:1      7456  269        34279        47433
  service-time 4:1      7747  277        32998        27102

  With mixed I/O sizes on asymmetric paths, queue-depth treats 128K
  and 4K requests identically when counting in-flight depth, leading
  to suboptimal load distribution. service-time tracks actual bytes
  in flight, producing significantly more consistent tail latencies.
  service-time 4:1 vs queue-depth: latency stddev -41%.

Standby path (relative_throughput=0):
  Setting a path's relative_throughput to 0 correctly excludes it
  from selection. With the slow path set to 0, all I/O is directed
  to the fast path only (IO split 118052:0), verified via diskstats.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 drivers/nvme/host/multipath.c | 165 +++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h      |   6 ++
 drivers/nvme/host/sysfs.c     |   5 +-
 3 files changed, 173 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac0..81fff2f20d23 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -69,6 +69,7 @@ static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
 	[NVME_IOPOLICY_QD]      = "queue-depth",
+	[NVME_IOPOLICY_ST]	= "service-time",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -83,6 +84,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_RR;
 	else if (!strncmp(val, "queue-depth", 11))
 		iopolicy = NVME_IOPOLICY_QD;
+	else if (!strncmp(val, "service-time", 12))
+		iopolicy = NVME_IOPOLICY_ST;
 	else
 		return -EINVAL;
 
@@ -97,7 +100,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
-	"Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
+	"Default multipath I/O policy; 'numa' (default), 'round-robin', 'queue-depth' or 'service-time'");
 
 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
@@ -168,11 +171,16 @@ void nvme_mpath_start_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
 	struct gendisk *disk = ns->head->disk;
+	int iopolicy = READ_ONCE(ns->head->subsys->iopolicy);
 
-	if ((READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) &&
+	if (iopolicy == NVME_IOPOLICY_QD &&
 	    !(nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)) {
 		atomic_inc(&ns->ctrl->nr_active);
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
+	} else if (iopolicy == NVME_IOPOLICY_ST &&
+		   !(nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE_BYTES)) {
+		atomic64_add(blk_rq_bytes(rq), &ns->ctrl->in_flight_size);
+		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE_BYTES;
 	}
 
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
@@ -191,6 +199,10 @@ void nvme_mpath_end_request(struct request *rq)
 
 	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
 		atomic_dec_if_positive(&ns->ctrl->nr_active);
+	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE_BYTES) {
+		atomic64_sub(blk_rq_bytes(rq), &ns->ctrl->in_flight_size);
+		nvme_req(rq)->flags &= ~NVME_MPATH_CNT_ACTIVE_BYTES;
+	}
 
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
@@ -427,6 +439,109 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
 	return best_opt ? best_opt : best_nonopt;
 }
 
+#define ST_MAX_RELATIVE_THROUGHPUT	100
+#define ST_MAX_RELATIVE_THROUGHPUT_SHIFT	7
+#define ST_MAX_INFLIGHT_SIZE \
+	((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT)
+
+/*
+ * Compare estimated service time of two paths.
+ * Returns negative if ns1 is better, positive if ns2 is better, 0 if equal.
+ *
+ * Service time = (in_flight_size + incoming) / relative_throughput
+ * Cross-multiply to avoid division.
+ */
+static int nvme_st_compare_load(struct nvme_ns *ns1, struct nvme_ns *ns2,
+				size_t incoming)
+{
+	size_t sz1, sz2, st1, st2;
+	unsigned int tp1 = ns1->ctrl->relative_throughput;
+	unsigned int tp2 = ns2->ctrl->relative_throughput;
+
+	sz1 = atomic64_read(&ns1->ctrl->in_flight_size);
+	sz2 = atomic64_read(&ns2->ctrl->in_flight_size);
+
+	/* Case 1: same throughput — compare load directly */
+	if (tp1 == tp2)
+		return (sz1 > sz2) - (sz1 < sz2);
+
+	/*
+	 * Case 2a: same load — prefer higher throughput.
+	 * Case 2b: one path has zero throughput — prefer the other.
+	 */
+	if (sz1 == sz2 || !tp1 || !tp2)
+		return tp2 - tp1;
+
+	/*
+	 * Case 3: general comparison via cross-multiplication.
+	 *   st1 = (sz1 + incoming) / tp1
+	 *   st2 = (sz2 + incoming) / tp2
+	 * Equivalent (since tp > 0):
+	 *   (sz1 + incoming) * tp2 <=> (sz2 + incoming) * tp1
+	 */
+	sz1 += incoming;
+	sz2 += incoming;
+	if (unlikely(sz1 >= ST_MAX_INFLIGHT_SIZE ||
+		     sz2 >= ST_MAX_INFLIGHT_SIZE)) {
+		sz1 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
+		sz2 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
+	}
+	st1 = sz1 * tp2;
+	st2 = sz2 * tp1;
+	if (st1 != st2)
+		return (st1 > st2) - (st1 < st2);
+
+	/* Case 4: equal service time — prefer higher throughput */
+	return tp2 - tp1;
+}
+
+static struct nvme_ns *nvme_service_time_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	struct nvme_ns *fallback_opt = NULL, *fallback_nonopt = NULL;
+
+	list_for_each_entry_srcu(ns, &head->list, siblings,
+				 srcu_read_lock_held(&head->srcu)) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		/*
+		 * Paths with relative_throughput == 0 are only used when no
+		 * path with a positive value is available (matching DM
+		 * service-time semantics).
+		 */
+		if (!ns->ctrl->relative_throughput) {
+			if (ns->ana_state == NVME_ANA_OPTIMIZED && !fallback_opt)
+				fallback_opt = ns;
+			else if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
+				 !fallback_nonopt)
+				fallback_nonopt = ns;
+			continue;
+		}
+
+		switch (ns->ana_state) {
+		case NVME_ANA_OPTIMIZED:
+			if (!best_opt ||
+			    nvme_st_compare_load(ns, best_opt, 0) < 0)
+				best_opt = ns;
+			break;
+		case NVME_ANA_NONOPTIMIZED:
+			if (!best_nonopt ||
+			    nvme_st_compare_load(ns, best_nonopt, 0) < 0)
+				best_nonopt = ns;
+			break;
+		default:
+			break;
+		}
+	}
+
+	if (best_opt)
+		return best_opt;
+	if (best_nonopt)
+		return best_nonopt;
+	return fallback_opt ? fallback_opt : fallback_nonopt;
+}
+
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
@@ -453,6 +568,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 		return nvme_queue_depth_path(head);
 	case NVME_IOPOLICY_RR:
 		return nvme_round_robin_path(head);
+	case NVME_IOPOLICY_ST:
+		return nvme_service_time_path(head);
 	default:
 		return nvme_numa_path(head);
 	}
@@ -1081,6 +1198,47 @@ static ssize_t queue_depth_show(struct device *dev,
 }
 DEVICE_ATTR_RO(queue_depth);
 
+static ssize_t in_flight_bytes_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+	if (ns->head->subsys->iopolicy != NVME_IOPOLICY_ST)
+		return 0;
+
+	return sysfs_emit(buf, "%lld\n", atomic64_read(&ns->ctrl->in_flight_size));
+}
+DEVICE_ATTR_RO(in_flight_bytes);
+
+static ssize_t relative_throughput_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+	if (ns->head->subsys->iopolicy != NVME_IOPOLICY_ST)
+		return 0;
+
+	return sysfs_emit(buf, "%u\n", ns->ctrl->relative_throughput);
+}
+
+static ssize_t relative_throughput_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	unsigned int val;
+	int ret;
+
+	ret = kstrtouint(buf, 0, &val);
+	if (ret < 0)
+		return ret;
+	if (val > ST_MAX_RELATIVE_THROUGHPUT)
+		return -EINVAL;
+
+	ns->ctrl->relative_throughput = val;
+	return count;
+}
+DEVICE_ATTR_RW(relative_throughput);
+
 static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
@@ -1341,6 +1499,9 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 
 	/* initialize this in the identify path to cover controller resets */
 	atomic_set(&ctrl->nr_active, 0);
+	atomic64_set(&ctrl->in_flight_size, 0);
+	if (!ctrl->relative_throughput)
+		ctrl->relative_throughput = 1;
 
 	if (!ctrl->max_namespaces ||
 	    ctrl->max_namespaces > le32_to_cpu(id->nn)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ccd5e05dac98..2b2627e0d3ce 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -261,6 +261,7 @@ enum {
 	NVME_REQ_USERCMD		= (1 << 1),
 	NVME_MPATH_IO_STATS		= (1 << 2),
 	NVME_MPATH_CNT_ACTIVE		= (1 << 3),
+	NVME_MPATH_CNT_ACTIVE_BYTES	= (1 << 4),
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
@@ -426,6 +427,8 @@ struct nvme_ctrl {
 	struct timer_list anatt_timer;
 	struct work_struct ana_work;
 	atomic_t nr_active;
+	atomic64_t in_flight_size;
+	u8 relative_throughput;
 #endif
 
 #ifdef CONFIG_NVME_HOST_AUTH
@@ -477,6 +480,7 @@ enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
 	NVME_IOPOLICY_QD,
+	NVME_IOPOLICY_ST,
 };
 
 struct nvme_subsystem {
@@ -1059,6 +1063,8 @@ extern bool multipath;
 extern struct device_attribute dev_attr_ana_grpid;
 extern struct device_attribute dev_attr_ana_state;
 extern struct device_attribute dev_attr_queue_depth;
+extern struct device_attribute dev_attr_in_flight_bytes;
+extern struct device_attribute dev_attr_relative_throughput;
 extern struct device_attribute dev_attr_numa_nodes;
 extern struct device_attribute dev_attr_delayed_removal_secs;
 extern struct device_attribute subsys_attr_iopolicy;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index e59758616f27..6309af224c93 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -259,6 +259,8 @@ static struct attribute *nvme_ns_attrs[] = {
 	&dev_attr_ana_grpid.attr,
 	&dev_attr_ana_state.attr,
 	&dev_attr_queue_depth.attr,
+	&dev_attr_in_flight_bytes.attr,
+	&dev_attr_relative_throughput.attr,
 	&dev_attr_numa_nodes.attr,
 	&dev_attr_delayed_removal_secs.attr,
 #endif
@@ -293,7 +295,8 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 		if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
 			return 0;
 	}
-	if (a == &dev_attr_queue_depth.attr || a == &dev_attr_numa_nodes.attr) {
+	if (a == &dev_attr_queue_depth.attr || a == &dev_attr_in_flight_bytes.attr ||
+	    a == &dev_attr_relative_throughput.attr || a == &dev_attr_numa_nodes.attr) {
 		if (nvme_disk_is_ns_head(dev_to_disk(dev)))
 			return 0;
 	}
-- 
2.43.7



* Re: [PATCH net-next] docs: exclude driver and netdevsim bugs
From: Leon Romanovsky @ 2026-06-17 11:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, johannes,
	corbet, skhan, workflows, linux-doc
In-Reply-To: <20260615091909.78ad2b03@kernel.org>

On Mon, Jun 15, 2026 at 09:19:09AM -0700, Jakub Kicinski wrote:
> On Mon, 15 Jun 2026 12:14:36 +0300 Leon Romanovsky wrote:
> > > +Unless explicitly excluded all bug fixes should be targeting the ``net``
> > > +tree and contain an appropriate Fixes tag.
> > > +
> > > +Obvious exclusions:
> > > +
> > > + - fixes for bugs which only exist in ``net-next`` should target ``net-next``
> > > +   (please still include the Fixes tag in the commit message)
> > > + - bugs which cannot be reached, e.g. in code paths not executed given
> > > +   current in-tree callers
> > > + - fixes for compiler warnings and typos  
> > 
> > If you decide to resubmit this patch, could you please remove "fixes for
> > compiler warnings" from the exclusion list?
> > 
> > It is quite frustrating to receive a compiler warning originating from a
> > different subsystem after the merge window, knowing it will not be
> > addressed until the next merge window (around eight weeks later).
> 
> Agreed, FWIW, but not planning to resubmit.
> I think people misunderstood that I'm __documenting what I already do__
> rather than trying to have a discussion :/

I'm pretty sure that people aren't aware of it.

Thanks


* Re: [PATCH v4] drm/xe/hwmon: document DG2 fan speed reporting quirk
From: Vivi, Rodrigo @ 2026-06-17 11:02 UTC (permalink / raw)
  To: Jadav, Raag, zhanwei919@gmail.com
  Cc: intel-xe@lists.freedesktop.org, corbet@lwn.net,
	dri-devel@lists.freedesktop.org, Brost, Matthew,
	thomas.hellstrom@linux.intel.com, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, skhan@linuxfoundation.org
In-Reply-To: <ajJ86P9MvLmtbPpp@black.igk.intel.com>

On Wed, 2026-06-17 at 12:54 +0200, Raag Jadav wrote:
> On Wed, Jun 03, 2026 at 12:17:07AM +0800, Zhan Wei wrote:
> > On DG2 the driver always shows two fan channels, because the
> > FSC_READ_NUM_FANS command does not work on some cards. OEMs decide
> > how
> > the fans map to tach channels, so two fans can share one tach line.
> > When that happens, the second channel reads 0 RPM even though the
> > fan
> > is spinning.
> > 
> > Note this on the fan2_input ABI entry so the steady 0 RPM is not
> > mistaken for a driver bug.
> > 
> > Fixes: 28f79ac609de ("drm/xe/hwmon: expose fan speed")
> > Signed-off-by: Zhan Wei <zhanwei919@gmail.com>
> > Reviewed-by: Raag Jadav <raag.jadav@intel.com>
> 
> This one seems to have gotten lost in the noise. Any takers?

Pushed, thanks for the patch, the review, and the heads-up.

> 
> Raag

* Re: [PATCH v4] drm/xe/hwmon: document DG2 fan speed reporting quirk
From: Raag Jadav @ 2026-06-17 10:54 UTC (permalink / raw)
  To: Zhan Wei
  Cc: matthew.brost, thomas.hellstrom, rodrigo.vivi, corbet, skhan,
	intel-xe, dri-devel, linux-doc, linux-kernel
In-Reply-To: <20260602161707.18922-1-zhanwei919@gmail.com>

On Wed, Jun 03, 2026 at 12:17:07AM +0800, Zhan Wei wrote:
> On DG2 the driver always shows two fan channels, because the
> FSC_READ_NUM_FANS command does not work on some cards. OEMs decide how
> the fans map to tach channels, so two fans can share one tach line.
> When that happens, the second channel reads 0 RPM even though the fan
> is spinning.
> 
> Note this on the fan2_input ABI entry so the steady 0 RPM is not
> mistaken for a driver bug.
> 
> Fixes: 28f79ac609de ("drm/xe/hwmon: expose fan speed")
> Signed-off-by: Zhan Wei <zhanwei919@gmail.com>
> Reviewed-by: Raag Jadav <raag.jadav@intel.com>

This one seems to have gotten lost in the noise. Any takers?

Raag

* Re: [PATCH 1/3] PCI: rcar-gen4: Configure AXIINTC if iMSI-RX not used
From: Manivannan Sadhasivam @ 2026-06-17 10:33 UTC (permalink / raw)
  To: Marek Vasut
  Cc: linux-pci, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Marc Zyngier, Rob Herring,
	devicetree, linux-arm-kernel, linux-doc, linux-kernel,
	linux-renesas-soc
In-Reply-To: <20260617030008.154449-1-marek.vasut+renesas@mailbox.org>

On Wed, Jun 17, 2026 at 04:59:44AM +0200, Marek Vasut wrote:
> In case MSI are enabled, but DWC built-in iMSI-RX is not in use, the
> MSI are handled via GIC ITS. Configure all controller MSI registers
> fully.
> 
> Set or clear MSI capability register MSICAP0 MSI enable MSIE bit and
> PCIe Interrupt Status 0 Enable register PCIEINTSTS0EN MSI interrupt
> enable MSI_CTRL_INT bit according to MSI enable state, set both bits
> if MSI are enabled, clear both bits if MSI are disabled.
> 
> If MSI are disabled, or MSI are enabled and iMSI-RX is used, then
> deconfigure AXIINTCADDR and AXIINTCCONT to 0, which disables any
> pass through of MSI TLPs onto the AXI bus and then further into
> GIC ITS translation registers.
> 
> If MSI are enabled and iMSI-RX is not used, then configure AXIINTCADDR
> with target address of GIC ITS translation registers, and configure
> AXIINTCCONT to enable MSI TLP pass through onto AXI bus and into the
> GIC ITS. This specific configuration allows handling of MSI via the
> GIC ITS instead of integrated iMSI-RX.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
> ---
> NOTE: This would not be possible without prior work from Shimoda-san
> ---
> Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Conor Dooley <conor+dt@kernel.org>
> Cc: Geert Uytterhoeven <geert+renesas@glider.be>
> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Cc: Manivannan Sadhasivam <mani@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Rob Herring <robh@kernel.org>
> Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Cc: devicetree@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Cc: linux-renesas-soc@vger.kernel.org
> ---
>  drivers/pci/controller/dwc/pcie-rcar-gen4.c | 53 +++++++++++++++++++--
>  1 file changed, 48 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-rcar-gen4.c b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> index 485cfa8bd9692..ba6e3bedd6d0a 100644
> --- a/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> +++ b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> @@ -31,6 +31,10 @@
>  #define DEVICE_TYPE_RC		BIT(4)
>  #define BIFUR_MOD_SET_ON	BIT(0)
>  
> +/* MSI Capability */
> +#define MSICAP0			0x0050
> +#define MSICAP0_MSIE		BIT(16)
> +
>  /* PCIe Interrupt Status 0 */
>  #define PCIEINTSTS0		0x0084
>  
> @@ -55,6 +59,16 @@
>  #define APP_HOLD_PHY_RST	BIT(16)
>  #define APP_LTSSM_ENABLE	BIT(0)
>  
> +/* INTC address */
> +#define AXIINTCADDR		0x0a00
> +/* GITS GIC ITS translation register */
> +#define AXIINTCADDR_VAL		0xf1050000

As Marc pointed out, this address should be fetched from DT, not hardcoded in
the driver.

> +
> +/* INTC control & mask */
> +#define AXIINTCCONT		0x0a04
> +#define INTC_EN			BIT(31)
> +#define INTC_MASK		GENMASK(11, 2)
> +
>  /* PCIe Power Management Control */
>  #define PCIEPWRMNGCTRL		0x0070
>  #define APP_CLK_REQ_N		BIT(11)
> @@ -305,6 +319,39 @@ static struct rcar_gen4_pcie *rcar_gen4_pcie_alloc(struct platform_device *pdev)
>  	return rcar;
>  }
>  
> +static void rcar_gen4_pcie_host_msi_init(struct dw_pcie_rp *pp)
> +{
> +	struct dw_pcie *dw = to_dw_pcie_from_pp(pp);
> +	struct rcar_gen4_pcie *rcar = to_rcar_gen4_pcie(dw);
> +	u32 val;
> +
> +	/* Make sure MSICAP0 MSIE is configured. */
> +	val = dw_pcie_readl_dbi(dw, MSICAP0);
> +	if (pci_msi_enabled())
> +		val |= MSICAP0_MSIE;
> +	else
> +		val &= ~MSICAP0_MSIE;
> +	dw_pcie_writel_dbi(dw, MSICAP0, val);
> +
> +	if (!pci_msi_enabled() || pp->use_imsi_rx) {

If MSI is not enabled, then what's the point of clearing these registers (also
above)? It looks like redundant code to me. Is it necessary to clear them?

- Mani

-- 
மணிவண்ணன் சதாசிவம்

* Re: [PATCH 1/7] dt-bindings: adm1275: ROHM BD12780 hot-swap controller
From: Krzysztof Kozlowski @ 2026-06-17 10:28 UTC (permalink / raw)
  To: Matti Vaittinen
  Cc: Matti Vaittinen, Matti Vaittinen, Guenter Roeck, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan,
	Wensheng Wang, Ashish Yadav, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <d63c4df5e9df845bc4f94b4abdcd068a23929974.1781591132.git.mazziesaccount@gmail.com>

On Tue, Jun 16, 2026 at 09:35:35AM +0300, Matti Vaittinen wrote:
> +
> +  Datasheets:
> +    https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780muv-lb-e.pdf
> +    https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780amuv-lb-e.pdf
> +
>  properties:
>    compatible:
> -    enum:
> -      - adi,adm1075
> -      - adi,adm1272
> -      - adi,adm1273
> -      - adi,adm1275
> -      - adi,adm1276
> -      - adi,adm1278
> -      - adi,adm1281
> -      - adi,adm1293
> -      - adi,adm1294
> -      - silergy,mc09c
> +    oneOf:
> +      - items:
> +          enum:


s/items/enum/, so:

oneOf:
  - enum:
  ....


> +            - adi,adm1075
> +            - adi,adm1272
> +            - adi,adm1273
> +            - adi,adm1275
> +            - adi,adm1276
> +            - adi,adm1278
> +            - adi,adm1281
> +            - adi,adm1293
> +            - adi,adm1294
> +            - rohm,bd12780
> +            - silergy,mc09c
> +
> +# Require BD12780 as a fall-back for BD12780A.

No need for the comment; the schema is quite explicit.

> +      - items:
> +          - const: rohm,bd12780a
> +          - const: rohm,bd12780

Best regards,
Krzysztof


* Re: [PATCH] drm/doc: recommend forking drm/kernel rather than uploading a distinct copy
From: Vignesh Raman @ 2026-06-17 10:26 UTC (permalink / raw)
  To: Eric Engestrom, Helen Koike, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
	dri-devel, linux-doc, linux-kernel
In-Reply-To: <b7f86ada-a74d-4fb2-83d2-5b4ef18e00c4@collabora.com>

Hi,

On 20/02/26 11:08, Vignesh Raman wrote:
> Hi Eric,
> 
> On 19/02/26 19:26, Eric Engestrom wrote:
>> Signed-off-by: Eric Engestrom <eric@engestrom.ch>
>> ---
>>   Documentation/gpu/automated_testing.rst | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git ./Documentation/gpu/automated_testing.rst ./Documentation/ 
>> gpu/automated_testing.rst
>> index 62aa3ede02a5df3f590b..8a7328aef10ef39ee329 100644
>> --- ./Documentation/gpu/automated_testing.rst
>> +++ ./Documentation/gpu/automated_testing.rst
>> @@ -99,7 +99,8 @@ How to enable automated testing on your tree
>>   ============================================
>>   1. Create a Linux tree in https://gitlab.freedesktop.org/ if you 
>> don't have one
>> -yet
>> +yet, by forking https://gitlab.freedesktop.org/drm/kernel (this 
>> allows GitLab
>> +to internally track that these are the same git objects).
> 
> Reviewed-by: Vignesh Raman <vignesh.raman@collabora.com>

Applied to drm-misc-next

Thanks.

> 
> Regards,
> Vignesh
> 
>>   2. In your kernel repo's configuration (eg.
>>   https://gitlab.freedesktop.org/janedoe/linux/-/settings/ci_cd), 
>> change the
> 


* Re: [PATCH v9 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-17  9:40 UTC (permalink / raw)
  To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
	Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260609-ecc_panic-v9-0-432a74002e74@debian.org>

On Tue, Jun 09, 2026 at 03:56:54AM -0700, Breno Leitao wrote:
> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running.  The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash.  In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
> 
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context.  The default is disabled so existing workloads see no
> change.
> 
> There is a selftest that tests different cases, and I tested it using
> the following variants:
> 
>   ┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
>   │ Variant │   PFN    │                          Result                           │
>   ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
>   │ rodata  │ 0x2600   │ Panic with "Memory failure: 0x2600: unrecoverable page"   │
>   ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
>   │ slab    │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
>   ├─────────┼──────────┼───────────────────────────────────────────────────────────┤
>   │ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
>   └─────────┴──────────┴───────────────────────────────────────────────────────────┘
> 
> Each one shows the same call trace, exactly the path the series builds:
> 
>   hard_offline_page_store
>     → memory_failure
>       → action_result
>         → panic("Memory failure: %#lx: unrecoverable page")

While debugging another issue earlier today, I found a kernel that hit a
previously ignored page later in the day and then started randomly
misbehaving and crashing.

 Memory failure: 0x140ae: unhandlable page.
 Memory failure: 0x140ae: recovery action for get hwpoison page: Ignored                     <-- Ignored 
 loop0: detected capacity change from 0 to 15241056
 EDAC MC0: 1 UE multi-bit ECC on LP5x_0 LP5x_0 (node:0 card:0 module:0 rank:0 bank:2 device:28 row:42700 column:96
 {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 308
 {3}[Hardware Error]: event severity: recoverable
 {3}[Hardware Error]:  imprecise tstamp: 2026-06-16 02:50:03
 {3}[Hardware Error]:  Error 0, type: recoverable
 {3}[Hardware Error]:   section_type: memory error
 {3}[Hardware Error]:   physical_address: 0x0000000aeccde180
 {3}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
 {3}[Hardware Error]:   node:0 card:0 module:0 rank:0 bank:2 device:28 row:42700 column:960 requestor_id:0x0000000
 {3}[Hardware Error]:   error_type: 3, multi-bit ECC
 {3}[Hardware Error]:   DIMM location: LP5x_0 LP5x_0
 Memory failure: 0xaeccd: recovery action for dirty LRU page: Recovered

 Internal error: synchronous external abort: 0000000096000410 [#1]  SMP
 Modules linked in: ghes_edac(E) squashfs(E) act_gact(E) sch_fq(E) tcp_diag(E) inet_diag(E) cls_bpf(E) evdev(E) sm
 CPU: 51 UID: 0 PID: 1 Comm: systemd Kdump: loaded Tainted: G   M       OE K     6.16.1-0_fbk2_0_gf40efc324cc8 #1
 Tainted: [M]=MACHINE_CHECK, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [K]=LIVEPATCH
 pstate: 834010c9 (Nzcv daIF +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
 pc : clear_inode+0x34/0x108
 lr : proc_evict_inode.llvm.1771226604092943895+0x28/0x68
 sp : ffff800083f6f8d0
 x29: ffff800083f6f8e0 x28: 0000000000000011 x27: ffff0000c1378788
 x26: ffffffffffffffff x25: ffff800082747de0 x24: ffff0000c0ae9898
 x23: ffff8000819155f8 x22: ffff0000c0ae9888 x21: ffff0000c0ae9808
 x20: ffff0000c0ae9818 x19: ffff0000c0ae9788 x18: 000000000000001c
 x17: 0000000000000018 x16: 0000000000000040 x15: 0000000000000000
 x14: 0000000000000001 x13: 0000000000000000 x12: 0000000000002710
 x11: ffff0000c0ae9898 x10: ffff0000c1299b58 x9 : 0000000000000001
 x8 : ffff0000c0ae9900 x7 : ffff8000828db000 x6 : 0000000000005040
 x5 : ffffffffffffffff x4 : ffffffdfc05c8aa0 x3 : ffff000126470000
 x2 : ffffffffffffffff x1 : 0000000000000000 x0 : ffff0000c0ae9788
 Call trace:
  clear_inode+0x34/0x108 (P)
  proc_evict_inode.llvm.1771226604092943895+0x28/0x68
  evict+0xec/0x328
  iput+0xa8/0x310
  dentry_unlink_inode+0xa4/0x188
  __dentry_kill+0x74/0x358
  shrink_dentry_list+0xc8/0x198
 ....
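
For anyone correlating the GHES record with the memory-failure lines in the
log above: the PFN is the physical address shifted right by the page shift.
The sketch below is only an illustration; phys_to_pfn() and
check_log_values() are hypothetical userspace helpers, not kernel functions.
The page size is not stated in the log. A 16-bit shift (64K pages) is an
inference from the fact that it reproduces the 0xaeccd value reported for
physical address 0xaeccde180.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a physical address to a page frame number. */
static uint64_t phys_to_pfn(uint64_t phys, unsigned int page_shift)
{
	return phys >> page_shift;
}

/* Check the values taken from the log against both common page shifts. */
static void check_log_values(void)
{
	/* physical_address from the GHES record */
	const uint64_t phys = 0x0000000aeccde180ULL;

	/* Assumed 64K pages: matches "Memory failure: 0xaeccd". */
	assert(phys_to_pfn(phys, 16) == 0xaeccdULL);

	/* With 4K pages the PFN would be 0xaeccde, which is not what was logged. */
	assert(phys_to_pfn(phys, 12) == 0xaeccdeULL);
}
```

Under that assumption, the "dirty LRU page: Recovered" line refers to the
page covered by the multi-bit ECC record, while 0x140ae is the page that was
ignored earlier.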

* Re: [PATCH v4 3/5] rpmsg: virtio_rpmsg_bus: get buffer size from config space
From: Arnaud POULIQUEN @ 2026-06-17  9:15 UTC (permalink / raw)
  To: Tanmay Shah, andersson, mathieu.poirier, corbet, skhan
  Cc: linux-remoteproc, linux-doc, linux-kernel
In-Reply-To: <20260615202007.3484668-4-tanmay.shah@amd.com>

Hi Tanmay,

On 6/15/26 22:20, Tanmay Shah wrote:
> 512 bytes isn't always suitable for all cases; let the firmware
> maker decide the best value from the resource table.
> This is enabled by the VIRTIO_RPMSG_F_BUFSZ feature bit.
> 
> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
> ---
> 
> Changes in v4: squash to virtio rpmsg config patch
>    - Introduce new patch to modify rpmsg.rst documentation
>    - check version is always 1.
>    - check size field is same as size of struct virtio_rpmsg_config
>    - introduce alignment field
>    - check alignment field is power of 2
>    - check tx and rx buf size is aligned with alignment passed in the
>      structure
> 
> Changes in v3:
>    - change version field from u16 to u8
>    - introduce size field in the rpmsg_virtio_config structure
>    - check version field is set to any non-zero value.
>    - check size field is not 0.
>    - Remove field for private config, as not needed for now.
>    - add documentation of rpmsg_virtio_config structure
> 
>   drivers/rpmsg/virtio_rpmsg_bus.c   | 129 ++++++++++++++++++++++++-----
>   include/linux/rpmsg/virtio_rpmsg.h |  50 +++++++++++
>   2 files changed, 160 insertions(+), 19 deletions(-)
>   create mode 100644 include/linux/rpmsg/virtio_rpmsg.h
> 
> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
> index 99df1ae07055..a59925f870a4 100644
> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
> @@ -15,11 +15,13 @@
>   #include <linux/idr.h>
>   #include <linux/jiffies.h>
>   #include <linux/kernel.h>
> +#include <linux/log2.h>
>   #include <linux/module.h>
>   #include <linux/mutex.h>
>   #include <linux/rpmsg.h>
>   #include <linux/rpmsg/byteorder.h>
>   #include <linux/rpmsg/ns.h>
> +#include <linux/rpmsg/virtio_rpmsg.h>
>   #include <linux/scatterlist.h>
>   #include <linux/slab.h>
>   #include <linux/sched.h>
> @@ -39,7 +41,8 @@
>    * @tx_bufs:	kernel address of tx buffers
>    * @num_rx_buf: total number of rx buffers
>    * @num_tx_buf: total number of tx buffers
> - * @buf_size:   size of one rx or tx buffer
> + * @rx_buf_size: size of one rx buffer
> + * @tx_buf_size: size of one tx buffer
>    * @last_tx_buf: index of last tx buffer used
>    * @bufs_dma:	dma base addr of the buffers
>    * @tx_lock:	protects svq and tx_bufs, to allow concurrent senders.
> @@ -59,7 +62,8 @@ struct virtproc_info {
>   	void *rx_bufs, *tx_bufs;
>   	unsigned int num_rx_buf;
>   	unsigned int num_tx_buf;
> -	unsigned int buf_size;
> +	unsigned int rx_buf_size;
> +	unsigned int tx_buf_size;
>   	int last_tx_buf;
>   	dma_addr_t bufs_dma;
>   	struct mutex tx_lock;
> @@ -68,9 +72,6 @@ struct virtproc_info {
>   	wait_queue_head_t sendq;
>   };
>   
> -/* The feature bitmap for virtio rpmsg */
> -#define VIRTIO_RPMSG_F_NS	0 /* RP supports name service notifications */
> -
>   /**
>    * struct rpmsg_hdr - common header for all rpmsg messages
>    * @src: source address
> @@ -128,7 +129,7 @@ struct virtio_rpmsg_channel {
>    * processor.
>    */
>   #define MAX_RPMSG_NUM_BUFS	(256)
> -#define MAX_RPMSG_BUF_SIZE	(512)
> +#define DEFAULT_RPMSG_BUF_SIZE	(512)
>   
>   /*
>    * Local addresses are dynamically allocated on-demand.
> @@ -444,7 +445,7 @@ static void *get_a_tx_buf(struct virtproc_info *vrp)
>   
>   	/* either pick the next unused tx buffer */
>   	if (vrp->last_tx_buf < vrp->num_tx_buf)
> -		ret = vrp->tx_bufs + vrp->buf_size * vrp->last_tx_buf++;
> +		ret = vrp->tx_bufs + vrp->tx_buf_size * vrp->last_tx_buf++;
>   	/* or recycle a used one */
>   	else
>   		ret = virtqueue_get_buf(vrp->svq, &len);
> @@ -514,7 +515,7 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev,
>   	 * messaging), or to improve the buffer allocator, to support
>   	 * variable-length buffer sizes.
>   	 */
> -	if (len > vrp->buf_size - sizeof(struct rpmsg_hdr)) {
> +	if (len > vrp->tx_buf_size - sizeof(struct rpmsg_hdr)) {
>   		dev_err(dev, "message is too big (%d)\n", len);
>   		return -EMSGSIZE;
>   	}
> @@ -647,7 +648,7 @@ static ssize_t virtio_rpmsg_get_mtu(struct rpmsg_endpoint *ept)
>   	struct rpmsg_device *rpdev = ept->rpdev;
>   	struct virtio_rpmsg_channel *vch = to_virtio_rpmsg_channel(rpdev);
>   
> -	return vch->vrp->buf_size - sizeof(struct rpmsg_hdr);
> +	return vch->vrp->tx_buf_size - sizeof(struct rpmsg_hdr);
>   }
>   
>   static int rpmsg_recv_single(struct virtproc_info *vrp, struct device *dev,
> @@ -673,7 +674,7 @@ static int rpmsg_recv_single(struct virtproc_info *vrp, struct device *dev,
>   	 * We currently use fixed-sized buffers, so trivially sanitize
>   	 * the reported payload length.
>   	 */
> -	if (len > vrp->buf_size ||
> +	if (len > vrp->rx_buf_size ||
>   	    msg_len > (len - sizeof(struct rpmsg_hdr))) {
>   		dev_warn(dev, "inbound msg too big: (%d, %d)\n", len, msg_len);
>   		return -EINVAL;
> @@ -706,7 +707,7 @@ static int rpmsg_recv_single(struct virtproc_info *vrp, struct device *dev,
>   		dev_warn_ratelimited(dev, "msg received with no recipient\n");
>   
>   	/* publish the real size of the buffer */
> -	rpmsg_sg_init(&sg, msg, vrp->buf_size);
> +	rpmsg_sg_init(&sg, msg, vrp->rx_buf_size);
>   
>   	/* add the buffer back to the remote processor's virtqueue */
>   	err = virtqueue_add_inbuf(vrp->rvq, &sg, 1, msg, GFP_KERNEL);
> @@ -820,10 +821,13 @@ static int rpmsg_probe(struct virtio_device *vdev)
>   	struct virtproc_info *vrp;
>   	struct virtio_rpmsg_channel *vch = NULL;
>   	struct rpmsg_device *rpdev_ns, *rpdev_ctrl;
> +	u16 rpmsg_buf_align = 0;
>   	void *bufs_va;
>   	int err = 0, i;
>   	size_t total_buf_space;
>   	bool notify;
> +	u8 version;
> +	u16 size;
>   
>   	vrp = kzalloc_obj(*vrp);
>   	if (!vrp)
> @@ -855,9 +859,90 @@ static int rpmsg_probe(struct virtio_device *vdev)
>   	else
>   		vrp->num_tx_buf = MAX_RPMSG_NUM_BUFS;
>   
> -	vrp->buf_size = MAX_RPMSG_BUF_SIZE;
> +	/*
> +	 * If VIRTIO_RPMSG_F_BUFSZ feature is supported, then configure buf
> +	 * size from virtio device config space from the resource table.
> +	 * If the feature is not supported, then assign default buf size.
> +	 */
> +	if (virtio_has_feature(vdev, VIRTIO_RPMSG_F_BUFSZ)) {
> +		virtio_cread(vdev, struct virtio_rpmsg_config,
> +			     version, &version);
> +
> +		/* for now we support only v1 */
> +		if (version != RPMSG_VDEV_CONFIG_V1) {
> +			dev_err(&vdev->dev,
> +				"unsupported vdev config version %u\n", version);
> +			err = -EINVAL;
> +			goto vqs_del;
> +		}
> +
> +		/* size of the config space must match */
> +		virtio_cread(vdev, struct virtio_rpmsg_config,
> +			     size, &size);
> +		if (size != sizeof(struct virtio_rpmsg_config)) {
> +			dev_err(&vdev->dev, "invalid size of vdev config %u\n",
> +				size);
> +			err = -EINVAL;
> +			goto vqs_del;
> +		}
>   
> -	total_buf_space = (vrp->num_rx_buf + vrp->num_tx_buf) * vrp->buf_size;
> +		/*
> +		 * Optional alignment applied to each buffer size and to the TX
> +		 * buffer base address (e.g. to align buffers on a cache line).
> +		 * It must be a power of two; zero means no extra alignment.
> +		 */
> +		virtio_cread(vdev, struct virtio_rpmsg_config,
> +			     rpmsg_buf_align, &rpmsg_buf_align);
> +		if (rpmsg_buf_align && !is_power_of_2(rpmsg_buf_align)) {
> +			dev_err(&vdev->dev,
> +				"bad vdev config: rpmsg_buf_align %u is not a power of two\n",
> +				rpmsg_buf_align);
> +			err = -EINVAL;
> +			goto vqs_del;
> +		}
> +
> +		/* note: tx and rx are defined from remote view */
> +		virtio_cread(vdev, struct virtio_rpmsg_config,
> +			     txbuf_size, &vrp->rx_buf_size);
> +		virtio_cread(vdev, struct virtio_rpmsg_config,
> +			     rxbuf_size, &vrp->tx_buf_size);
> +
> +		/* The buffers must hold at least the rpmsg header */
> +		if (vrp->rx_buf_size < sizeof(struct rpmsg_hdr) ||
> +		    vrp->tx_buf_size < sizeof(struct rpmsg_hdr)) {
> +			dev_err(&vdev->dev,
> +				"bad vdev config: rx buf sz = %u, tx buf sz = %u\n",
> +				vrp->rx_buf_size, vrp->tx_buf_size);
> +			err = -EINVAL;
> +			goto vqs_del;
> +		}
> +
> +		/*
> +		 * The buffer size must be aligned to the provided alignment for
> +		 * so that the start address of tx bufs can be aligned.
> +		 */

Please remove 'tx' here, as this also concerns the Rx buffers.


What about removing this check and handling alignment during buffer allocation instead?

For example, if the alignment is on a 64-bit address and the tx_buffer 
and rx_buffer sizes are 40 bytes, 48 bytes can be allocated in memory 
for each buffer, and the virtio descriptor can be filled with aligned 
addresses.

In other words, the rpmsg_buf_align field contains the alignment 
constraint from the remote processor. If the Linux kernel wants to 
impose another alignment constraint, it must test or update 
rpmsg_buf_align, but it must not impose alignment on the buffer size.
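
A minimal sketch of that idea, assuming the allocator keeps the advertised
buffer size as-is and only rounds the per-buffer stride up to the alignment.
The helpers is_pow2(), buf_stride() and buf_offset() are hypothetical
userspace stand-ins, not existing rpmsg functions. The 16-byte alignment is
an assumed value, chosen because it rounds a 40-byte buffer up to a 48-byte
stride.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true when align is a non-zero power of two. */
static int is_pow2(uint32_t align)
{
	return align && !(align & (align - 1));
}

/*
 * Round the buffer size up to the next multiple of align.
 * align == 0 means "no extra alignment", as in the quoted patch.
 */
static uint32_t buf_stride(uint32_t size, uint32_t align)
{
	if (!align)
		return size;
	assert(is_pow2(align));
	return (size + align - 1) & ~(align - 1);
}

/*
 * Offset of buffer idx inside the allocation. If the base address is
 * aligned, every buffer start is aligned too, whatever the buffer size.
 */
static uint32_t buf_offset(uint32_t idx, uint32_t size, uint32_t align)
{
	return idx * buf_stride(size, align);
}
```

With these helpers, buf_stride(40, 16) gives 48 and buffer 2 starts at offset
96, while the descriptor length can still carry the 40-byte size advertised
in the config space.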


> +		if (rpmsg_buf_align &&
> +		    (!IS_ALIGNED(vrp->rx_buf_size, rpmsg_buf_align) ||
> +		     !IS_ALIGNED(vrp->tx_buf_size, rpmsg_buf_align))) {
> +			dev_err(&vdev->dev,
> +				"bad vdev config: buf sizes (rx %u, tx %u) not aligned to %u\n",
> +				vrp->rx_buf_size, vrp->tx_buf_size,
> +				rpmsg_buf_align);
> +			err = -EINVAL;
> +			goto vqs_del;
> +		}
> +
> +		dev_dbg(&vdev->dev,
> +			"vdev config: ver=%u, align=0x%x, rx sz = 0x%x, tx sz = 0x%x\n",
> +			version, rpmsg_buf_align, vrp->rx_buf_size,
> +			vrp->tx_buf_size);
> +	} else {
> +		vrp->rx_buf_size = DEFAULT_RPMSG_BUF_SIZE;
> +		vrp->tx_buf_size = DEFAULT_RPMSG_BUF_SIZE;
> +	}
> +
> +	total_buf_space = (vrp->num_rx_buf * vrp->rx_buf_size) +
> +			  (vrp->num_tx_buf * vrp->tx_buf_size);
>   
>   	/* allocate coherent memory for the buffers */
>   	bufs_va = dma_alloc_coherent(vdev->dev.parent,
> @@ -874,15 +959,20 @@ static int rpmsg_probe(struct virtio_device *vdev)
>   	/* first part of the buffers is dedicated for RX */
>   	vrp->rx_bufs = bufs_va;
>   
> -	/* and second part is dedicated for TX */
> -	vrp->tx_bufs = bufs_va + vrp->num_rx_buf * vrp->buf_size;
> +	/*
> +	 * Here buf_va is aligned to a page. Also rx buf size is aligned with
> +	 * cache line alignment provided by the firmware, so tx buf's start
> +	 * address is guaranteed to be aligned with the alignment provided by
> +	 * the firmware.
> +	 */
> +	vrp->tx_bufs = bufs_va + (vrp->num_rx_buf * vrp->rx_buf_size);
>   
>   	/* set up the receive buffers */
>   	for (i = 0; i < vrp->num_rx_buf; i++) {
>   		struct scatterlist sg;
> -		void *cpu_addr = vrp->rx_bufs + i * vrp->buf_size;
> +		void *cpu_addr = vrp->rx_bufs + i * vrp->rx_buf_size;
>   
> -		rpmsg_sg_init(&sg, cpu_addr, vrp->buf_size);
> +		rpmsg_sg_init(&sg, cpu_addr, vrp->rx_buf_size);
>   
>   		err = virtqueue_add_inbuf(vrp->rvq, &sg, 1, cpu_addr,
>   					  GFP_KERNEL);
> @@ -965,8 +1055,8 @@ static int rpmsg_remove_device(struct device *dev, void *data)
>   static void rpmsg_remove(struct virtio_device *vdev)
>   {
>   	struct virtproc_info *vrp = vdev->priv;
> -	unsigned int num_bufs = vrp->num_rx_buf + vrp->num_tx_buf;
> -	size_t total_buf_space = num_bufs * vrp->buf_size;
> +	size_t total_buf_space = (vrp->num_rx_buf * vrp->rx_buf_size) +
> +				 (vrp->num_tx_buf * vrp->tx_buf_size);
>   	int ret;
>   
>   	virtio_reset_device(vdev);
> @@ -992,6 +1082,7 @@ static struct virtio_device_id id_table[] = {
>   
>   static unsigned int features[] = {
>   	VIRTIO_RPMSG_F_NS,
> +	VIRTIO_RPMSG_F_BUFSZ,
>   };
>   
>   static struct virtio_driver virtio_ipc_driver = {
> diff --git a/include/linux/rpmsg/virtio_rpmsg.h b/include/linux/rpmsg/virtio_rpmsg.h
> new file mode 100644
> index 000000000000..7e14da68fd17
> --- /dev/null
> +++ b/include/linux/rpmsg/virtio_rpmsg.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) Pinecone Inc. 2019
> + * Copyright (C) Xiang Xiao <xiaoxiang@pinecone.net>
> + * Copyright (C) Advanced Micro Devices, Inc. 2026
> + */
> +
> +#ifndef _LINUX_VIRTIO_RPMSG_H
> +#define _LINUX_VIRTIO_RPMSG_H
> +
> +#include <linux/types.h>
> +#include <linux/virtio_types.h>
> +
> +/* The feature bitmap for virtio rpmsg */
> +#define VIRTIO_RPMSG_F_NS	0 /* RP supports name service notifications */
> +#define VIRTIO_RPMSG_F_BUFSZ	1 /* RP get buffer size from config space */
> +
> +/* Version of struct virtio_rpmsg_config understood by this driver */
> +#define RPMSG_VDEV_CONFIG_V1	1
> +
> +/**
> + * struct virtio_rpmsg_config - config space for rpmsg virtio device
> + *
> + * @version:	version of this structure, currently %RPMSG_VDEV_CONFIG_V1.
> + * @reserved:	reserved for padding, must be zero.
> + * @size:	size of this structure in bytes.
> + * @rpmsg_buf_align:	required alignment in bytes for each buffer. Must be a
> + *		power of two so that both the buffer sizes and the TX buffer
> + *		base address can be aligned (e.g. to a cache line).
> + * @reserved1:	reserved for padding, must be zero. Keeps the following 32-bit
> + *		fields naturally aligned.
> + * @txbuf_size:	Tx buf size from remote's view. For Linux this is rx buf size.
> + * @rxbuf_size:	Rx buf size from remote's view. For Linux this is tx buf size.
> + *
> + * This is the configuration structure shared by the device and the driver,
> + * read when %VIRTIO_RPMSG_F_BUFSZ is negotiated. The fields are laid out so
> + * the structure is naturally 32-bit aligned.
> + */
> +struct virtio_rpmsg_config {
> +	u8 version;
> +	u8 reserved;

What about defining the version type as u16 to avoid the reserved field?

> +	__virtio16 size;
> +	__virtio16 rpmsg_buf_align;
> +	__virtio16 reserved1;

This seems useless, as __packed prevents the compiler from inserting extra
padding bytes between fields.

> +	/* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
> +	__virtio32 txbuf_size;
> +	__virtio32 rxbuf_size;
> +} __packed;

Proposal:

+struct virtio_rpmsg_config {
+	__virtio16 version;
+	__virtio16 size;
+	/* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
+	__virtio32 txbuf_size;
+	__virtio32 rxbuf_size;
+	__virtio16 rpmsg_buf_align;
+} __packed;
+
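
To compare the two layouts, here is a small userspace sketch. It assumes
uint16_t/uint32_t as stand-ins for __virtio16/__virtio32; the struct names
cfg_v4_patch and cfg_proposal are made up for illustration. It only checks
sizes and offsets at compile time and says nothing about endianness
handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's __virtio16/__virtio32 (assumption). */
typedef uint16_t virtio16_t;
typedef uint32_t virtio32_t;

/* Layout from the v4 patch, with explicit reserved fields. */
struct cfg_v4_patch {
	uint8_t version;
	uint8_t reserved;
	virtio16_t size;
	virtio16_t rpmsg_buf_align;
	virtio16_t reserved1;
	virtio32_t txbuf_size;
	virtio32_t rxbuf_size;
} __attribute__((packed));

/* Proposed layout: 16-bit version, no reserved fields. */
struct cfg_proposal {
	virtio16_t version;
	virtio16_t size;
	virtio32_t txbuf_size;
	virtio32_t rxbuf_size;
	virtio16_t rpmsg_buf_align;
} __attribute__((packed));

_Static_assert(sizeof(struct cfg_v4_patch) == 16, "v4 layout is 16 bytes");
_Static_assert(offsetof(struct cfg_v4_patch, txbuf_size) == 8, "tx at 8");

_Static_assert(sizeof(struct cfg_proposal) == 14, "proposal is 14 bytes");
_Static_assert(offsetof(struct cfg_proposal, txbuf_size) == 4, "tx at 4");
_Static_assert(offsetof(struct cfg_proposal, rxbuf_size) == 8, "rx at 8");
_Static_assert(offsetof(struct cfg_proposal, rpmsg_buf_align) == 12,
	       "align at 12");
```

In both layouts the 32-bit fields sit on 4-byte offsets. The proposal ends
up at 14 bytes because __packed drops the tail padding a compiler would
otherwise add, so a size check against sizeof() would have to expect 14.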

Regards,
Arnaud

> +
> +#endif /* _LINUX_VIRTIO_RPMSG_H */


* [PATCH] Docs/translations/it_IT: update current minimal requirements
From: Doehyun Baek @ 2026-06-17  8:53 UTC (permalink / raw)
  To: linux-doc; +Cc: Federico Vaga, Jonathan Corbet, Shuah Khan, Doehyun Baek

Update the Italian translation of the current minimal requirements table to
match Documentation/process/changes.rst.  The translated table still listed
older versions for Rust, bindgen, pahole, Sphinx, and Python.

Signed-off-by: Doehyun Baek <doehyunbaek@gmail.com>
Cc: Federico Vaga <federico.vaga@vaga.pv.it>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
---
 Documentation/translations/it_IT/process/changes.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/translations/it_IT/process/changes.rst b/Documentation/translations/it_IT/process/changes.rst
index 7ee54c972418..1f89bae7b6c2 100644
--- a/Documentation/translations/it_IT/process/changes.rst
+++ b/Documentation/translations/it_IT/process/changes.rst
@@ -34,14 +34,14 @@ PC Card, per esempio, probabilmente non dovreste preoccuparvi di pcmciautils.
 ====================== =================  ========================================
 GNU C                  8.1                gcc --version
 Clang/LLVM (optional)  17.0.1             clang --version
-Rust (opzionale)       1.78.0             rustc --version
-bindgen (opzionale)    0.65.1             bindgen --version
+Rust (opzionale)       1.85.0             rustc --version
+bindgen (opzionale)    0.71.1             bindgen --version
 GNU make               4.0                make --version
 bash                   4.2                bash --version
 binutils               2.30               ld -v
 flex                   2.5.35             flex --version
 bison                  2.0                bison --version
-pahole                 1.16               pahole --version
+pahole                 1.26               pahole --version
 util-linux             2.10o              mount --version
 kmod                   13                 depmod -V
 e2fsprogs              1.41.4             e2fsck -V
@@ -60,12 +60,12 @@ mcelog                 0.6                mcelog --version
 iptables               1.4.2              iptables -V
 openssl & libcrypto    1.0.0              openssl version
 bc                     1.06.95            bc --version
-Sphinx\ [#f1]_         2.4.4              sphinx-build --version
+Sphinx\ [#f1]_         3.4.3              sphinx-build --version
 cpio                   any                cpio --version
 GNU tar                1.28               tar --version
 gtags (opzionale)      6.6.5              gtags --version
 mkimage (opzionale)    2017.01            mkimage --version
-Python (opzionale)     3.5.x              python3 --version
+Python (opzionale)     3.9.x              python3 --version
 ====================== =================  ========================================
 
 .. [#f1] Sphinx è necessario solo per produrre la documentazione del Kernel
-- 
2.43.0


* Re: [PATCH v18 net-next 01/11] net/nebula-matrix: add minimum nbl build framework
From: Uwe Kleine-König @ 2026-06-17  8:40 UTC (permalink / raw)
  To: illusion.wang
  Cc: dimon.zhao, alvin.wang, sam.chen, netdev, andrew+netdev, corbet,
	kuba, horms, linux-doc, pabeni, vadim.fedorenko, lukas.bulwahn,
	edumazet, enelsonmoore, skhan, hkallweit1, open list
In-Reply-To: <20260611044916.2383-2-illusion.wang@nebula-matrix.com>

On Thu, Jun 11, 2026 at 12:49:00PM +0800, illusion.wang wrote:
> +static int nbl_probe(struct pci_dev *pdev,
> +		     const struct pci_device_id *id)
> +{
> +	return 0;
> +}
> +
> +static void nbl_remove(struct pci_dev *pdev)
> +{
> +}
> [...]
> +static const struct pci_device_id nbl_id_table[] = {
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_LX),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_BASE_T),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_LX_BASE_T),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_LX_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_BASE_T_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18110_LX_BASE_T_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_LX),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_BASE_T),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_LX_BASE_T),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_LX_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_BASE_T_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	{ PCI_DEVICE(NBL_VENDOR_ID, NBL_DEVICE_ID_M18000_LX_BASE_T_OCP),
> +	  .driver_data = BIT(NBL_CAP_HAS_NET_BIT) | BIT(NBL_CAP_IS_NIC_BIT) |
> +			 BIT(NBL_CAP_IS_LEONIS_BIT) },
> +	/* required as sentinel */
> +	{
> +		0,

Please drop this zero. The most usual style is `{ }`.

> +	}
> +};
> +MODULE_DEVICE_TABLE(pci, nbl_id_table);
> +
> +static struct pci_driver nbl_driver = {
> +	.name = NBL_DRIVER_NAME,
> +	.id_table = nbl_id_table,
> +	.probe = nbl_probe,
> +	.remove = nbl_remove,
> +};

The pci bus probe function has (pci_device_probe() ->
__pci_device_probe()):

	int error = 0;

	if (drv->probe) {
		...
	}
	return error;

So given that the probe function does nothing apart from returning zero,
you can just drop .probe(). (There is an additional check against
.id_table, but I'm pretty sure that isn't relevant because
pci_bus_match() already makes sure that there is a match.) The same is
true for .remove().
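
For illustration, here is a minimal userspace sketch of both points above:
an ID table terminated by an empty `{ }` sentinel, and a bus-level helper
that treats a missing callback as success. All demo_* names are
hypothetical stand-ins and are not the kernel's PCI API. The empty
initializer is the GNU C / C23 spelling used in kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct pci_device_id. */
struct demo_id {
	unsigned int vendor;
	unsigned int device;
};

/* Hypothetical stand-in for struct pci_driver; .probe is optional. */
struct demo_driver {
	const char *name;
	const struct demo_id *id_table;
	int (*probe)(const struct demo_id *id);
};

static const struct demo_id demo_ids[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
	{ }	/* sentinel: every member is zero, no explicit 0 needed */
};

/* Count the entries that precede the all-zero sentinel. */
static size_t demo_count_ids(const struct demo_id *ids)
{
	size_t n = 0;

	while (ids[n].vendor || ids[n].device)
		n++;
	return n;
}

/* Same shape as the snippet quoted above: no callback means success. */
static int demo_bus_probe(const struct demo_driver *drv,
			  const struct demo_id *id)
{
	int error = 0;

	if (drv->probe)
		error = drv->probe(id);
	return error;
}

/* A driver that provides no .probe at all. */
static const struct demo_driver demo_drv = {
	.name = "demo",
	.id_table = demo_ids,
};
```

With this model, demo_bus_probe(&demo_drv, &demo_ids[0]) returns 0 even
though demo_drv has no probe callback, which is the behaviour the empty
nbl_probe() stub would otherwise have to provide by hand.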

Best regards
Uwe

* Re: [PATCH 1/3] PCI: rcar-gen4: Configure AXIINTC if iMSI-RX not used
From: Geert Uytterhoeven @ 2026-06-17  8:26 UTC (permalink / raw)
  To: Marek Vasut
  Cc: linux-pci, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Marc Zyngier, Rob Herring, devicetree, linux-arm-kernel,
	linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <20260617030008.154449-1-marek.vasut+renesas@mailbox.org>

Hi Marek,

On Wed, 17 Jun 2026 at 05:00, Marek Vasut
<marek.vasut+renesas@mailbox.org> wrote:
> In case MSI are enabled, but DWC built-in iMSI-RX is not in use, the
> MSI are handled via GIC ITS. Configure all controller MSI registers
> fully.
>
> Set or clear MSI capability register MSICAP0 MSI enable MSIE bit and
> PCIe Interrupt Status 0 Enable register PCIEINTSTS0EN MSI interrupt
> enable MSI_CTRL_INT bit according to MSI enable state, set both bits
> if MSI are enabled, clear both bits if MSI are disabled.
>
> If MSI are disabled, or MSI are enabled and iMSI-RX is used, then
> deconfigure AXIINTCADDR and AXIINTCCONT to 0, which disables any
> pass through of MSI TLPs onto the AXI bus and then further into
> GIC ITS translation registers.
>
> If MSI are enabled and iMSI-RX is not used, then configure AXIINTCADDR
> with target address of GIC ITS translation registers, and configure
> AXIINTCCONT to enable MSI TLP pass through onto AXI bus and into the
> GIC ITS. This specific configuration allows handling of MSI via the
> GIC ITS instead of integrated iMSI-RX.
>
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>

Thanks for your patch!
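
As a reading aid, the decision logic from the commit message can be
modelled in plain userspace C. This is only a sketch: the constants are
copied from the patch under review, while struct demo_msi_regs and
demo_msi_config() are hypothetical names that compute the values the
driver would write rather than touching any hardware. The PCIEINTSTS0EN
handling is left out because the value of MSI_CTRL_INT is not part of
this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants as defined in the patch under review. */
#define MSICAP0_MSIE		(UINT32_C(1) << 16)	/* BIT(16) */
#define AXIINTCADDR_VAL		UINT32_C(0xf1050000)
#define INTC_EN			(UINT32_C(1) << 31)	/* BIT(31) */
#define INTC_MASK		UINT32_C(0x00000ffc)	/* GENMASK(11, 2) */

/* Hypothetical container for the register values to be written. */
struct demo_msi_regs {
	uint32_t msicap0;
	uint32_t axiintcaddr;
	uint32_t axiintccont;
};

static struct demo_msi_regs demo_msi_config(uint32_t msicap0_old,
					    bool msi_enabled,
					    bool use_imsi_rx)
{
	/* AXIINTC fields start out as 0, i.e. pass-through disabled. */
	struct demo_msi_regs r = { .msicap0 = msicap0_old };

	/* MSIE follows the global MSI enable state. */
	if (msi_enabled)
		r.msicap0 |= MSICAP0_MSIE;
	else
		r.msicap0 &= ~MSICAP0_MSIE;

	/* Only MSI without iMSI-RX routes MSI TLPs to the GIC ITS. */
	if (msi_enabled && !use_imsi_rx) {
		r.axiintcaddr = AXIINTCADDR_VAL;
		r.axiintccont = INTC_EN | INTC_MASK;
	}

	return r;
}
```

The three cases then read as follows: MSI disabled clears everything, MSI
with iMSI-RX sets only MSIE, and MSI without iMSI-RX additionally
programs the AXIINTC pair.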

> --- a/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> +++ b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> @@ -31,6 +31,10 @@
>  #define DEVICE_TYPE_RC         BIT(4)
>  #define BIFUR_MOD_SET_ON       BIT(0)
>
> +/* MSI Capability */
> +#define MSICAP0                        0x0050
> +#define MSICAP0_MSIE           BIT(16)
> +
>  /* PCIe Interrupt Status 0 */
>  #define PCIEINTSTS0            0x0084
>
> @@ -55,6 +59,16 @@
>  #define APP_HOLD_PHY_RST       BIT(16)
>  #define APP_LTSSM_ENABLE       BIT(0)
>
> +/* INTC address */
> +#define AXIINTCADDR            0x0a00
> +/* GITS GIC ITS translation register */
> +#define AXIINTCADDR_VAL                0xf1050000
> +
> +/* INTC control & mask */
> +#define AXIINTCCONT            0x0a04
> +#define INTC_EN                        BIT(31)
> +#define INTC_MASK              GENMASK(11, 2)
> +
>  /* PCIe Power Management Control */
>  #define PCIEPWRMNGCTRL         0x0070
>  #define APP_CLK_REQ_N          BIT(11)
> @@ -305,6 +319,39 @@ static struct rcar_gen4_pcie *rcar_gen4_pcie_alloc(struct platform_device *pdev)
>         return rcar;
>  }
>
> +static void rcar_gen4_pcie_host_msi_init(struct dw_pcie_rp *pp)
> +{
> +       struct dw_pcie *dw = to_dw_pcie_from_pp(pp);
> +       struct rcar_gen4_pcie *rcar = to_rcar_gen4_pcie(dw);
> +       u32 val;
> +
> +       /* Make sure MSICAP0 MSIE is configured. */
> +       val = dw_pcie_readl_dbi(dw, MSICAP0);
> +       if (pci_msi_enabled())
> +               val |= MSICAP0_MSIE;
> +       else
> +               val &= ~MSICAP0_MSIE;
> +       dw_pcie_writel_dbi(dw, MSICAP0, val);
> +
> +       if (!pci_msi_enabled() || pp->use_imsi_rx) {
> +               /* Clear AXIINTC mapping. */
> +               writel(0, rcar->base + AXIINTCADDR);
> +               writel(0, rcar->base + AXIINTCCONT);
> +       } else {
> +               /* Point AXIINTC to GIC ITS and enable. */
> +               writel(AXIINTCADDR_VAL, rcar->base + AXIINTCADDR);
> +               writel(INTC_EN | INTC_MASK, rcar->base + AXIINTCCONT);
> +       }
> +
> +       /* Configure MSI interrupt signal */
> +       val = readl(rcar->base + PCIEINTSTS0EN);
> +       if (pci_msi_enabled())
> +               val |= MSI_CTRL_INT;
> +       else
> +               val &= ~MSI_CTRL_INT;
> +       writel(val, rcar->base + PCIEINTSTS0EN);
> +}
> +
>  static int rcar_gen4_pcie_enable_device(struct pci_host_bridge *bridge,

FTR, this has a contextual dependency on "[PATCH v2] PCI: rcar-gen4:
Limit Max_Read_Request_Size and Max_Payload_Size to 256 Bytes"
(https://lore.kernel.org/all/20260519195219.189323-1-marek.vasut+renesas@mailbox.org).

>                                         struct pci_dev *dev)
>  {

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

* Re: [PATCH 2/4] mm/slub: preserve previous object lifetime in user tracking
From: Hao Li @ 2026-06-17  7:54 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, Harry Yoo,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	David Hildenbrand, Lorenzo Stoakes, liam, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jonathan Corbet, Shuah Khan,
	linux-doc, linux-kernel
In-Reply-To: <20260616141410.52117-3-pengpeng@iscas.ac.cn>

On Tue, Jun 16, 2026 at 10:14:08PM +0800, Pengpeng Hou wrote:
> SLAB_STORE_USER stores one allocation track and one free track for an
> object.  When that object is reused, the next allocation overwrites the
> allocation track.  If a stale pointer from the previous lifetime is later
> freed or otherwise reported, the free/check report can contain the victim
> allocation and the stale operation while the previous completed alloc/free
> pair has already been overwritten.
> 
> Keep one previous completed lifetime in the existing user tracking
> metadata.  When an object is allocated and the current allocation/free
> tracks both exist, copy that completed lifetime to the previous-lifetime
> slots before recording the new allocation.  Clear the current free track
> when the new allocation begins so the current lifetime does not continue
> to display a free from the old lifetime.
> 
> Print the previous object lifetime when it is available.  This is
> diagnostic information only; it does not infer semantic ownership or
> identify the root cause of a use-after-free.
> 
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> ---
>  mm/slub.c | 66 +++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 55 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 43d4febd5bf2..358f42e92207 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -327,7 +327,13 @@ struct track {
>  	unsigned long when;	/* When did the operation occur */
>  };
>  
> -enum track_item { TRACK_ALLOC, TRACK_FREE, TRACK_NR };
> +enum track_item {
> +	TRACK_ALLOC,
> +	TRACK_FREE,
> +	TRACK_PREV_ALLOC,
> +	TRACK_PREV_FREE,
> +	TRACK_NR,
> +};
>  
>  static inline unsigned int user_tracking_size(slab_flags_t flags)
>  {
> @@ -1080,12 +1086,37 @@ static void set_track_update(struct kmem_cache *s, void *object,
>  	p->when = jiffies;
>  }
>  
> -static __always_inline void set_track(struct kmem_cache *s, void *object,
> -				      enum track_item alloc, unsigned long addr, gfp_t gfp_flags)
> +static bool track_has_record(const struct track *t)
> +{
> +	return t->addr;
> +}

How about making this inline?

> +
> +static void clear_track(struct kmem_cache *s, void *object,
> +			enum track_item track)
> +{
> +	memset(get_track(s, object, track), 0, sizeof(struct track));
> +}
> +
> +static void save_previous_lifetime(struct kmem_cache *s, void *object)
> +{
> +	struct track *alloc = get_track(s, object, TRACK_ALLOC);
> +	struct track *free = get_track(s, object, TRACK_FREE);
> +
> +	if (!track_has_record(alloc) || !track_has_record(free))
> +		return;
> +
> +	*get_track(s, object, TRACK_PREV_ALLOC) = *alloc;
> +	*get_track(s, object, TRACK_PREV_FREE) = *free;

Maybe we can use memcpy instead of copying them one by one.

> +}
> +
> +static __always_inline void set_alloc_track(struct kmem_cache *s, void *object,
> +					    unsigned long addr, gfp_t gfp_flags)
>  {
>  	depot_stack_handle_t handle = set_track_prepare(gfp_flags);
>  
> -	set_track_update(s, object, alloc, addr, handle);
> +	save_previous_lifetime(s, object);
> +	set_track_update(s, object, TRACK_ALLOC, addr, handle);
> +	clear_track(s, object, TRACK_FREE);

sashiko has a comment:

https://sashiko.dev/#/patchset/20260616141410.52117-1-pengpeng%40iscas.ac.cn

It seems a simple fix could be to remove clear_track() and allow the stale
free track.

>  }
>  
>  static void init_tracking(struct kmem_cache *s, void *object)
> @@ -1120,11 +1151,22 @@ static void print_track(const char *s, struct track *t, unsigned long pr_time)
>  void print_tracking(struct kmem_cache *s, void *object)
>  {
>  	unsigned long pr_time = jiffies;
> +	struct track *prev_alloc;
> +	struct track *prev_free;
> +
>  	if (!(s->flags & SLAB_STORE_USER))
>  		return;
>  
>  	print_track("Allocated", get_track(s, object, TRACK_ALLOC), pr_time);
>  	print_track("Freed", get_track(s, object, TRACK_FREE), pr_time);
> +
> +	prev_alloc = get_track(s, object, TRACK_PREV_ALLOC);
> +	prev_free = get_track(s, object, TRACK_PREV_FREE);
> +	if (track_has_record(prev_alloc) || track_has_record(prev_free)) {
> +		pr_err("Previous object lifetime:\n");
> +		print_track("Previously allocated", prev_alloc, pr_time);
> +		print_track("Previously freed", prev_free, pr_time);
> +	}
>  }
>  
>  static void print_slab_info(const struct slab *slab)
> @@ -1371,10 +1413,12 @@ check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
>   *
>   * [Metadata starts at object + s->inuse]
>   *   - A. freelist pointer (if freeptr_outside_object)
> - *   - B. alloc tracking (SLAB_STORE_USER)
> - *   - C. free tracking (SLAB_STORE_USER)
> - *   - D. original request size (SLAB_KMALLOC && SLAB_STORE_USER)
> - *   - E. KASAN metadata (if enabled)
> + *   - B. current alloc tracking (SLAB_STORE_USER)
> + *   - C. current free tracking (SLAB_STORE_USER)
> + *   - D. previous alloc tracking (SLAB_STORE_USER)
> + *   - E. previous free tracking (SLAB_STORE_USER)
> + *   - F. original request size (SLAB_KMALLOC && SLAB_STORE_USER)
> + *   - G. KASAN metadata (if enabled)
>   *
>   * [Mandatory padding] (if CONFIG_SLUB_DEBUG && SLAB_RED_ZONE)
>   *   - One mandatory debug word to guarantee a minimum poisoned gap
> @@ -2029,8 +2073,8 @@ static inline void slab_pad_check(struct kmem_cache *s, struct slab *slab) {}
>  static inline int check_object(struct kmem_cache *s, struct slab *slab,
>  			void *object, u8 val) { return 1; }
>  static inline depot_stack_handle_t set_track_prepare(gfp_t gfp_flags) { return 0; }
> -static inline void set_track(struct kmem_cache *s, void *object,
> -			     enum track_item alloc, unsigned long addr, gfp_t gfp_flags) {}
> +static inline void set_alloc_track(struct kmem_cache *s, void *object,
> +				   unsigned long addr, gfp_t gfp_flags) {}
>  static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
>  					struct slab *slab) {}
>  static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
> @@ -4522,7 +4566,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  
>  success:
>  	if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
> -		set_track(s, object, TRACK_ALLOC, addr, gfpflags);
> +		set_alloc_track(s, object, addr, gfpflags);
>  
>  	return object;
>  }
> -- 
> 2.43.0
> 
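
To make the bookkeeping easier to follow, here is a small userspace model
of the patch as posted, with the memcpy() suggestion applied. It is a
sketch under assumptions: struct demo_track is a cut-down stand-in for
struct track, the tracks live in a plain array, and a single memcpy() of
two adjacent elements is only one possible reading of the suggestion. It
relies on ALLOC/FREE and PREV_ALLOC/PREV_FREE being neighbours in the
enum.

```c
#include <stdbool.h>
#include <string.h>

/* Cut-down, hypothetical stand-in for the kernel's struct track. */
struct demo_track {
	unsigned long addr;	/* caller address, 0 means "no record" */
	unsigned long when;	/* time of the operation */
};

enum demo_track_item {
	DEMO_ALLOC,
	DEMO_FREE,
	DEMO_PREV_ALLOC,
	DEMO_PREV_FREE,
	DEMO_NR,
};

static inline bool demo_has_record(const struct demo_track *t)
{
	return t->addr != 0;
}

/* Save a completed alloc/free pair with one copy of two elements. */
static void demo_save_previous(struct demo_track tracks[DEMO_NR])
{
	if (!demo_has_record(&tracks[DEMO_ALLOC]) ||
	    !demo_has_record(&tracks[DEMO_FREE]))
		return;

	memcpy(&tracks[DEMO_PREV_ALLOC], &tracks[DEMO_ALLOC],
	       2 * sizeof(tracks[0]));
}

static void demo_set_free(struct demo_track tracks[DEMO_NR],
			  unsigned long addr, unsigned long when)
{
	tracks[DEMO_FREE].addr = addr;
	tracks[DEMO_FREE].when = when;
}

static void demo_set_alloc(struct demo_track tracks[DEMO_NR],
			   unsigned long addr, unsigned long when)
{
	demo_save_previous(tracks);
	tracks[DEMO_ALLOC].addr = addr;
	tracks[DEMO_ALLOC].when = when;
	/* As in the posted patch: the new lifetime starts without a free. */
	memset(&tracks[DEMO_FREE], 0, sizeof(tracks[0]));
}
```

After an alloc, a free and a second alloc, the first alloc/free pair sits
in the PREV slots and the current free slot is empty. Dropping the
memset() in demo_set_alloc() would model the alternative of keeping the
stale free track.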
-- 
Thanks,
Hao

* Re: [PATCH 1/3] PCI: rcar-gen4: Configure AXIINTC if iMSI-RX not used
From: Marc Zyngier @ 2026-06-17  7:28 UTC (permalink / raw)
  To: Marek Vasut
  Cc: linux-pci, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Rob Herring, devicetree, linux-arm-kernel, linux-doc,
	linux-kernel, linux-renesas-soc
In-Reply-To: <20260617030008.154449-1-marek.vasut+renesas@mailbox.org>

On Wed, 17 Jun 2026 03:59:44 +0100,
Marek Vasut <marek.vasut+renesas@mailbox.org> wrote:
> 
> In case MSI are enabled, but DWC built-in iMSI-RX is not in use, the
> MSI are handled via GIC ITS. Configure all controller MSI registers
> fully.
> 
> Set or clear MSI capability register MSICAP0 MSI enable MSIE bit and
> PCIe Interrupt Status 0 Enable register PCIEINTSTS0EN MSI interrupt
> enable MSI_CTRL_INT bit according to MSI enable state, set both bits
> if MSI are enabled, clear both bits if MSI are disabled.
> 
> If MSI are disabled, or MSI are enabled and iMSI-RX is used, then
> deconfigure AXIINTCADDR and AXIINTCCONT to 0, which disables any
> pass through of MSI TLPs onto the AXI bus and then further into
> GIC ITS translation registers.
> 
> If MSI are enabled and iMSI-RX is not used, then configure AXIINTCADDR
> with target address of GIC ITS translation registers, and configure
> AXIINTCCONT to enable MSI TLP pass through onto AXI bus and into the
> GIC ITS. This specific configuration allows handling of MSI via the
> GIC ITS instead of integrated iMSI-RX.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
> ---
> NOTE: This would not be possible without prior work from Shimoda-san
> ---
> Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Conor Dooley <conor+dt@kernel.org>
> Cc: Geert Uytterhoeven <geert+renesas@glider.be>
> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Cc: Manivannan Sadhasivam <mani@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Rob Herring <robh@kernel.org>
> Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Cc: devicetree@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Cc: linux-renesas-soc@vger.kernel.org
> ---
>  drivers/pci/controller/dwc/pcie-rcar-gen4.c | 53 +++++++++++++++++++--
>  1 file changed, 48 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-rcar-gen4.c b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> index 485cfa8bd9692..ba6e3bedd6d0a 100644
> --- a/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> +++ b/drivers/pci/controller/dwc/pcie-rcar-gen4.c
> @@ -31,6 +31,10 @@
>  #define DEVICE_TYPE_RC		BIT(4)
>  #define BIFUR_MOD_SET_ON	BIT(0)
>  
> +/* MSI Capability */
> +#define MSICAP0			0x0050
> +#define MSICAP0_MSIE		BIT(16)
> +
>  /* PCIe Interrupt Status 0 */
>  #define PCIEINTSTS0		0x0084
>  
> @@ -55,6 +59,16 @@
>  #define APP_HOLD_PHY_RST	BIT(16)
>  #define APP_LTSSM_ENABLE	BIT(0)
>  
> +/* INTC address */
> +#define AXIINTCADDR		0x0a00
> +/* GITS GIC ITS translation register */
> +#define AXIINTCADDR_VAL		0xf1050000

Wouldn't it be preferable to source the address from the device tree,
rather than hardcoding this?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH 2/3] irqchip/gic-v3: Add Renesas R-Car Gen4 erratum workaround
From: Marc Zyngier @ 2026-06-17  7:24 UTC (permalink / raw)
  To: Marek Vasut
  Cc: linux-pci, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Rob Herring, devicetree, linux-arm-kernel, linux-doc,
	linux-kernel, linux-renesas-soc
In-Reply-To: <20260617030008.154449-2-marek.vasut+renesas@mailbox.org>

On Wed, 17 Jun 2026 03:59:45 +0100,
Marek Vasut <marek.vasut+renesas@mailbox.org> wrote:
> 
> Renesas R-Car S4/V4H/V4M GIC600 integration has address width for AXI
> or APB interface configured to 32 bit, it can therefore access only
> the first 4 GiB of physical address space. This information comes from
> R-Car V4H Interface Specification sheet, there is currently no technical
> update number assigned to this limitation. Further input from hardware
> engineer indicates that this limitation also applies to R-Car S4 and V4M.
> Name the limitation GEN4GICITS1, and add a driver quirk to mitigate this
> limitation.
> 
> Note that the 0x0201743b GIC600 ID is not Renesas-specific, it is
> common for many ARM GICv3 implementations. Therefore, add an extra

Not quite. It designates GIC600 unambiguously. It is just that GIC600
is integrated in zillions of SoCs, most of which don't have this
problem (the machine I'm typing this from has a GIC600 *and* 96GB of
RAM).

> of_machine_is_compatible() check.
> 
> The GIC600 implementation in R-Car S4/V4H/V4M is r1p6.

Is this relevant?

> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
> ---
> NOTE: This would not be possible without prior work from Shimoda-san
>       https://lore.kernel.org/all/20240214052050.1966439-1-yoshihiro.shimoda.uh@renesas.com/
> ---
> Cc: "Krzysztof Wilczyński" <kwilczynski@kernel.org>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Conor Dooley <conor+dt@kernel.org>
> Cc: Geert Uytterhoeven <geert+renesas@glider.be>
> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Cc: Manivannan Sadhasivam <mani@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Rob Herring <robh@kernel.org>
> Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Cc: devicetree@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Cc: linux-renesas-soc@vger.kernel.org
> ---
>  Documentation/arch/arm64/silicon-errata.rst |  1 +
>  arch/arm64/Kconfig                          |  9 +++++++++
>  drivers/irqchip/irq-gic-v3-its.c            | 20 ++++++++++++++++++++
>  3 files changed, 30 insertions(+)
> 
> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
> index 014aa1c215a16..b0c68b64f5ac2 100644
> --- a/Documentation/arch/arm64/silicon-errata.rst
> +++ b/Documentation/arch/arm64/silicon-errata.rst
> @@ -352,6 +352,7 @@ stable kernels.
>  +----------------+-----------------+-----------------+-----------------------------+
>  | Qualcomm Tech. | Kryo4xx Gold    | N/A             | ARM64_ERRATUM_1286807       |
>  +----------------+-----------------+-----------------+-----------------------------+
> +| Renesas        | S4/V4H/V4M      | N/A             | RENESAS_ERRATUM_GEN4GICITS1 |
>  +----------------+-----------------+-----------------+-----------------------------+
>  | Rockchip       | RK3588          | #3588001        | ROCKCHIP_ERRATUM_3588001    |
>  +----------------+-----------------+-----------------+-----------------------------+
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index b3afe0688919b..b9e17ce475e61 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1382,6 +1382,15 @@ config NVIDIA_CARMEL_CNP_ERRATUM
>  
>  	  If unsure, say Y.
>  
> +config RENESAS_ERRATUM_GEN4GICITS1
> +	bool "Renesas R-Car Gen4: GIC600 can not access physical addresses above 4 GiB"
> +	default y
> +	help
> +	  The Renesas R-Car Gen4 S4/V4H/V4M GIC600 SoC integrations have AXI
> +	  addressing limited to the first 32-bit of physical address space.
> +
> +	  If unsure, say Y.
> +
>  config ROCKCHIP_ERRATUM_3568002
>  	bool "Rockchip 3568002: GIC600 can not access physical addresses higher than 4GB"
>  	default y
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index b57d81ad33a0a..ec3756f29cf1a 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -4901,6 +4901,18 @@ static bool __maybe_unused its_enable_rk3568002(void *data)
>  	return true;
>  }
>  
> +static bool __maybe_unused its_enable_renesas_gen4(void *data)
> +{
> +	if (!of_machine_is_compatible("renesas,r8a779f0") &&
> +	    !of_machine_is_compatible("renesas,r8a779g0") &&
> +	    !of_machine_is_compatible("renesas,r8a779h0"))
> +		return false;
> +
> +	gfp_flags_quirk |= GFP_DMA32;
> +
> +	return true;
> +}
> +
>  static const struct gic_quirk its_quirks[] = {
>  #ifdef CONFIG_CAVIUM_ERRATUM_22375
>  	{
> @@ -4975,6 +4987,14 @@ static const struct gic_quirk its_quirks[] = {
>  		.mask   = 0xffffffff,
>  		.init   = its_enable_rk3568002,
>  	},
> +#endif
> +#ifdef CONFIG_RENESAS_ERRATUM_GEN4GICITS1
> +	{
> +		.desc   = "ITS: Renesas R-Car Gen4 GIC600 32-bit limit",
> +		.iidr   = 0x0201743b,
> +		.mask   = 0xffffffff,
> +		.init   = its_enable_renesas_gen4,
> +	},
>  #endif
>  	{
>  	}


Honestly, that's a bit too much copy-paste for my taste. Just refactor
the erratum handling to be more generic, something like this:

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 291d7668cc8da..380c4758647d2 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -4894,10 +4894,17 @@ static bool __maybe_unused its_enable_quirk_hip09_162100801(void *data)
 	return true;
 }
 
-static bool __maybe_unused its_enable_rk3568002(void *data)
+static const char * const dma_impaired_platforms[] = {
+#ifdef CONFIG_ROCKCHIP_ERRATUM_3568002
+	"rockchip,rk3566",
+	"rockchip,rk3568",
+#endif
+	NULL,
+};
+
+static bool __maybe_unused its_enable_dma32(void *data)
 {
-	if (!of_machine_is_compatible("rockchip,rk3566") &&
-	    !of_machine_is_compatible("rockchip,rk3568"))
+	if (!of_machine_compatible_match(dma_impaired_platforms))
 		return false;
 
 	gfp_flags_quirk |= GFP_DMA32;
@@ -4972,14 +4979,12 @@ static const struct gic_quirk its_quirks[] = {
 		.property = "dma-noncoherent",
 		.init   = its_set_non_coherent,
 	},
-#ifdef CONFIG_ROCKCHIP_ERRATUM_3568002
 	{
-		.desc   = "ITS: Rockchip erratum RK3568002",
+		.desc   = "ITS: Broken GIC600 integration limited to 32bit PA",
 		.iidr   = 0x0201743b,
 		.mask   = 0xffffffff,
-		.init   = its_enable_rk3568002,
+		.init   = its_enable_dma32,
 	},
-#endif
 	{
 	}
 };

Then add the two lines you need in a separate patch.
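
The table-driven shape of that refactor can be tried out in userspace.
The sketch below is an assumption-laden model: demo_compatible_match() is
a hypothetical stand-in for of_machine_compatible_match(), and the
machine is reduced to a single compatible string, whereas a real device
tree root node carries a list of them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Walk a NULL-terminated list of compatible strings. */
static bool demo_compatible_match(const char *machine,
				  const char *const *compats)
{
	for (; *compats; compats++) {
		if (!strcmp(machine, *compats))
			return true;
	}
	return false;
}

/* One shared table; supporting another SoC only adds entries here. */
static const char *const demo_dma_impaired[] = {
	"rockchip,rk3566",
	"rockchip,rk3568",
	NULL,
};

/* Mirrors the quirk's decision: restrict allocations to 32-bit PA? */
static bool demo_needs_dma32(const char *machine)
{
	return demo_compatible_match(machine, demo_dma_impaired);
}
```

The point of the design is that the quirk function and the its_quirks[]
entry stay unchanged when a platform is added; only the string table
grows.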

In the future, please provide a cover letter when you have more than a
single patch (git will happily generate one for you).

Thanks,

	M.

* Re: [PATCH 2/3] irqchip/gic-v3: Add Renesas R-Car Gen4 erratum workaround
From: Geert Uytterhoeven @ 2026-06-17  7:09 UTC (permalink / raw)
  To: Marek Vasut
  Cc: linux-pci, Yoshihiro Shimoda, Krzysztof Wilczyński,
	Bjorn Helgaas, Catalin Marinas, Conor Dooley, Geert Uytterhoeven,
	Krzysztof Kozlowski, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Marc Zyngier, Rob Herring, devicetree, linux-arm-kernel,
	linux-doc, linux-kernel, linux-renesas-soc
In-Reply-To: <20260617030008.154449-2-marek.vasut+renesas@mailbox.org>

Hi Marek,

On Wed, 17 Jun 2026 at 05:00, Marek Vasut
<marek.vasut+renesas@mailbox.org> wrote:
> Renesas R-Car S4/V4H/V4M GIC600 integration has address width for AXI
> or APB interface configured to 32 bit, it can therefore access only
> the first 4 GiB of physical address space. This information comes from
> R-Car V4H Interface Specification sheet, there is currently no technical
> update number assigned to this limitation. Further input from hardware
> engineer indicates that this limitation also applies to R-Car S4 and V4M.
> Name the limitation GEN4GICITS1, and add a driver quirk to mitigate this
> limitation.
>
> Note that the 0x0201743b GIC600 ID is not Renesas-specific, it is
> common for many ARM GICv3 implementations. Therefore, add an extra
> of_machine_is_compatible() check.
>
> The GIC600 implementation in R-Car S4/V4H/V4M is r1p6.
>
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>

Thanks for your patch!

> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -4901,6 +4901,18 @@ static bool __maybe_unused its_enable_rk3568002(void *data)
>         return true;
>  }
>
> +static bool __maybe_unused its_enable_renesas_gen4(void *data)
> +{
> +       if (!of_machine_is_compatible("renesas,r8a779f0") &&
> +           !of_machine_is_compatible("renesas,r8a779g0") &&
> +           !of_machine_is_compatible("renesas,r8a779h0"))

of_machine_compatible_match() with an array of strings might generate
smaller code (I didn't check if 3 entries is enough to trip the balance).

> +               return false;
> +
> +       gfp_flags_quirk |= GFP_DMA32;
> +
> +       return true;
> +}
> +
>  static const struct gic_quirk its_quirks[] = {
>  #ifdef CONFIG_CAVIUM_ERRATUM_22375
>         {

Gr{oetje,eeting}s,

                        Geert

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-17  6:19 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Jianyue Wu, Christoph Hellwig, Andrew Morton, Chris Li,
	Baoquan He, Nhat Pham, Barry Song, Kairui Song, Kemeng Shi,
	Youngjun Park, Minchan Kim, Jens Axboe, Matthew Wilcox (Oracle),
	Jan Kara, linux-mm, linux-kernel, linux-block, linux-doc,
	Brian Geffon
In-Reply-To: <ajIYFtADxQDq8q1P@google.com>

On Wed, Jun 17, 2026 at 12:46:53PM +0900, Sergey Senozhatsky wrote:
> Those are fantastic questions, thank you for asking them.
> Can we elaborate on zram being a "legacy interface"?

Compression is functionality that fundamentally belongs in the core
swap code, not in a virtual block device.  Between the backing-store-less
zswap and the virtual swap layer, the core swap code is now getting
to the point where we don't need to rely on hacks like a compressing
ramdisk.

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-17  6:17 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Christoph Hellwig, Andrew Morton, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <CAJxJ_jhK+zkpjhs3YsQ9RoasKYh+E0NweQci0sPAEY1ne5LmBA@mail.gmail.com>

On Wed, Jun 17, 2026 at 11:38:02AM +0800, Jianyue Wu wrote:
> Before I rework or drop the RFC, could you outline how you see that
> core-side model working? In particular:
>   - How should a compressed backend like zram or future block device
>     plug into swap_iocb / swap_ops?

I don't think that is the right layer.  The virtual swap layer that is
currently in the process of being upstreamed is the right level, and
the actual swap devices or swap files are just a dumb backend for what
the higher level code does.

>   - What role do you expect zram to keep while the legacy block interface
>     remains: current block swap only, or something else?

For now we'll need to keep it working as-is.  It is heavily used in
Android and potentially elsewhere.  Once we have zswap fully working
in the virtual swap layer world, it might make sense to never
compress again in zram when REQ_SWAP is set (or maybe a new
REQ_COMPRESSED) so that we can use the core compression code without
breaking existing setups.



* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Sergey Senozhatsky @ 2026-06-17  6:10 UTC (permalink / raw)
  To: Jianyue Wu, Christoph Hellwig
  Cc: Sergey Senozhatsky, Andrew Morton, Chris Li, Baoquan He,
	Nhat Pham, Barry Song, Kairui Song, Kemeng Shi, Youngjun Park,
	Minchan Kim, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc, Brian Geffon
In-Reply-To: <CAJxJ_jiM_-a52EOm896FXkdH+wRxjSHJx+MW6b-ewNLVkp4uSw@mail.gmail.com>

Hi,

On (26/06/17 13:44), Jianyue Wu wrote:
> Hello Sergey,
> 
> On Wed, Jun 17, 2026 at 11:46 AM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> > Can we elaborate on zram being a "legacy interface"?
> My previous wording was ambiguous. Actually I didn't mean it is a
> legacy interface.

Oh, your wording wasn't ambiguous.  I simply forgot to direct my
previous email to Christoph.


* Re: [PATCH 7/7] hwmon: adm1275: Support module auto-loading
From: Matti Vaittinen @ 2026-06-17  6:00 UTC (permalink / raw)
  To: Guenter Roeck, Matti Vaittinen, Matti Vaittinen
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, Wensheng Wang, Ashish Yadav, Kim Seer Paller,
	Cedric Encarnacion, Chris Packham, Yuxi Wang, Charles Hsu,
	ChiShih Tsai, linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <f080e20e-6ec7-4744-9794-0a92d03f48d8@roeck-us.net>

On 16/06/2026 17:04, Guenter Roeck wrote:
> On 6/15/26 23:47, Matti Vaittinen wrote:
>> From: Matti Vaittinen <mazziesaccount@gmail.com>
>>
>> Populating the spi_device_id table is not enough to make the
>> driver module load automatically when the device-tree node for the
>> bd12780 is parsed at boot.
>>
>> Adding the of_device_id tables causes the driver module to be
>> automatically loaded at boot. Testing has been done with a rather old
>> Debian system.
>>
>> When inspecting the generated module-aliases with the insmod, the
>> following entries seem to be the difference:
>>
>> alias:          of:N*T*Crohm,bd12780C*
>> alias:          of:N*T*Crohm,bd12780
>>
>> I suspect these are required for the module loading to work.
>>
>> Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
>>
>> ---
>>
>> I did not add of_device_ids for other supported ICs as I can't verify it
>> doesn't cause side-effects. Please let me know if you think those IDs
>> should be added as well. I would be glad to get a more educated opinion
>> on adding the of-IDs :) (I can squash this to 3/7 and 6/7 in the next
>> revision, and add a separate patch for adding of-IDs for other ICs if
>> required).
>>
> 
> I don't know what those side effects might be. I am much more concerned
> about side effects of having some of the devices in adm1275_of_match
> and some in adm1275_id. So, yes, please add a patch to provide
> adm1275_of_match for all chips supported by the driver.
It's nice to have an opinion on this as I was really unsure what the
right way forward is. Thanks for all the help so far. I'll do that in v2.
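
A minimal userspace sketch of why the alias entries quoted above matter,
assuming that module lookup compares the device's modalias string against
each alias as a shell-style glob (approximated here with fnmatch(3)). The
two patterns are copied from the commit message; the function name and the
modalias strings in the note below are hypothetical, and the exact modalias
a given kernel emits may differ.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Alias patterns as listed in the quoted commit message. */
static const char *const aliases[] = {
	"of:N*T*Crohm,bd12780C*",
	"of:N*T*Crohm,bd12780",
};

/*
 * Return 1 if any alias pattern matches the modalias string, else 0.
 * fnmatch() stands in for the real module-lookup logic (an assumption).
 */
static int alias_matches(const char *modalias)
{
	size_t i;

	assert(modalias != NULL);

	for (i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++)
		if (fnmatch(aliases[i], modalias, 0) == 0)
			return 1;
	return 0;
}
```

With these patterns, alias_matches("of:NhotswapT(null)Crohm,bd12780")
returns 1, while a node that only carries another compatible, for example
"of:NhotswapT(null)Cadi,adm1272", returns 0 and would not be matched by
these two aliases.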

Yours,
	-- Matti

-- 
Matti Vaittinen
Linux kernel developer at ROHM Semiconductors
Oulu Finland

~~ When things go utterly wrong vim users can always type :help! ~~


* Re: [PATCH 6/7] hwmon: adm1275: Support ROHM BD12790
From: Matti Vaittinen @ 2026-06-17  5:57 UTC (permalink / raw)
  To: Guenter Roeck, Matti Vaittinen, Matti Vaittinen
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, Wensheng Wang, Ashish Yadav, Kim Seer Paller,
	Cedric Encarnacion, Chris Packham, Yuxi Wang, Charles Hsu,
	ChiShih Tsai, linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <d66b9de3-db06-4f83-9c2a-b45e341bfc9c@roeck-us.net>

On 16/06/2026 17:15, Guenter Roeck wrote:
> On 6/15/26 23:44, Matti Vaittinen wrote:
>> From: Matti Vaittinen <mazziesaccount@gmail.com>
>>
>> Add support for the ROHM BD12790 hot-swap controller, which is largely
>> similar to the Analog Devices ADM1272.
>>
>> The BD12790 uses the same selectable 60V/100V voltage ranges and
>> 15mV/30mV current-sense ranges as the ADM1272, and the same VRANGE
>> (bit 5) and IRANGE (bit 0) layout in PMON_CONFIG. It therefore uses
>> a dedicated coefficient table that mirrors adm1272_coefficients, with
>> the following differences derived from BD12790 datasheet Table 1 (p.18):
>> - power 60V/30mV: m=17560 (vs. 17561)
>> - power 100V/30mV: m=10536 (vs. 10535)
>> - temperature: b=31880 (vs. 31871, reflecting T[11:0] = 4.2*T + 3188)
>>
>> Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
>> Assisted-by: GitHub Copilot:claude-sonnet-4.6
>>
>> ---
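
A minimal sketch of how the temperature coefficient above can be read,
assuming the PMBus "direct" data format Y = (m * X + b) * 10^R. Only
b=31880 and the relation T[11:0] = 4.2*T + 3188 come from the commit
message; m=42 and R=-1 are inferred from that relation, and the struct and
function names are hypothetical.

```c
/* PMBus "direct" format coefficients: Y = (m * X + b) * 10^R. */
struct direct_coeff {
	int m;
	int b;
	int R;
};

/* Scale v by 10^exp with repeated multiply/divide (no libm needed). */
static double scale_pow10(double v, int exp)
{
	for (; exp > 0; exp--)
		v *= 10.0;
	for (; exp < 0; exp++)
		v /= 10.0;
	return v;
}

/* Encode a real-world value X into a raw register value Y. */
static double direct_encode(const struct direct_coeff *c, double x)
{
	return scale_pow10(c->m * x + c->b, c->R);
}

/* Decode a raw register value Y back into a real-world value X. */
static double direct_decode(const struct direct_coeff *c, double y)
{
	return (scale_pow10(y, -c->R) - c->b) / c->m;
}

/* Assumed BD12790 temperature coefficients: m and R are inferred. */
static const struct direct_coeff bd12790_temp = { 42, 31880, -1 };
```

Under these assumptions, direct_encode(&bd12790_temp, 25.0) gives
(42 * 25 + 31880) / 10 = 3293, which agrees with 4.2 * 25 + 3188, and
direct_decode(&bd12790_temp, 3293.0) gives back 25.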

// snip

>> -/* The BD12780 data sheets mark TSFILT bit as reserved. */
>> -#define BD12780_PMON_DEFCONFIG        (ADM1278_VOUT_EN | ADM1278_TEMP1_EN)
>> +/* The BD127x0 data sheets mark TSFILT bit as reserved. */
>> +#define BD127X0_PMON_DEFCONFIG        (ADM1278_VOUT_EN | ADM1278_TEMP1_EN)
> 
> Please don't use such placeholders. Just use BD12780_PMON_DEFCONFIG
> for both chips, similar to how the defines for all other chips
> are handled.

Ok, thanks.

Yours,
	-- Matti

-- 
Matti Vaittinen
Linux kernel developer at ROHM Semiconductors
Oulu Finland

~~ When things go utterly wrong vim users can always type :help! ~~


