Linux Documentation
* [PATCH v7 2/2] hwmon: add AMD Promontory 21 xHCI temperature sensor support
From: Jihong Min @ 2026-05-19  0:07 UTC
  To: Greg Kroah-Hartman, Mathias Nyman
  Cc: Guenter Roeck, Jonathan Corbet, Shuah Khan, Mario Limonciello,
	Basavaraj Natikar, Michal Pecio, Mario Limonciello,
	Yaroslav Isakov, linux-usb, linux-hwmon, linux-doc, linux-pci,
	linux-kernel, Jihong Min
In-Reply-To: <20260519000732.2334711-1-hurryman2212@gmail.com>

Add an auxiliary-bus hwmon driver for the temperature sensor exposed by
AMD Promontory 21 (PROM21) xHCI PCI functions. The driver binds to the
"hwmon" auxiliary device published by the PROM21 xHCI PCI glue and
exposes the sensor as temp1_input under the prom21_xhci hwmon device.

The sensor is accessed through a PROM21 vendor index/data register pair
in the xHCI PCI MMIO BAR. The driver consumes parent-provided MMIO data
from the PROM21 PCI glue instead of inspecting the parent PCI driver's
drvdata. The read path restores the previous vendor index value after
sampling and does not runtime-resume the parent PCI device; reads from a
suspended parent return -ENODATA.

Document the supported device, register access, runtime PM behavior, and
sysfs lookup method. The documentation also records the observation
method used to identify the register pair and derive the conversion
formula.

Assisted-by: Codex:gpt-5.5
Signed-off-by: Jihong Min <hurryman2212@gmail.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Yaroslav Isakov <yaroslav.isakov@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
---
 Documentation/hwmon/index.rst       |   1 +
 Documentation/hwmon/prom21-xhci.rst | 101 ++++++++++++
 drivers/hwmon/Kconfig               |  10 ++
 drivers/hwmon/Makefile              |   1 +
 drivers/hwmon/prom21-xhci.c         | 239 ++++++++++++++++++++++++++++
 5 files changed, 352 insertions(+)
 create mode 100644 Documentation/hwmon/prom21-xhci.rst
 create mode 100644 drivers/hwmon/prom21-xhci.c

diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 8b655e5d6b68..324208f1faa2 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -216,6 +216,7 @@ Hardware Monitoring Kernel Drivers
    pmbus
    powerz
    powr1220
+   prom21-xhci
    pt5161l
    pxe1610
    pwm-fan
diff --git a/Documentation/hwmon/prom21-xhci.rst b/Documentation/hwmon/prom21-xhci.rst
new file mode 100644
index 000000000000..7984fb187bd8
--- /dev/null
+++ b/Documentation/hwmon/prom21-xhci.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel driver prom21-xhci
+=========================
+
+Supported chips:
+
+  * AMD Promontory 21 (PROM21) xHCI USB host controller
+
+    Prefix: 'prom21_xhci'
+
+    PCI IDs: 1022:43fc, 1022:43fd
+
+Author:
+
+  - Jihong Min <hurryman2212@gmail.com>
+
+Description
+-----------
+
+This driver exposes the temperature sensor in AMD PROM21 xHCI controllers.
+
+The driver binds to an auxiliary device that the ``xhci-pci-prom21`` PCI glue
+driver creates for supported controllers. USB host operation is otherwise
+delegated to the common ``xhci-pci`` code. The sensor value is accessed
+through a vendor-specific index/data register pair in the controller's PCI
+MMIO BAR.
+
+PROM21 is an AMD chipset IP used in single-chip or daisy-chained configurations
+to build AMD 6xx/8xx series chipsets. Since the xHCI controllers are
+integrated in PROM21, this reading can also serve as an approximation of the
+AMD chipset temperature.
+
+Register access
+---------------
+
+The temperature value is read through a vendor-specific index/data register
+pair in the xHCI PCI MMIO BAR. The driver uses the following byte offsets from
+the MMIO BAR base:
+
+======================= =====================================================
+0x3000			Vendor index register
+0x3008			Vendor data register
+======================= =====================================================
+
+The driver saves the current vendor index register value, writes the
+temperature selector ``0x0001e520`` to the vendor index register, reads the
+vendor data register, and restores the previous vendor index value before
+returning. The raw temperature value is the low 8 bits of the vendor data
+register value.
+
+The hwmon core serializes this driver's callbacks, and the driver restores the
+previous index value after each read. This does not provide synchronization
+with firmware, SMM, ACPI AML, or any other user outside this driver.
+
+No public AMD reference is available for the register pair or the raw value.
+The register pair was identified on an X870E system with two PROM21 xHCI
+controllers. One controller was passed through to a Windows VM, and the same
+controller's PCI MMIO BAR was observed from the Linux host while HWiNFO64 was
+reporting the PROM21 xHCI temperature. In the test environment, the reported
+temperature was very stable at idle and the displayed sensor resolution was
+low, which made it possible to look for a consistently repeating MMIO response
+for the same reported temperature. During observation, offset 0x3000 repeatedly
+contained selector ``0x0001e520``. Writing the same selector to offset 0x3000
+from Linux and then reading offset 0x3008 reproduced the same raw value, so the
+offsets are treated as a vendor index/data register pair.
+
+The conversion formula was empirically inferred by matching observed raw
+8-bit values against HWiNFO64's reported PROM21 xHCI temperature for the same
+controller. The observed mapping is:
+
+  temp[C] = raw * 0.9066 - 78.624
+
+Runtime PM
+----------
+
+The driver does not wake the xHCI PCI device for hwmon reads. It reads the
+temperature only when the parent device is already active. A read from a
+suspended device returns ``-ENODATA``. After a successful read, the driver
+drops its active-only runtime PM reference and lets the PM core re-evaluate the
+idle state.
+
+Sysfs entries
+-------------
+
+======================= =====================================================
+temp1_input		Temperature in millidegrees Celsius
+======================= =====================================================
+
+The hwmon device name is ``prom21_xhci``. The sysfs path depends on the hwmon
+device number assigned by the kernel. Userspace can locate the device by
+matching the ``name`` attribute:
+
+.. code-block:: sh
+
+   for hwmon in /sys/class/hwmon/hwmon*; do
+           [ "$(cat "$hwmon/name")" = "prom21_xhci" ] || continue
+           cat "$hwmon/temp1_input"
+   done
+
+If the raw register value is zero, ``temp1_input`` returns ``-ENODATA``.
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index 14e4cea48acc..1e770b548810 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -951,6 +951,16 @@ config SENSORS_POWR1220
 	  This driver can also be built as a module. If so, the module
 	  will be called powr1220.
 
+config SENSORS_PROM21_XHCI
+	tristate "AMD Promontory 21 xHCI temperature sensor"
+	depends on USB_XHCI_PCI
+	help
+	  If you say yes here you get support for the AMD Promontory 21
+	  (PROM21) xHCI temperature sensor.
+
+	  This driver can also be built as a module. If so, the module
+	  will be called prom21-xhci.
+
 config SENSORS_LAN966X
 	tristate "Microchip LAN966x Hardware Monitoring"
 	depends on SOC_LAN966 || COMPILE_TEST
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index 982ee2c6f9de..f833aed890d8 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -196,6 +196,7 @@ obj-$(CONFIG_SENSORS_PC87427)	+= pc87427.o
 obj-$(CONFIG_SENSORS_PCF8591)	+= pcf8591.o
 obj-$(CONFIG_SENSORS_POWERZ)	+= powerz.o
 obj-$(CONFIG_SENSORS_POWR1220)  += powr1220.o
+obj-$(CONFIG_SENSORS_PROM21_XHCI)	+= prom21-xhci.o
 obj-$(CONFIG_SENSORS_PT5161L)	+= pt5161l.o
 obj-$(CONFIG_SENSORS_PWM_FAN)	+= pwm-fan.o
 obj-$(CONFIG_SENSORS_QNAP_MCU_HWMON)	+= qnap-mcu-hwmon.o
diff --git a/drivers/hwmon/prom21-xhci.c b/drivers/hwmon/prom21-xhci.c
new file mode 100644
index 000000000000..d40d0c53ce45
--- /dev/null
+++ b/drivers/hwmon/prom21-xhci.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Promontory 21 xHCI Hwmon Implementation
+ * (only temperature monitoring is supported)
+ *
+ * This can effectively be used as an alternative chipset temperature monitor.
+ *
+ * Copyright (C) 2026 Jihong Min <hurryman2212@gmail.com>
+ */
+
+#include <linux/auxiliary_bus.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/hwmon.h>
+#include <linux/io.h>
+#include <linux/math.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/platform_data/usb-xhci-prom21.h>
+#include <linux/pm_runtime.h>
+
+#define PROM21_XHCI_INDEX_OFFSET	0x3000
+#define PROM21_XHCI_DATA_OFFSET		0x3008
+#define PROM21_XHCI_TEMP_SELECTOR	0x0001e520
+
+struct prom21_xhci {
+	struct pci_dev *pdev;
+	struct device *hwmon_dev;
+	void __iomem *regs;
+};
+
+static int prom21_xhci_pm_get(struct prom21_xhci *hwmon)
+{
+	struct device *dev = &hwmon->pdev->dev;
+	int ret;
+
+	/*
+	 * PROM21 temperature register access does not return a valid value while
+	 * the parent xHCI PCI function is suspended. Do not wake the device from
+	 * a hwmon read. On success, hold a usage reference without changing the
+	 * runtime PM state; if runtime PM is disabled, allow the read unless the
+	 * device is still marked suspended.
+	 */
+	ret = pm_runtime_get_if_active(dev);
+	if (ret > 0)
+		return 0;
+
+	if (ret == -EINVAL) {
+		if (pm_runtime_status_suspended(dev))
+			return -ENODATA;
+
+		pm_runtime_get_noresume(dev);
+		return 0;
+	}
+
+	if (!ret)
+		return -ENODATA;
+
+	return ret;
+}
+
+/*
+ * This is not a pure MMIO read. The PROM21 vendor data register is selected
+ * by temporarily writing PROM21_XHCI_TEMP_SELECTOR to the vendor index
+ * register.
+ * The hwmon core already serializes this driver's callbacks, so this driver
+ * does not need an additional private lock. That does not synchronize with
+ * firmware, SMM, ACPI, or other possible users. Keep the sequence short and
+ * restore the previous index before returning.
+ */
+static int prom21_xhci_read_temp_raw_restore_index(struct prom21_xhci *hwmon,
+						   u8 *raw)
+{
+	struct device *dev = &hwmon->pdev->dev;
+	u32 index;
+	u8 data;
+	int ret;
+
+	ret = prom21_xhci_pm_get(hwmon);
+	if (ret)
+		return ret;
+
+	index = readl(hwmon->regs + PROM21_XHCI_INDEX_OFFSET);
+	/* Select the PROM21 temperature register through the vendor index. */
+	writel(PROM21_XHCI_TEMP_SELECTOR,
+	       hwmon->regs + PROM21_XHCI_INDEX_OFFSET);
+	/* Use a 32-bit read for PCI MMIO register access. */
+	data = readl(hwmon->regs + PROM21_XHCI_DATA_OFFSET) & 0xff;
+	/* Restore the previous vendor index register value. */
+	writel(index, hwmon->regs + PROM21_XHCI_INDEX_OFFSET);
+	readl(hwmon->regs + PROM21_XHCI_INDEX_OFFSET);
+
+	/*
+	 * Drop the usage reference taken by prom21_xhci_pm_get(). This is
+	 * enough because the read path never resumes the device; use the normal
+	 * put path so the PM core can re-evaluate idle state after the read.
+	 * Otherwise, a racing xHCI autosuspend attempt can see a nonzero
+	 * runtime PM usage count and skip autosuspend, and a later
+	 * pm_runtime_put_noidle(), which does not check for an idle device,
+	 * would leave the device active.
+	 */
+	pm_runtime_put(dev);
+
+	if (!data)
+		return -ENODATA;
+
+	*raw = data;
+	return 0;
+}
+
+static long prom21_xhci_raw_to_millicelsius(u8 raw)
+{
+	/*
+	 * No public AMD reference is available for this value.
+	 * The scale was derived from observed PROM21 xHCI temperature readings:
+	 *  temp[C] = raw * 0.9066 - 78.624
+	 */
+	return DIV_ROUND_CLOSEST(raw * 9066, 10) - 78624;
+}
+
+static umode_t prom21_xhci_is_visible(const void *drvdata,
+				      enum hwmon_sensor_types type, u32 attr,
+				      int channel)
+{
+	if (type != hwmon_temp)
+		return 0;
+
+	switch (attr) {
+	case hwmon_temp_input:
+		return 0444;
+	default:
+		return 0;
+	}
+}
+
+static int prom21_xhci_read(struct device *dev, enum hwmon_sensor_types type,
+			    u32 attr, int channel, long *val)
+{
+	struct prom21_xhci *hwmon = dev_get_drvdata(dev);
+	u8 raw;
+	int ret;
+
+	if (type != hwmon_temp || attr != hwmon_temp_input)
+		return -EOPNOTSUPP;
+
+	ret = prom21_xhci_read_temp_raw_restore_index(hwmon, &raw);
+	if (ret)
+		return ret;
+
+	*val = prom21_xhci_raw_to_millicelsius(raw);
+	return 0;
+}
+
+static const struct hwmon_ops prom21_xhci_ops = {
+	.is_visible = prom21_xhci_is_visible,
+	.read = prom21_xhci_read,
+};
+
+static const struct hwmon_channel_info *const prom21_xhci_info[] = {
+	HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT),
+	NULL,
+};
+
+static const struct hwmon_chip_info prom21_xhci_chip_info = {
+	.ops = &prom21_xhci_ops,
+	.info = prom21_xhci_info,
+};
+
+static int prom21_xhci_probe(struct auxiliary_device *auxdev,
+			     const struct auxiliary_device_id *id)
+{
+	struct device *dev = &auxdev->dev;
+	const struct prom21_xhci_pdata *pdata = dev_get_platdata(dev);
+	struct prom21_xhci *hwmon;
+
+	if (!pdata)
+		return dev_err_probe(dev, -ENODEV,
+				     "platform data unavailable\n");
+
+	if (!pdata->regs ||
+	    pdata->rsrc_len < PROM21_XHCI_DATA_OFFSET + sizeof(u32))
+		return dev_err_probe(dev, -ENODEV, "invalid MMIO resource\n");
+
+	hwmon = devm_kzalloc(dev, sizeof(*hwmon), GFP_KERNEL);
+	if (!hwmon)
+		return -ENOMEM;
+
+	hwmon->pdev = pdata->pdev;
+	hwmon->regs = pdata->regs;
+	auxiliary_set_drvdata(auxdev, hwmon);
+
+	/*
+	 * Parent the hwmon device to the PCI function because the temperature
+	 * value is read from that function's MMIO BAR, and systems may contain
+	 * multiple PROM21 xHCI functions. This lets userspace identify the PCI
+	 * endpoint for each reading. The auxiliary driver still owns the hwmon
+	 * lifetime and unregisters it before HCD teardown.
+	 */
+	hwmon->hwmon_dev =
+		hwmon_device_register_with_info(&pdata->pdev->dev, "prom21_xhci",
+						hwmon, &prom21_xhci_chip_info,
+						NULL);
+	if (IS_ERR(hwmon->hwmon_dev))
+		return PTR_ERR(hwmon->hwmon_dev);
+
+	return 0;
+}
+
+static void prom21_xhci_remove(struct auxiliary_device *auxdev)
+{
+	struct prom21_xhci *hwmon = auxiliary_get_drvdata(auxdev);
+
+	/*
+	 * The PROM21 PCI glue destroys the auxiliary device before HCD teardown.
+	 * Unregister the hwmon device here so sysfs removes the attributes,
+	 * stops new reads, and drains active hwmon callbacks before the xHCI
+	 * MMIO mapping is released.
+	 */
+	hwmon_device_unregister(hwmon->hwmon_dev);
+}
+
+static const struct auxiliary_device_id prom21_xhci_id_table[] = {
+	{ .name = "xhci_pci_prom21.hwmon" },
+	{}
+};
+MODULE_DEVICE_TABLE(auxiliary, prom21_xhci_id_table);
+
+static struct auxiliary_driver prom21_xhci_driver = {
+	.name = "prom21-xhci",
+	.probe = prom21_xhci_probe,
+	.remove = prom21_xhci_remove,
+	.id_table = prom21_xhci_id_table,
+};
+module_auxiliary_driver(prom21_xhci_driver);
+
+MODULE_AUTHOR("Jihong Min <hurryman2212@gmail.com>");
+MODULE_DESCRIPTION("AMD Promontory 21 xHCI temperature sensor driver");
+MODULE_LICENSE("GPL");
-- 
2.53.0



* [PATCH v7 1/2] usb: xhci-pci: add AMD Promontory 21 PCI glue
From: Jihong Min @ 2026-05-19  0:07 UTC
  To: Greg Kroah-Hartman, Mathias Nyman
  Cc: Guenter Roeck, Jonathan Corbet, Shuah Khan, Mario Limonciello,
	Basavaraj Natikar, Michal Pecio, Mario Limonciello,
	Yaroslav Isakov, linux-usb, linux-hwmon, linux-doc, linux-pci,
	linux-kernel, Jihong Min
In-Reply-To: <20260519000732.2334711-1-hurryman2212@gmail.com>

AMD Promontory 21 (PROM21) xHCI PCI functions use the common xhci-pci
core for USB operation, but also expose controller-specific sensor data.
Add a small PROM21 PCI glue driver for AMD 1022:43fc and 1022:43fd
controllers.

The glue delegates USB host operation to the common xhci-pci core and
publishes a "hwmon" auxiliary device with parent-provided MMIO data.
Auxiliary device creation failure is logged but does not fail the xHCI
probe.

Make the PROM21 glue a hidden Kconfig tristate driven by the user-visible
SENSORS_PROM21_XHCI option. If sensor support is disabled, generic
xhci-pci binds PROM21 controllers normally. If sensor support is enabled
(y or m), the glue takes the same value as USB_XHCI_PCI.

This keeps the auxiliary device available for a modular sensor driver while
avoiding a built-in xhci-pci core handing PROM21 controllers to a glue
driver that is only available as a module during initramfs.

Assisted-by: Codex:gpt-5.5
Signed-off-by: Jihong Min <hurryman2212@gmail.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Yaroslav Isakov <yaroslav.isakov@gmail.com>
---
 drivers/usb/host/Kconfig                      |   6 +
 drivers/usb/host/Makefile                     |   1 +
 drivers/usb/host/xhci-pci-prom21.c            | 137 ++++++++++++++++++
 drivers/usb/host/xhci-pci.c                   |  11 ++
 drivers/usb/host/xhci-pci.h                   |   3 +
 include/linux/platform_data/usb-xhci-prom21.h |  22 +++
 6 files changed, 180 insertions(+)
 create mode 100644 drivers/usb/host/xhci-pci-prom21.c
 create mode 100644 include/linux/platform_data/usb-xhci-prom21.h

diff --git a/drivers/usb/host/Kconfig b/drivers/usb/host/Kconfig
index 0a277a07cf70..43f30cd867e4 100644
--- a/drivers/usb/host/Kconfig
+++ b/drivers/usb/host/Kconfig
@@ -42,6 +42,12 @@ config USB_XHCI_PCI
 	depends on USB_PCI
 	default y
 
+config USB_XHCI_PCI_PROM21
+	tristate
+	depends on USB_XHCI_PCI
+	default USB_XHCI_PCI if SENSORS_PROM21_XHCI != n
+	select AUXILIARY_BUS
+
 config USB_XHCI_PCI_RENESAS
 	tristate "Support for additional Renesas xHCI controller with firmware"
 	depends on USB_XHCI_PCI
diff --git a/drivers/usb/host/Makefile b/drivers/usb/host/Makefile
index a07e7ba9cd53..174580c1281a 100644
--- a/drivers/usb/host/Makefile
+++ b/drivers/usb/host/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_USB_UHCI_HCD)	+= uhci-hcd.o
 obj-$(CONFIG_USB_FHCI_HCD)	+= fhci.o
 obj-$(CONFIG_USB_XHCI_HCD)	+= xhci-hcd.o
 obj-$(CONFIG_USB_XHCI_PCI)	+= xhci-pci.o
+obj-$(CONFIG_USB_XHCI_PCI_PROM21)	+= xhci-pci-prom21.o
 obj-$(CONFIG_USB_XHCI_PCI_RENESAS)	+= xhci-pci-renesas.o
 obj-$(CONFIG_USB_XHCI_PLATFORM) += xhci-plat-hcd.o
 obj-$(CONFIG_USB_XHCI_HISTB)	+= xhci-histb.o
diff --git a/drivers/usb/host/xhci-pci-prom21.c b/drivers/usb/host/xhci-pci-prom21.c
new file mode 100644
index 000000000000..6486f4a09345
--- /dev/null
+++ b/drivers/usb/host/xhci-pci-prom21.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Promontory 21 xHCI host controller PCI Bus Glue.
+ *
+ * This does not add any PROM21-specific USB or xHCI operation. It exists only
+ * to publish an auxiliary device for integrated temperature sensor support.
+ *
+ * Copyright (C) 2026 Jihong Min <hurryman2212@gmail.com>
+ */
+
+#include <linux/auxiliary_bus.h>
+#include <linux/device/devres.h>
+#include <linux/errno.h>
+#include <linux/idr.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/platform_data/usb-xhci-prom21.h>
+#include <linux/usb.h>
+#include <linux/usb/hcd.h>
+
+#include "xhci-pci.h"
+
+struct prom21_xhci_auxdev {
+	struct auxiliary_device *auxdev;
+	struct prom21_xhci_pdata pdata;
+	int id;
+};
+
+static DEFINE_IDA(prom21_xhci_auxdev_ida);
+
+static void prom21_xhci_auxdev_release(struct device *dev, void *res)
+{
+	struct prom21_xhci_auxdev *prom21_auxdev = res;
+
+	auxiliary_device_destroy(prom21_auxdev->auxdev);
+	ida_free(&prom21_xhci_auxdev_ida, prom21_auxdev->id);
+}
+
+static int prom21_xhci_create_auxdev(struct pci_dev *pdev)
+{
+	struct prom21_xhci_auxdev *prom21_auxdev;
+	struct usb_hcd *hcd = pci_get_drvdata(pdev);
+	int ret;
+
+	prom21_auxdev = devres_alloc(prom21_xhci_auxdev_release,
+				     sizeof(*prom21_auxdev), GFP_KERNEL);
+	if (!prom21_auxdev)
+		return -ENOMEM;
+
+	prom21_auxdev->pdata.pdev = pdev;
+	prom21_auxdev->pdata.regs = hcd->regs;
+	prom21_auxdev->pdata.rsrc_len = hcd->rsrc_len;
+
+	prom21_auxdev->id = ida_alloc(&prom21_xhci_auxdev_ida, GFP_KERNEL);
+	if (prom21_auxdev->id < 0) {
+		ret = prom21_auxdev->id;
+		goto err_free_devres;
+	}
+
+	prom21_auxdev->auxdev = auxiliary_device_create(&pdev->dev,
+							KBUILD_MODNAME, "hwmon",
+							&prom21_auxdev->pdata,
+							prom21_auxdev->id);
+	if (!prom21_auxdev->auxdev) {
+		ret = -ENOMEM;
+		goto err_free_ida;
+	}
+
+	devres_add(&pdev->dev, prom21_auxdev);
+	return 0;
+
+err_free_ida:
+	ida_free(&prom21_xhci_auxdev_ida, prom21_auxdev->id);
+err_free_devres:
+	devres_free(prom21_auxdev);
+	return ret;
+}
+
+static void prom21_xhci_destroy_auxdev(struct pci_dev *pdev)
+{
+	devres_release(&pdev->dev, prom21_xhci_auxdev_release, NULL, NULL);
+}
+
+static int prom21_xhci_probe(struct pci_dev *dev,
+			     const struct pci_device_id *id)
+{
+	int retval;
+
+	retval = xhci_pci_common_probe(dev, id);
+	if (retval)
+		return retval;
+
+	retval = prom21_xhci_create_auxdev(dev);
+	if (retval) {
+		/*
+		 * The auxiliary device only provides optional temperature sensor
+		 * support. Keep the xHCI controller usable if it fails.
+		 */
+		dev_err(&dev->dev,
+			"failed to create PROM21 hwmon auxiliary device: %d\n",
+			retval);
+	}
+
+	return 0;
+}
+
+static void prom21_xhci_remove(struct pci_dev *dev)
+{
+	prom21_xhci_destroy_auxdev(dev);
+	xhci_pci_remove(dev);
+}
+
+static const struct pci_device_id pci_ids[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_PROM21_XHCI_43FC) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_PROM21_XHCI_43FD) },
+	{ /* end: all zeroes */ }
+};
+MODULE_DEVICE_TABLE(pci, pci_ids);
+
+static struct pci_driver prom21_xhci_driver = {
+	.name = "xhci-pci-prom21",
+	.id_table = pci_ids,
+
+	.probe = prom21_xhci_probe,
+	.remove = prom21_xhci_remove,
+
+	.shutdown = usb_hcd_pci_shutdown,
+	.driver = {
+		.pm = pm_ptr(&usb_hcd_pci_pm_ops),
+	},
+};
+module_pci_driver(prom21_xhci_driver);
+
+MODULE_AUTHOR("Jihong Min <hurryman2212@gmail.com>");
+MODULE_DESCRIPTION("AMD Promontory 21 xHCI PCI Host Controller Driver");
+MODULE_IMPORT_NS("xhci");
+MODULE_LICENSE("GPL");
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 585b2f3117b0..039c26b241d0 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -696,12 +696,23 @@ static const struct pci_device_id pci_ids_renesas[] = {
 	{ /* end: all zeroes */ }
 };
 
+/* handled by xhci-pci-prom21 if enabled */
+static const struct pci_device_id pci_ids_prom21[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_PROM21_XHCI_43FC) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_PROM21_XHCI_43FD) },
+	{ /* end: all zeroes */ }
+};
+
 static int xhci_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	if (IS_ENABLED(CONFIG_USB_XHCI_PCI_RENESAS) &&
 			pci_match_id(pci_ids_renesas, dev))
 		return -ENODEV;
 
+	if (IS_ENABLED(CONFIG_USB_XHCI_PCI_PROM21) &&
+	    pci_match_id(pci_ids_prom21, dev))
+		return -ENODEV;
+
 	return xhci_pci_common_probe(dev, id);
 }
 
diff --git a/drivers/usb/host/xhci-pci.h b/drivers/usb/host/xhci-pci.h
index e87c7d9d76b8..11f435f94322 100644
--- a/drivers/usb/host/xhci-pci.h
+++ b/drivers/usb/host/xhci-pci.h
@@ -4,6 +4,9 @@
 #ifndef XHCI_PCI_H
 #define XHCI_PCI_H
 
+#define PCI_DEVICE_ID_AMD_PROM21_XHCI_43FC	0x43fc
+#define PCI_DEVICE_ID_AMD_PROM21_XHCI_43FD	0x43fd
+
 int xhci_pci_common_probe(struct pci_dev *dev, const struct pci_device_id *id);
 void xhci_pci_remove(struct pci_dev *dev);
 
diff --git a/include/linux/platform_data/usb-xhci-prom21.h b/include/linux/platform_data/usb-xhci-prom21.h
new file mode 100644
index 000000000000..ee672ad452a8
--- /dev/null
+++ b/include/linux/platform_data/usb-xhci-prom21.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD Promontory 21 xHCI auxiliary device platform data.
+ *
+ * Copyright (C) 2026 Jihong Min <hurryman2212@gmail.com>
+ */
+
+#ifndef _LINUX_PLATFORM_DATA_USB_XHCI_PROM21_H
+#define _LINUX_PLATFORM_DATA_USB_XHCI_PROM21_H
+
+#include <linux/compiler_types.h>
+#include <linux/types.h>
+
+struct pci_dev;
+
+struct prom21_xhci_pdata {
+	struct pci_dev *pdev;
+	void __iomem *regs;
+	resource_size_t rsrc_len;
+};
+
+#endif
-- 
2.53.0



* [PATCH v7 0/2] AMD Promontory 21 xHCI temperature sensor support
From: Jihong Min @ 2026-05-19  0:07 UTC
  To: Greg Kroah-Hartman, Mathias Nyman
  Cc: Guenter Roeck, Jonathan Corbet, Shuah Khan, Mario Limonciello,
	Basavaraj Natikar, Michal Pecio, Mario Limonciello,
	Yaroslav Isakov, linux-usb, linux-hwmon, linux-doc, linux-pci,
	linux-kernel, Jihong Min

Hi,

This series adds temperature monitoring for AMD Promontory 21 (PROM21)
xHCI PCI functions.

Patch 1 adds a small PROM21-specific xHCI PCI glue driver. USB host
operation is delegated to the common xhci-pci code, while the PROM21 glue
publishes an auxiliary device for optional sensor support.

Patch 2 adds an auxiliary-bus hwmon driver that binds to that auxiliary
device and exposes the PROM21 xHCI temperature value as temp1_input.

The hwmon driver reads the sensor through a vendor index/data register pair
in the xHCI PCI MMIO BAR. It does not wake the parent PCI device for hwmon
reads; if the parent is suspended, the read returns -ENODATA.

Changes in v7:
- Tie the hidden PROM21 PCI glue option to the user-visible
  SENSORS_PROM21_XHCI option instead of enabling it for all x86 builds.
- Drop an unnecessary NULL check after successful xhci_pci_common_probe().
- Use a goto-based cleanup path in prom21_xhci_create_auxdev().

Jihong Min (2):
  usb: xhci-pci: add AMD Promontory 21 PCI glue
  hwmon: add AMD Promontory 21 xHCI temperature sensor support

 Documentation/hwmon/index.rst                 |   1 +
 Documentation/hwmon/prom21-xhci.rst           | 101 ++++++++
 drivers/hwmon/Kconfig                         |  10 +
 drivers/hwmon/Makefile                        |   1 +
 drivers/hwmon/prom21-xhci.c                   | 239 ++++++++++++++++++
 drivers/usb/host/Kconfig                      |   6 +
 drivers/usb/host/Makefile                     |   1 +
 drivers/usb/host/xhci-pci-prom21.c            | 137 ++++++++++
 drivers/usb/host/xhci-pci.c                   |  11 +
 drivers/usb/host/xhci-pci.h                   |   3 +
 include/linux/platform_data/usb-xhci-prom21.h |  22 ++
 11 files changed, 532 insertions(+)
 create mode 100644 Documentation/hwmon/prom21-xhci.rst
 create mode 100644 drivers/hwmon/prom21-xhci.c
 create mode 100644 drivers/usb/host/xhci-pci-prom21.c
 create mode 100644 include/linux/platform_data/usb-xhci-prom21.h


base-commit: 4d3a2a466b8d68d852a1f3bbf11204b718428dc4
-- 
2.53.0


* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19  0:01 UTC
  To: Paul Moore
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTGDOJZDzEA31qc0S6FJpYNC4oD__HQ-65wqqwhng=V9Q@mail.gmail.com>

On Mon, May 18, 2026 at 4:57 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> > On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> > [...]
> > > In my opinion, making killswitch an LSM is more of a procedural item
> > > that deals with how we view a capability like killswitch.  I
> > > personally view killswitch as somewhat similar to Lockdown, which is
> > > why I made the suggestion.
> > >
> > > The use of kprobes, while an interesting idea, presents problems as
> > > allowing any kernel symbol to be killed introduces the potential for
> > > security regressions.  As a reminder, some LSMs, as well as other
> > > kernel subsystems, have mechanisms in place to restrict root and/or
> > > enforce one-way configuration locks; while many people equate "root"
> > > with full control, in many cases today that is not strictly correct.
> > >
> > > Yes, kprobes have been around for some time, this is not a new
> > > problem, but killswitch makes it far more convenient and accessible to
> > > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > > stage without any significant changes to its kill mechanism, we may
> > > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > > which I think would be an unfortunate casualty.
> >
> > I don't think we can use NOKPROBE_SYMBOL(). There are functions
> > that we don't want to killswitch, but still want to trace.
>
> That was exactly my point, but we need to figure something out so
> killswitch doesn't make it easier to cause a regression.

killswitch is making it easier to fix a CVE. It can surely make it easier
to cause a regression. AFAICT, the only protection here is "it is only
for root".

Thanks,
Song


* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-18 23:59 UTC
  To: Sasha Levin
  Cc: linux-kernel, linux-doc, linux-kselftest, bpf, live-patching,
	Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
	Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
	Anthony Iliopoulos, Michal Hocko, Jiri Olsa
In-Reply-To: <agsVDqdALBoHEHlv@laps>

On Mon, May 18, 2026 at 6:33 AM Sasha Levin <sashal@kernel.org> wrote:
>
> On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
> >On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
> >> * fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
> >>   most production kernels. Even where enabled, it only works on
> >>   functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
> >>   no help for a freshly-disclosed CVE. The debugfs UI is blocked by
> >>   lockdown=integrity and the override is probabilistic.
> >>
> >> * BPF override (bpf_override_return) honors the same
> >>   ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
> >>   production kernels. Even where on, the operator interface is
> >>   "load a verified BPF program," not a one-line write.
> >
> >If it is OK for killswitch to attach to any kernel functions, do we still
> >need ALLOW_ERROR_INJECTION() for fail_function and BPF
> >override? Shall we instead also allow fail_function and BPF override
> >to attach to any kernel functions?
>
> I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
> integrity/safety mechanism for both bpf and fault injection.
>
> It protects against a "developer or CI script doing legitimate fault injection
> accidentally panics the box" scenario, not an "attacker gets in" one.

There really isn't a clear boundary between "security mechanism" and
"non-security mechanism". As we are making killswitch available
everywhere under root, users will soon learn to use it to do fault injection,
and potentially much more scary things. (Think about agents with sudo
access).

Thanks,
Song


* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-18 23:57 UTC
  To: Song Liu
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW5jQOzRTi1ea+=UPhx5W9bkBdivPagRE=O=nx0zf_vb8w@mail.gmail.com>

On Mon, May 18, 2026 at 7:23 PM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
> [...]
> > In my opinion, making killswitch an LSM is more of a procedural item
> > that deals with how we view a capability like killswitch.  I
> > personally view killswitch as somewhat similar to Lockdown, which is
> > why I made the suggestion.
> >
> > The use of kprobes, while an interesting idea, presents problems as
> > allowing any kernel symbol to be killed introduces the potential for
> > security regressions.  As a reminder, some LSMs, as well as other
> > kernel subsystems, have mechanisms in place to restrict root and/or
> > enforce one-way configuration locks; while many people equate "root"
> > with full control, in many cases today that is not strictly correct.
> >
> > Yes, kprobes have been around for some time, this is not a new
> > problem, but killswitch makes it far more convenient and accessible to
> > do dangerous things with kprobes.  If killswitch makes it past the RFC
> > stage without any significant changes to its kill mechanism, we may
> > need to start considering more liberal usage of NOKPROBE_SYMBOL()
> > which I think would be an unfortunate casualty.
>
> I don't think we can use NOKPROBE_SYMBOL(). There are functions
> that we don't want to killswitch, but still want to trace.

That was exactly my point, but we need to figure something out so
killswitch doesn't make it easier to cause a regression.

-- 
paul-moore.com


* Re: [PATCH 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-18 23:53 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

On Mon, 18 May 2026 16:40:48 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short-term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In the long term, this will help extend DAMON
> to multiple access event capture primitives (e.g., page faults and
> PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
> Operations eNgine".
[...]
> Changes from RFC v3
> - rfc v3: https://lore.kernel.org/20260516183712.81393-1-sj@kernel.org
> - Wordsmithing documentation.
> - Drop RFC tag.
> - Rebase to mm-new.

Sashiko failed [1] to review this series because it still has an old
version of mm-new, while this series is based on the current one.  The same
issue occurred with the RFC versions, so I based those on mm-stable and got
Sashiko reviews.  For the last version (RFC v3), I confirmed [2] that Sashiko
found no more blockers.  So I believe this is good to go for more testing in
mm-new.  I will of course be happy to get different inputs.

[1] https://sashiko.dev/#/patchset/20260518234119.97569-1-sj%40kernel.org
[2] https://lore.kernel.org/20260516220317.4300-1-sj@kernel.org


Thanks,
SJ

[...]


* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-18 23:52 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, linux-doc, linux-kselftest, bpf, live-patching,
	Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
	Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
	Anthony Iliopoulos, Michal Hocko, Jiri Olsa
In-Reply-To: <20260517134858.146569-1-sashal@kernel.org>

On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>
> When a kernel (security) issue goes public, fleets stay exposed until a patched
> kernel is built, distributed, and rebooted into.
>
> For many such issues the simplest mitigation is to stop calling the buggy
> function. Killswitch provides that. An admin writes:
>
>     echo "engage af_alg_sendmsg -1" \
>         > /sys/kernel/security/killswitch/control
>

With v3, we hit this with fentry and killswitch on the same function:

[root@(none) /]# bpftrace -e 'fentry:security_file_open {@count+=1;}' &
[1] 295
Attached 1 probe
[root@(none) /]# echo 'engage security_file_open 0' > /sys/kernel/security/killswitch/control
[   97.112360] killswitch: engage security_file_open=0 uid=0
auid=4294967295 ses=4294967295 comm=bash
[   97.120766] BUG: unable to handle page fault for address: ffffffffb5855043
[   97.121212] #PF: supervisor read access in kernel mode
[   97.121517] #PF: error_code(0x0000) - not-present page
[   97.121710] PGD 4a76067 P4D 4a77067 PUD 4a78063 PMD 0
[   97.121710] Oops: Oops: 0000 [#1] SMP NOPTI
[   97.121710] CPU: 1 UID: 0 PID: 430 Comm: bash Tainted: G
     N H 7.1.0-rc4+ #195 PREEMPT(full)
[   97.121710] Tainted: [N]=TEST, [H]=KILLSWITCH
[   97.121710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   97.121710] RIP: 0010:fd_install+0x1c/0x220
[   97.121710] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
1e fa 0f 1f 44 00 00 65 48 8b 15 47 a0 a4 04 41 54 55 53 48 8b 9a 70
0a 00 00 <f6> 46 43 01 0f 85 62 01 00 00 41 89 fc 48 89 f5 65 ff 05 3d
a0 a4
[   97.121710] RSP: 0018:ffa0000000f2fe70 EFLAGS: 00010286
[   97.121710] RAX: ffffffffb5855000 RBX: ff11000100911c40 RCX: 0000000000000000
[   97.121710] RDX: ff110001045349c0 RSI: ffffffffb5855000 RDI: 0000000000000003
[   97.121710] RBP: ff11000100be81c0 R08: 0000000000000001 R09: 0000000000000000
[   97.121710] R10: 0000000000000001 R11: 00000000000008c2 R12: 0000000000000003
[   97.121710] R13: 00000000ffffff9c R14: 0000000000000101 R15: 0000000000000000
[   97.121710] FS:  00007fb231d4d740(0000) GS:ff110001b5855000(0000)
knlGS:0000000000000000
[   97.121710] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.121710] CR2: ffffffffb5855043 CR3: 0000000114513002 CR4: 0000000000771ef0
[   97.121710] PKRU: 00000000
[   97.121710] Call Trace:
[   97.121710]  <TASK>
[   97.121710]  do_sys_openat2+0x7f/0xe0
[   97.121710]  __x64_sys_openat+0x56/0xa0
[   97.121710]  do_syscall_64+0xc4/0xf20
[   97.121710]  ? srso_alias_return_thunk+0x5/0xfbef5
[   97.121710]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   97.121710] RIP: 0033:0x7fb231e4ee1b
[   97.121710] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25
18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00
00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b
14 25
[   97.121710] RSP: 002b:00007ffefe160770 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[   97.121710] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb231e4ee1b
[   97.121710] RDX: 0000000000000000 RSI: 000055616f0411d0 RDI: 00000000ffffff9c
[   97.121710] RBP: 000055616f0411d0 R08: 000055616f046b60 R09: 0064692d656e6968
[   97.121710] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   97.121710] R13: 000055616f03cb20 R14: 000055616f039310 R15: 0000000000000000
[   97.121710]  </TASK>
[   97.121710] Modules linked in:
[   97.121710] CR2: ffffffffb5855043
[   97.121710] ---[ end trace 0000000000000000 ]---
[   97.121710] RIP: 0010:fd_install+0x1c/0x220
[   97.121710] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
1e fa 0f 1f 44 00 00 65 48 8b 15 47 a0 a4 04 41 54 55 53 48 8b 9a 70
0a 00 00 <f6> 46 43 01 0f 85 62 01 00 00 41 89 fc 48 89 f5 65 ff 05 3d
a0 a4
[   97.121710] RSP: 0018:ffa0000000f2fe70 EFLAGS: 00010286
[   97.121710] RAX: ffffffffb5855000 RBX: ff11000100911c40 RCX: 0000000000000000
[   97.121710] RDX: ff110001045349c0 RSI: ffffffffb5855000 RDI: 0000000000000003
[   97.121710] RBP: ff11000100be81c0 R08: 0000000000000001 R09: 0000000000000000
[   97.121710] R10: 0000000000000001 R11: 00000000000008c2 R12: 0000000000000003
[   97.121710] R13: 00000000ffffff9c R14: 0000000000000101 R15: 0000000000000000
[   97.121710] FS:  00007fb231d4d740(0000) GS:ff110001b5855000(0000)
knlGS:0000000000000000
[   97.121710] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.121710] CR2: ffffffffb5855043 CR3: 0000000114513002 CR4: 0000000000771ef0
[   97.121710] PKRU: 00000000
[   97.121710] Kernel panic - not syncing: Fatal exception
[   97.121710] Kernel Offset: disabled


* Re: [PATCH 5/6] kselftest: alloc_tag: add kselftest for ioctl interface
From: Abhishek Bapat @ 2026-05-18 23:48 UTC (permalink / raw)
  To: Suren Baghdasaryan, Andrew Morton, Kent Overstreet
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda
In-Reply-To: <6d8a2059a4625ef638062815d9586f7a083d02ec.1777936301.git.abhishekbapat@google.com>

On Mon, May 4, 2026 at 4:36 PM Abhishek Bapat <abhishekbapat@google.com> wrote:
>
> Introduce a kselftest to verify the new IOCTL-based interface for
> /proc/allocinfo. The test covers:
>
> 1. Validation of the filename filter.
> 2. Validation of the function filter.
>
> The first test validates the functionality of the filename filter. Using
> "mm/memory.c" as the candidate filename filter, it retrieves filtered
> entries from both procfs and ioctl and matches the first VEC_MAX_ENTRIES
> entries.
>
> The second test validates the functionality of the function filter.
> It uses "dup_mm" as the candidate function as we do not expect this
> function name to change frequently and hence won't be needing to modify
> this test often.
>
> Note that both the tests match line no, function name and file name
> fields. Bytes allocated and calls are not matched as those values may
> change in the time when the data is being read from procfs and ioctl and
> hence can lead to false negatives.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> ---
>  tools/testing/selftests/alloc_tag/Makefile    |   9 +
>  .../alloc_tag/allocinfo_ioctl_test.c          | 316 ++++++++++++++++++
>  2 files changed, 325 insertions(+)
>  create mode 100644 tools/testing/selftests/alloc_tag/Makefile
>  create mode 100644 tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
>
> diff --git a/tools/testing/selftests/alloc_tag/Makefile b/tools/testing/selftests/alloc_tag/Makefile
> new file mode 100644
> index 000000000000..f2b8fc022c3b
> --- /dev/null
> +++ b/tools/testing/selftests/alloc_tag/Makefile
> @@ -0,0 +1,9 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +TEST_GEN_PROGS := allocinfo_ioctl_test
> +
> +CFLAGS += -Wall
> +CFLAGS += -I../../../../usr/include
> +
> +include ../lib.mk
> +
> diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> new file mode 100644
> index 000000000000..543023ca3d27
> --- /dev/null
> +++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> @@ -0,0 +1,316 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/* kselftest for allocinfo ioctl
> + * allocinfo ioctl retrieves allocinfo data through ioctl
> + * Copyright (C) 2026 Google, Inc.
> + */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <stdbool.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <linux/types.h>
> +#include <linux/alloc_tag.h>
> +#include "../kselftest.h"
> +
> +#define MAX_LINE_LEN           512
> +#define ALLOCINFO_PROC         "/proc/allocinfo"
> +
> +enum ioctl_ret {
> +       IOCTL_SUCCESS = 0,
> +       IOCTL_FAILURE = 1,
> +       IOCTL_INVALID_DATA = 2,
> +};
> +
> +#define VEC_MAX_ENTRIES 32
> +
> +struct allocinfo_tag_data_vec {
> +       struct allocinfo_tag_data tag[VEC_MAX_ENTRIES];
> +       __u64 count;
> +};
> +
> +static inline int __allocinfo_get_content_id(int dev_fd, struct allocinfo_content_id *params)
> +{
> +       return ioctl(dev_fd, ALLOCINFO_IOC_CONTENT_ID, params);
> +}
> +
> +static inline int __allocinfo_get_at(int dev_fd, struct allocinfo_get_at *params)
> +{
> +       return ioctl(dev_fd, ALLOCINFO_IOC_GET_AT, params);
> +}
> +
> +static inline int __allocinfo_get_next(int dev_fd, struct allocinfo_tag_data *params)
> +{
> +       return ioctl(dev_fd, ALLOCINFO_IOC_GET_NEXT, params);
> +}
> +
> +static bool match_entry(const struct allocinfo_tag_data *procfs_entry,
> +                       const struct allocinfo_tag_data *tag_data,
> +                       bool match_bytes, bool match_calls, bool match_lineno,
> +                       bool match_function, bool match_filename)
> +{
> +       if (match_bytes && tag_data->counter.bytes != procfs_entry->counter.bytes) {
> +               ksft_print_msg("size retrieved through ioctl does not match procfs\n");
> +               return false;
> +       }
> +
> +       if (match_calls && tag_data->counter.calls != procfs_entry->counter.calls) {
> +               ksft_print_msg("call count retrieved through ioctl does not match procfs\n");
> +               return false;
> +       }
> +
> +       if (match_lineno && tag_data->tag.lineno != procfs_entry->tag.lineno) {
> +               ksft_print_msg("lineno retrieved through ioctl does not match procfs\n");
> +               return false;
> +       }
> +
> +       if (match_function &&
> +           strncmp(tag_data->tag.function, procfs_entry->tag.function, ALLOCINFO_STR_SIZE)) {
> +               ksft_print_msg("function retrieved through ioctl does not match procfs\n");
> +               return false;
> +       }
> +
> +       if (match_filename &&
> +           strncmp(tag_data->tag.filename, procfs_entry->tag.filename, ALLOCINFO_STR_SIZE)) {
> +               ksft_print_msg("filename retrieved through ioctl does not match procfs\n");
> +               return false;
> +       }
> +       return true;
> +}
> +
> +static bool match_entries(const struct allocinfo_tag_data_vec *procfs_entries,
> +                         const struct allocinfo_tag_data_vec *tags,
> +                         bool match_bytes, bool match_calls, bool match_lineno,
> +                         bool match_function, bool match_filename)
> +{
> +       __u64 i;
> +
> +       if (procfs_entries->count != tags->count) {
> +               ksft_print_msg("Entry count mismatch. ioctl entries: %llu, proc entries: %llu\n",
> +                              tags->count, procfs_entries->count);
> +               return false;
> +       }
> +       for (i = 0; i < procfs_entries->count; i++) {
> +               if (!match_entry(&procfs_entries->tag[i], &tags->tag[i],
> +                                match_bytes, match_calls, match_lineno,
> +                                match_function, match_filename)) {
> +                       ksft_print_msg("%lluth entry does not match.\n", i);
> +                       return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static int get_filtered_procfs_entries(struct allocinfo_tag_data_vec *procfs_entries,
> +                                      const struct allocinfo_filter *filter, int fd)
> +{
> +       FILE *fp = fdopen(fd, "r");
> +       char line[MAX_LINE_LEN];
> +       int matches, skip_lines = 2;
> +       struct allocinfo_tag_data procfs_entry;
> +
> +       if (!fp) {
> +               ksft_print_msg("Failed to open " ALLOCINFO_PROC " for reading\n");
> +               return 1;
> +       }
> +       memset(procfs_entries, 0, sizeof(*procfs_entries));
> +       while (fgets(line, sizeof(line), fp) && procfs_entries->count < VEC_MAX_ENTRIES) {
> +               /*The first two procfs entries are for the header, so we skip them.*/
> +               if (skip_lines-- > 0)
> +                       continue;
This logic will change slightly in the v2 patch set. Currently, the
test skips the first two lines of /proc/allocinfo because they are not
actual entries. However, the number of lines to skip might change.
Therefore, the v2 patch set will instead look for the first line that
matches the expected entry format and start parsing from that line.
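
A rough sketch of what that could look like, assuming the entry format
stays "<bytes> <calls> <file>:<line> func:<function>" as in the sscanf()
call in this patch. parse_allocinfo_line(), struct parsed_entry and
STR_SIZE are hypothetical names used only for illustration; this is not
the v2 code.

```c
#include <stdbool.h>
#include <stdio.h>

#define STR_SIZE 64	/* hypothetical stand-in for ALLOCINFO_STR_SIZE */

struct parsed_entry {
	unsigned long long bytes;
	unsigned long long calls;
	unsigned long long lineno;
	char filename[STR_SIZE];
	char function[STR_SIZE];
};

/*
 * Return true only if the line carries all five fields of an entry.
 * Header lines fail the first numeric conversion, so the caller can
 * simply skip every line that does not parse instead of counting
 * header lines. The field widths keep sscanf() within the buffers.
 */
static bool parse_allocinfo_line(const char *line, struct parsed_entry *e)
{
	return sscanf(line, "%llu %llu %63[^:]:%llu func:%63s",
		      &e->bytes, &e->calls, e->filename,
		      &e->lineno, e->function) == 5;
}
```

The fgets() loop would then call this helper on every line and continue
on a false return, which makes the result independent of how many header
lines precede the first entry.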
> +
> +               memset(&procfs_entry, 0, sizeof(procfs_entry));
> +               matches = sscanf(line, "%llu %llu %[^:]:%llu func:%s",
> +                                &procfs_entry.counter.bytes,
> +                                &procfs_entry.counter.calls,
> +                                procfs_entry.tag.filename,
> +                                &procfs_entry.tag.lineno,
> +                                procfs_entry.tag.function);
> +
> +               if (matches != 5)
> +                       continue;
> +
> +               if (filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) {
> +                       if (strncmp(procfs_entry.tag.filename,
> +                                   filter->fields.filename, ALLOCINFO_STR_SIZE))
> +                               continue;
> +               }
> +               if (filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) {
> +                       if (strncmp(procfs_entry.tag.function,
> +                                   filter->fields.function, ALLOCINFO_STR_SIZE))
> +                               continue;
> +               }
> +               if (filter->mask & ALLOCINFO_FILTER_MASK_LINENO) {
> +                       if (procfs_entry.tag.lineno != filter->fields.lineno)
> +                               continue;
> +               }
> +               if (filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) {
> +                       if (procfs_entry.counter.bytes < filter->fields.min_size)
> +                               continue;
> +               }
> +               if (filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) {
> +                       if (procfs_entry.counter.bytes > filter->fields.max_size)
> +                               continue;
> +               }
> +
> +               memcpy(&procfs_entries->tag[procfs_entries->count++], &procfs_entry,
> +                      sizeof(procfs_entry));
> +       }
> +       return 0;
> +}
> +
> +static enum ioctl_ret get_filtered_ioctl_entries(struct allocinfo_tag_data_vec *tags,
> +                                                const struct allocinfo_filter *filter, int fd,
> +                                                __u64 start_pos)
> +{
> +       struct allocinfo_content_id start_cont_id, end_cont_id;
> +       struct allocinfo_get_at get_at_params;
> +       const int max_retries = 10;
> +       int retry_count = 0;
> +       int status;
> +
> +       /*
> +        * __allocinfo_get_content_id may return different values if a kernel module was loaded
> +        * between the two calls. If that happens, the data gathered cannot be considered consistent
> +        * and hence needs to be fetched again to avoid flakiness.
> +        */
> +       do {
> +               if (__allocinfo_get_content_id(fd, &start_cont_id)) {
> +                       ksft_print_msg("allocinfo_get_content_id failed\n");
> +                       return IOCTL_FAILURE;
> +               }
> +
> +               memset(tags, 0, sizeof(*tags));
> +               memset(&get_at_params, 0, sizeof(get_at_params));
> +               memcpy(&get_at_params.filter, filter, sizeof(*filter));
> +               get_at_params.pos = start_pos;
> +               if (__allocinfo_get_at(fd, &get_at_params)) {
> +                       ksft_print_msg("allocinfo_get_at failed\n");
> +                       return IOCTL_FAILURE;
> +               }
> +               memcpy(&tags->tag[tags->count++], &get_at_params.data, sizeof(get_at_params.data));
> +
> +               while (tags->count < VEC_MAX_ENTRIES &&
> +                      __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
> +                       tags->count++;
> +
> +               if (__allocinfo_get_content_id(fd, &end_cont_id)) {
> +                       ksft_print_msg("allocinfo_get_content_id failed\n");
> +                       return IOCTL_FAILURE;
> +               }
> +
> +               if (start_cont_id.id == end_cont_id.id) {
> +                       status = IOCTL_SUCCESS;
> +               } else {
> +                       ksft_print_msg("allocinfo_get_content_id mismatch, retrying...\n");
> +                       status = IOCTL_INVALID_DATA;
> +               }
> +       } while (status == IOCTL_INVALID_DATA && retry_count++ < max_retries);
> +
> +       return status;
> +}
> +
> +static int run_filter_test(const struct allocinfo_filter *filter)
> +{
> +       int fd;
> +       struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> +       struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> +       int ioctl_status;
> +       int ret = KSFT_PASS;
> +
> +       if (!tags || !procfs_entries) {
> +               ksft_print_msg("Memory allocation failed.\n");
> +               ret = KSFT_FAIL;
> +               goto freemem;
> +       }
> +
> +       fd = open(ALLOCINFO_PROC, O_RDONLY);
> +       if (fd < 0) {
> +               ksft_exit_skip("Failed to open " ALLOCINFO_PROC ": %s\n", strerror(errno));
> +               ret = KSFT_FAIL;
> +               goto freemem;
> +       }
> +
> +       if (get_filtered_procfs_entries(procfs_entries, filter, fd)) {
> +               ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> +               ret = KSFT_FAIL;
> +               goto exit;
> +       }
> +
> +       if (procfs_entries->count == 0) {
> +               ksft_print_msg("No entries found in " ALLOCINFO_PROC ", skipping test\n");
> +               ret = KSFT_SKIP;
> +               goto exit;
> +       }
> +
> +       ioctl_status = get_filtered_ioctl_entries(tags, filter, fd, 0);
> +       if (ioctl_status == IOCTL_INVALID_DATA) {
> +               ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
> +               ret = KSFT_SKIP;
> +               goto exit;
> +       }
> +       if (ioctl_status == IOCTL_FAILURE) {
> +               ksft_print_msg("Error retrieving IOCTL entries.\n");
> +               ret = KSFT_FAIL;
> +               goto exit;
> +       }
> +
> +       if (!match_entries(procfs_entries, tags, false, false, true, true, true))
> +               ret = KSFT_FAIL;
> +
> +exit:
> +       close(fd);
> +freemem:
> +       free(tags);
> +       free(procfs_entries);
> +       return ret;
> +}
> +
> +static int test_filename_filter(void)
> +{
> +       struct allocinfo_filter filter;
> +       const char *target_filename = "mm/memory.c";
> +
> +       memset(&filter, 0, sizeof(filter));
> +       filter.mask |= ALLOCINFO_FILTER_MASK_FILENAME;
> +       strncpy(filter.fields.filename, target_filename, ALLOCINFO_STR_SIZE);
> +
> +       return run_filter_test(&filter);
> +}
> +
> +static int test_function_filter(void)
> +{
> +       struct allocinfo_filter filter;
> +       const char *target_function = "dup_mm";
> +
> +       memset(&filter, 0, sizeof(filter));
> +       filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
> +       strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
> +
> +       return run_filter_test(&filter);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       int ret;
> +
> +       ksft_set_plan(2);
> +
> +       ret = test_filename_filter();
> +       if (ret == KSFT_SKIP)
> +               ksft_test_result_skip("Skipping test_filename_filter\n");
> +       else
> +               ksft_test_result(ret == KSFT_PASS, "test_filename_filter\n");
> +
> +       ret = test_function_filter();
> +       if (ret == KSFT_SKIP)
> +               ksft_test_result_skip("Skipping test_function_filter\n");
> +       else
> +               ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
> +
> +       ksft_finished();
> +}
> --
> 2.54.0.545.g6539524ca2-goog
>


* Re: [PATCH 3/6] alloc_tag: add size-based filtering to ioctl
From: Abhishek Bapat @ 2026-05-18 23:43 UTC (permalink / raw)
  To: Hao Ge
  Cc: Suren Baghdasaryan, Andrew Morton, Kent Overstreet, Shuah Khan,
	Jonathan Corbet, linux-doc, linux-kernel, linux-mm, Sourav Panda
In-Reply-To: <b681fad1-1a27-47fa-b58a-1f639eb3f3d9@linux.dev>

On Wed, May 13, 2026 at 11:54 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
>
> On 2026/5/5 07:36, Abhishek Bapat wrote:
> > Extend the allocinfo filtering mechanism to allow users to filter tags
> > based on the total number of bytes allocated [min_size, max_size]. The
> > size range is inclusive.
> >
> > Filtering by size involves retrieving allocinfo per-CPU counters, which
> > is an expensive operation. Hence, the performance of size-based
> > filtering will be worse than other filters.
> >
> > Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> > ---
> >   include/uapi/linux/alloc_tag.h |  8 +++++++-
> >   lib/alloc_tag.c                | 15 +++++++++++++++
> >   2 files changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> > index 0cc9db5298c6..229068efd24c 100644
> > --- a/include/uapi/linux/alloc_tag.h
> > +++ b/include/uapi/linux/alloc_tag.h
> > @@ -20,6 +20,8 @@ struct allocinfo_tag {
> >       char function[ALLOCINFO_STR_SIZE];
> >       char filename[ALLOCINFO_STR_SIZE];
> >       __u64 lineno;
> > +     __u64 min_size;
> > +     __u64 max_size;
> >   };
>
> allocinfo_tag is used both as a tag identifier in the output data
>
> (allocinfo_tag_data.tag) and as filter criteria
>
> (allocinfo_filter.fields). min_size and max_size are filter
>
> parameters, not tag identity. Also, allocinfo_to_params() does not
>
> fill these fields, so userspace gets zeros in the output, which is
>
> a bit confusing. Might be cleaner to separate filter parameters
>
> from tag identity.
>
> >   struct allocinfo_counter {
> > @@ -39,13 +41,17 @@ enum {
> >       ALLOCINFO_FILTER_FUNCTION,
> >       ALLOCINFO_FILTER_FILENAME,
> >       ALLOCINFO_FILTER_LINENO,
> > -     __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> > +     ALLOCINFO_FILTER_MIN_SIZE,
> > +     ALLOCINFO_FILTER_MAX_SIZE,
> > +     __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
> >   };
> >
> >   #define ALLOCINFO_FILTER_MASK_MODNAME               (1 << ALLOCINFO_FILTER_MODNAME)
> >   #define ALLOCINFO_FILTER_MASK_FUNCTION              (1 << ALLOCINFO_FILTER_FUNCTION)
> >   #define ALLOCINFO_FILTER_MASK_FILENAME              (1 << ALLOCINFO_FILTER_FILENAME)
> >   #define ALLOCINFO_FILTER_MASK_LINENO                (1 << ALLOCINFO_FILTER_LINENO)
> > +#define ALLOCINFO_FILTER_MASK_MIN_SIZE               (1 << ALLOCINFO_FILTER_MIN_SIZE)
> > +#define ALLOCINFO_FILTER_MASK_MAX_SIZE               (1 << ALLOCINFO_FILTER_MAX_SIZE)
> >
> >   #define ALLOCINFO_FILTER_MASKS \
> >       ((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index 7ff936e15e97..98a27c302928 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -195,6 +195,9 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> >
> >   static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> >   {
> > +     struct alloc_tag *tag;
> > +     struct alloc_tag_counters counters;
> > +
> >       if (!ct || !filter || !filter->mask)
> >               return true;
> >
> > @@ -214,6 +217,18 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> >           ct->lineno != filter->fields.lineno)
> >               return false;
> >
> > +     if ((filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) ||
> > +         (filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
> > +             tag = ct_to_alloc_tag(ct);
> > +             counters = alloc_tag_read(tag);
>
> alloc_tag_read() is called twice for matching tags
>
> When size filtering is enabled, matches_filter() calls alloc_tag_read()
>
> to check the size, and then allocinfo_to_params() calls it again to
>
> fill the output data:
>
> matches_filter():
>
>      counters = alloc_tag_read(tag);        // 1st read
>
>      if (counters.bytes < min_size)
>
>          return false;
>
> allocinfo_to_params():
>
>      counter = alloc_tag_read(tag);         // 2nd read (same tag)
>
>      data->counter.bytes = counter.bytes;
>
> For matching tags, the same per-CPU counter aggregation is done twice.
>
> On large machines this is not trivial. Would it make sense to cache
>
> the counters from matches_filter() and reuse them in allocinfo_to_params()?
>
>
> > +             if ((filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
> > +                 counters.bytes < filter->fields.min_size)
> > +                     return false;
> > +             if ((filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
> > +                 counters.bytes > filter->fields.max_size)
> > +                     return false;
> > +     }
> > +
>
> No validation for min_size > max_size.
>
> If both MIN_SIZE and MAX_SIZE are set but min_size > max_size,
>
> no records will match and the user gets no indication of the
>
> invalid input. This could be checked alongside the existing
>
> mask validation in allocinfo_ioctl_get_at():
>
>      if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
>
>          return -EINVAL;
>
>      +   if ((params.filter.mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
>
>      +       (params.filter.mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
>
>      +       params.filter.fields.min_size > params.filter.fields.max_size)
>
>      +            return -EINVAL;
>
> Thanks
>
> Best Regards
>
> Hao
>
Ack, will include in v2.
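
For reference, a minimal userspace sketch of the two size-related points
above: rejecting an inverted range up front, and doing the inclusive
range check on a byte count that has already been read once. struct
size_filter, validate_size_filter() and size_in_range() are hypothetical
names; the mask bits only mirror the positions of
ALLOCINFO_FILTER_MIN_SIZE and ALLOCINFO_FILTER_MAX_SIZE in the enum from
this patch. This is an illustration, not the v2 implementation.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the uapi mask definitions in this series. */
#define FILTER_MASK_MIN_SIZE	(1u << 4)
#define FILTER_MASK_MAX_SIZE	(1u << 5)

struct size_filter {
	uint64_t mask;
	uint64_t min_size;
	uint64_t max_size;
};

/* Reject an inverted range instead of silently matching nothing. */
static int validate_size_filter(const struct size_filter *f)
{
	const uint64_t both = FILTER_MASK_MIN_SIZE | FILTER_MASK_MAX_SIZE;

	if ((f->mask & both) == both && f->min_size > f->max_size)
		return -EINVAL;
	return 0;
}

/*
 * Inclusive [min_size, max_size] check. The caller passes in a byte
 * count it has already read, so the same value can also be reused to
 * fill the output and the counters are aggregated only once.
 */
static int size_in_range(const struct size_filter *f, uint64_t bytes)
{
	if ((f->mask & FILTER_MASK_MIN_SIZE) && bytes < f->min_size)
		return 0;
	if ((f->mask & FILTER_MASK_MAX_SIZE) && bytes > f->max_size)
		return 0;
	return 1;
}
```

The validation only fires when both bits are set, so a filter that sets
only one bound is still accepted whatever the unused field contains.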
> >       return true;
> >   }
> >


* Re: [PATCH 2/6] alloc_tag: add ioctl filters to /proc/allocinfo
From: Abhishek Bapat @ 2026-05-18 23:42 UTC (permalink / raw)
  To: Hao Ge
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Suren Baghdasaryan, Andrew Morton, Kent Overstreet
In-Reply-To: <7e4881af-3fca-474a-abb7-daa75986e3ad@linux.dev>

On Wed, May 13, 2026 at 11:15 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
> On 2026/5/5 07:36, Abhishek Bapat wrote:
> > Extend the capability of the IOCTL mechanism to filter allocations based
> > on tag's module name, function name, file name and line number.
> >
> > Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> > ---
> >   include/uapi/linux/alloc_tag.h | 26 +++++++++++++++-
> >   lib/alloc_tag.c                | 55 ++++++++++++++++++++++++++++++++--
> >   2 files changed, 77 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> > index e9a5b55fcc7a..0cc9db5298c6 100644
> > --- a/include/uapi/linux/alloc_tag.h
> > +++ b/include/uapi/linux/alloc_tag.h
> > @@ -34,8 +34,32 @@ struct allocinfo_tag_data {
> >       struct allocinfo_counter counter;
> >   };
> >
> > +enum {
> > +     ALLOCINFO_FILTER_MODNAME,
> > +     ALLOCINFO_FILTER_FUNCTION,
> > +     ALLOCINFO_FILTER_FILENAME,
> > +     ALLOCINFO_FILTER_LINENO,
> > +     __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> > +};
> > +
> > +#define ALLOCINFO_FILTER_MASK_MODNAME                (1 << ALLOCINFO_FILTER_MODNAME)
> > +#define ALLOCINFO_FILTER_MASK_FUNCTION               (1 << ALLOCINFO_FILTER_FUNCTION)
> > +#define ALLOCINFO_FILTER_MASK_FILENAME               (1 << ALLOCINFO_FILTER_FILENAME)
> > +#define ALLOCINFO_FILTER_MASK_LINENO         (1 << ALLOCINFO_FILTER_LINENO)
> > +
> > +#define ALLOCINFO_FILTER_MASKS \
> > +     ((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> > +
> > +struct allocinfo_filter {
> > +     __u64 mask; /* bitmask of the filter fields used */
> > +     struct allocinfo_tag fields;
> > +};
> > +
> >   struct allocinfo_get_at {
> > -     __u64 pos;      /* input */
> > +     /* inputs */
> > +     __u64 pos;
> > +     struct allocinfo_filter filter;
> > +     /* output */
> >       struct allocinfo_tag_data data;
> >   };
> >
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index 5c24d2f954d4..7ff936e15e97 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -47,6 +47,7 @@ int alloc_tag_ref_offs;
> >   struct allocinfo_private {
> >       struct codetag_iterator iter;
> >       bool print_header;
> > +     struct allocinfo_filter filter;
> >       /* ioctl uses a separate iterator not to interfere with reads */
> >       struct codetag_iterator ioctl_iter;
> >       bool positioned; /* seq_open_private() sets to 0 */
> > @@ -156,6 +157,11 @@ static void allocinfo_copy_str(char *dest, const char *src)
> >       strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> >   }
> >
> > +static int allocinfo_cmp_str(const char *str, const char *template)
> > +{
> > +     return strncmp(allocinfo_str(str), template, ALLOCINFO_STR_SIZE);
> > +}
> > +
> >   static void allocinfo_to_params(struct codetag *ct,
> >                               struct allocinfo_tag_data *data)
> >   {
> > @@ -187,26 +193,67 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> >       return 0;
> >   }
> >
> > +static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> > +{
> > +     if (!ct || !filter || !filter->mask)
> > +             return true;
> > +
>
> Minor: in matches_filter(), returning true when ct is NULL seems
> semantically odd since both callers already check for ct != NULL
> before calling this function. Not a real issue though.
>
> > +     if ((filter->mask & ALLOCINFO_FILTER_MASK_MODNAME) &&
> > +         ct->modname && (allocinfo_cmp_str(ct->modname, filter->fields.modname)))
> > +             return false;
> > +
>
> In matches_filter(), when ct->modname is NULL (built-in kernel code),
> the modname filter is skipped due to
>
>         ct->modname && (allocinfo_cmp_str(...))
>
> This means built-in allocations always pass the modname filter. Since
> built-in code doesn't belong to any module, maybe it should not match
> when a modname filter is set:
>
>         if (filter->mask & ALLOCINFO_FILTER_MASK_MODNAME) {
>                 if (!ct->modname)
>                         return false;
>                 if (allocinfo_cmp_str(ct->modname, filter->fields.modname))
>                         return false;
>         }
>
> Thanks
> Best Regards
> Hao
>
Ack, will include in v2.

> > +     if ((filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) &&
> > +         ct->function && (allocinfo_cmp_str(ct->function, filter->fields.function)))
> > +             return false;
> > +
> > +     if ((filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) &&
> > +         ct->filename && (allocinfo_cmp_str(ct->filename, filter->fields.filename)))
> > +             return false;
> > +
> > +     if ((filter->mask & ALLOCINFO_FILTER_MASK_LINENO) &&
> > +         ct->lineno != filter->fields.lineno)
> > +             return false;
> > +
> > +     return true;
> > +}
> > +
> >   static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> >   {
> >       struct allocinfo_private *priv;
> >       struct codetag *ct;
> > -     __u64 pos;
> >       struct allocinfo_get_at params = {0};
> > +     __u64 skip_count;
> >
> >       if (copy_from_user(&params, arg, sizeof(params)))
> >               return -EFAULT;
> >
> > +     if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
> > +             return -EINVAL;
> > +
> >       priv = (struct allocinfo_private *)m->private;
> > -     pos = params.pos;
> > +
> > +     skip_count = params.pos;
> >
> >       codetag_lock_module_list(alloc_tag_cttype, true);
> >
> > +     if (params.filter.mask)
> > +             priv->filter = params.filter;
> > +     else
> > +             priv->filter.mask = 0;
> > +
> >       /* Find the codetag */
> >       priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> >       ct = codetag_next_ct(&priv->ioctl_iter);
> > -     while (ct && pos--)
> > +
> > +     while (ct) {
> > +             if (matches_filter(ct, &priv->filter)) {
> > +                     if (skip_count == 0)
> > +                             break;
> > +                     skip_count--;
> > +             }
> >               ct = codetag_next_ct(&priv->ioctl_iter);
> > +     }
> > +
> >       if (ct) {
> >               allocinfo_to_params(ct, &params.data);
> >               priv->positioned = true;
> > @@ -240,6 +287,8 @@ static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> >       }
> >
> >       ct = codetag_next_ct(&priv->ioctl_iter);
> > +     while (ct && !matches_filter(ct, &priv->filter))
> > +             ct = codetag_next_ct(&priv->ioctl_iter);
> >       if (ct)
> >               allocinfo_to_params(ct, &params);
> >

^ permalink raw reply

* [PATCH 28/28] Docs/admin-guide/mm/damon/usage: update for memcg damon filter
From: SeongJae Park @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

Update the DAMON usage document for the newly added feature that
monitors which memory cgroup the memory belongs to.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/admin-guide/mm/damon/usage.rst | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index abd38385b3c23..d46875e603d86 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -74,7 +74,7 @@ comma (",").
     │ │ │ │ │ │ nr_regions/min,max
     │ │ │ │ │ │ :ref:`probes <damon_usage_sysfs_probes>`/nr_probes
     │ │ │ │ │ │ │ 0/filters/nr_filters
-    │ │ │ │ │ │ │ │ 0/type,matching,allow
+    │ │ │ │ │ │ │ │ 0/type,matching,allow,path
     │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ │ ...
     │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
@@ -289,7 +289,9 @@ the data attribute for the probe.
 In the beginning, ``filters`` directory has only one file, ``nr_filters``.
 Writing a number (``N``) to the file creates the number of child directories
 named ``0`` to ``N-1``.  Each directory represents each filter and works in a
-way similar to that for :ref:`DAMOS filter <sysfs_filters>`.
+way similar to that for :ref:`DAMOS filter <sysfs_filters>`.  When the filter
+``type`` is ``memcg``, ``path`` file acts as ``memcg_path`` for :ref:`DAMOS
+filter <sysfs_filters>`.
 
 .. _sysfs_targets:
 
-- 
2.47.3

^ permalink raw reply related

* [PATCH 27/28] Docs/mm/damon/design: update for memcg damon filter
From: SeongJae Park @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

Update the DAMON design document for the newly added feature that
monitors which memory cgroup the memory belongs to.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/mm/damon/design.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 937960d2b6d73..2da7ca0d3d17a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -293,8 +293,8 @@ registration is made by specifying a probe per attribute.  Each of the probe
 specifies a rule to determine if a given memory region has the related
 attribute.  The rule is constructed with multiple filters.  The filters work
 same to :ref:`DAMOS filters <damon_design_damos_filters>` except the supported
-filter types.  Currently only ``anon`` filter type is supported for data
-attributes monitoring.
+filter types.  Currently only ``anon`` and ``memcg`` filter types are supported
+for data attributes monitoring.
 
 If such probes are registered, DAMON executes the probes for each region's
 sampling memory when it does the access :ref:`sampling
-- 
2.47.3

^ permalink raw reply related

* [PATCH 21/28] Docs/admin-guide/mm/damon/usage: document data attributes monitoring
From: SeongJae Park @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

Update DAMON usage document for the newly added data attributes
monitoring feature.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/admin-guide/mm/damon/usage.rst | 44 ++++++++++++++++++--
 Documentation/mm/damon/design.rst            |  2 +
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index c74dfa0ff3bfb..abd38385b3c23 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -72,6 +72,11 @@ comma (",").
     │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
     │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
     │ │ │ │ │ │ nr_regions/min,max
+    │ │ │ │ │ │ :ref:`probes <damon_usage_sysfs_probes>`/nr_probes
+    │ │ │ │ │ │ │ 0/filters/nr_filters
+    │ │ │ │ │ │ │ │ 0/type,matching,allow
+    │ │ │ │ │ │ │ │ ...
+    │ │ │ │ │ │ │ ...
     │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
     │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target,obsolete_target
     │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
@@ -98,6 +103,9 @@ comma (",").
     │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds,nr_snapshots,max_nr_snapshots
     │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
     │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
+    │ │ │ │ │ │ │ │ │ probes
+    │ │ │ │ │ │ │ │ │ │ 0/hits
+    │ │ │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ ...
     │ │ │ │ ...
@@ -227,8 +235,8 @@ contexts/<N>/monitoring_attrs/
 
 Files for specifying attributes of the monitoring including required quality
 and efficiency of the monitoring are in ``monitoring_attrs`` directory.
-Specifically, two directories, ``intervals`` and ``nr_regions`` exist in this
-directory.
+Specifically, three directories, ``intervals``, ``nr_regions`` and ``probes``
+exist in this directory.
 
 Under ``intervals`` directory, three files for DAMON's sampling interval
 (``sample_us``), aggregation interval (``aggr_us``), and update interval
@@ -262,6 +270,27 @@ tuning-applied current values of the two intervals can be read from the
 ``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to
 the ``state`` file.
 
+.. _damon_usage_sysfs_probes:
+
+contexts/<N>/monitoring_attrs/probes/
+-------------------------------------
+
+A directory for registering :ref:`data attributes monitoring
+<damon_design_data_attrs_monitoring>` probes.
+
+In the beginning, this directory has only one file, ``nr_probes``.  Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``.  Each directory represents each monitoring probe.
+
+In each probe directory, one directory, ``filters`` exists.  The directory
+contains files for installing filters for the probe, that are used to determine
+the data attribute for the probe.
+
+In the beginning, ``filters`` directory has only one file, ``nr_filters``.
+Writing a number (``N``) to the file creates the number of child directories
+named ``0`` to ``N-1``.  Each directory represents each filter and works in a
+way similar to that for :ref:`DAMOS filter <sysfs_filters>`.
+
 .. _sysfs_targets:
 
 contexts/<N>/targets/
@@ -615,10 +644,19 @@ tried_regions/<N>/
 ------------------
 
 In each region directory, you will find five files (``start``, ``end``,
-``nr_accesses``, ``age``, and ``sz_filter_passed``).  Reading the files will
+``nr_accesses``, ``age`` and ``sz_filter_passed``).  Reading the files will
 show the properties of the region that corresponding DAMON-based operation
 scheme ``action`` has tried to be applied.
 
+tried_regions/<N>/probes/
+-------------------------
+
+In each region directory, one directory (``probes``) also exists.  In the
+directory, subdirectories named ``0`` to ``N-1`` exist.  ``N`` is the number
+of installed probes.  In each number-named directory, a file (``hits``) exists.
+Reading the file shows the number of the region's samples that were positive
+for the data attributes monitoring probe.
+
 Example
 ~~~~~~~
 
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 7fcb726263c1a..937960d2b6d73 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -276,6 +276,8 @@ interval``, DAMON checks if the region's size and access frequency
 (``nr_accesses``) has significantly changed.  If so, the counter is reset to
 zero.  Otherwise, the counter is increased.
 
+.. _damon_design_data_attrs_monitoring:
+
 Data Attributes Monitoring
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.47.3

^ permalink raw reply related

* [PATCH 20/28] Docs/mm/damon/design: document data attributes monitoring
From: SeongJae Park @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260518234119.97569-1-sj@kernel.org>

Update DAMON design document for newly added data attributes monitoring
feature.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/mm/damon/design.rst | 37 +++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index faea09dc88d63..7fcb726263c1a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -276,6 +276,43 @@ interval``, DAMON checks if the region's size and access frequency
 (``nr_accesses``) has significantly changed.  If so, the counter is reset to
 zero.  Otherwise, the counter is increased.
 
+Data Attributes Monitoring
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Data access pattern is only one type of data attribute.  In some use cases,
+users need to know about more data attributes.  For example, users may
+need to know how much of a given hot or cold memory region is backed by
+anonymous pages, or belongs to a specific cgroup.  For such use cases, the data
+attributes monitoring feature is provided.
+
+Using the feature, users can register data attributes of their interest to the
+DAMON :ref:`context <damon_design_execution_model_and_data_structures>`.  The
+registration is made by specifying a probe per attribute.  Each of the probe
+specifies a rule to determine if a given memory region has the related
+attribute.  The rule is constructed with multiple filters.  The filters work
+same to :ref:`DAMOS filters <damon_design_damos_filters>` except the supported
+filter types.  Currently only ``anon`` filter type is supported for data
+attributes monitoring.
+
+If such probes are registered, DAMON executes the probes for each region's
+sampling memory when it does the access :ref:`sampling
+<damon_design_region_based_sampling>`.  The number of samples that are
+identified as having the data attribute (hitting the probe) per
+:ref:`aggregation interval <damon_design_monitoring>` is accounted in a
+per-region per-probe counter.  Users can therefore know how much of a given
+DAMON region has a specific data attribute by reading the per-region per-probe
+probe hits counter after each aggregation interval.
+
+This is a sampling based mechanism.  Hence, it is lightweight but the output
+may include some measurement errors.  The output should be used with good
+understanding of statistics.
+
+Another way to do this for higher accuracy is using :ref:`DAMOS filter
+<damon_design_damos_filters>` with ``stat`` :ref:`action
+<damon_design_damos_action>` and ``sz_ops_filter_passed`` :ref:`stat
+<damon_design_damos_stat>`.  This approach provides the data attributes
+information at page level.  But, because it operates at page level, the
+overhead is proportional to the size of the memory.
 
 Dynamic Target Space Updates Handling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- 
2.47.3

^ permalink raw reply related

* [PATCH 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-18 23:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extend DAMON
to multiple access event capture primitives (e.g., page faults and PMU)
and eventually pivot DAMON to a "Data Attributes Monitoring and
Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But the monitoring part is still only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  Particularly, users want to see the access pattern
information together with the types of the memory.  For example, users
who work on using huge pages efficiently want to know how much of the
DAMON-found hot/cold regions is backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of the
DAMON-found hot/cold regions belongs to specific cgroups.

To meet this demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can specify the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines in their fleet.  If the runtime is long enough
and the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes, in addition to the data
access, they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be applicable to any memory, and determine if the given memory
has the appropriate data attribute.  E.g., whether the memory at
physical address 42 belongs to cgroup A.  Each 'data probe' is
configured with filters that are very similar to the DAMOS filters.

When DAMON checks if the sampling address memory of each region has
been accessed since the last check, it applies the data probes if any
are registered.  In the same way that it accounts the number of access
check-positive samples (nr_accesses), it accounts the number of
positive samples of each data probe in another per-region counter
array, namely 'probe_hits'.  When DAMON resets nr_accesses every
aggregation interval, it resets 'probe_hits' together.

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how much of the hot/cold memory regions has the data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A, and 80 percent of the cgroup A-belonging hot
memory is backed by huge pages.

Patches Sequence
================

The first eight patches implement the core feature, interface and the
working support.  Patch 1 introduces the data probe data structure,
namely damon_probe.  Patch 2 extends damon_ctx for installing data
probes.  Patch 3 introduces another data structure for the filters of
each data probe, namely damon_filter.  Patch 4 updates the damon_ctx
commit function to handle the probes.  Patch 5 extends damon_region for
the per-region per-probe positive samples counter, namely probe_hits.
Patch 6 extends damon_operations for applying probes on the underlying
DAMON operations implementation.  Patch 7 updates kdamond_fn() to invoke
the probes applying callback.  Patch 8 finally implements the probes
support on paddr ops.

Ten changes for the user interface (patches 9-18) come next.  Patches
9-13 implement sysfs directories and files for setting data probes,
namely the probes directory, probe directory, filters directory, filter
directory and filter directory internal files, respectively.  Patch 14
connects the user inputs that are made via the sysfs files to DAMON
core.  The following three patches (patches 15-17) implement sysfs
directories and files for showing the probe_hits to users, namely the
probes directory, probe directory and hits files, respectively.  Patch
18 introduces a new tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring the belonging
memory cgroup follow.  Depending on the feedback, this part might be
split into a separate series in the future.  Patch 22 defines the DAMON
filter type for the new attribute, namely DAMON_FILTER_TYPE_MEMCG.
Patch 23 adds the support on paddr ops.  Patch 24 updates the sysfs
interface for setup of the target memcg.  Patch 25 moves code for easy
reuse of the filter target memcg setup.  Patch 26 connects the user
input to the core layer.  Finally, patches 27 and 28 update the design
and usage documents for the memcg attribute monitoring support.

Discussions
===========

This allows page properties monitoring with overhead that is low enough
to be always enabled on real world workloads.  Because the sampling time
for the access check is reused for the data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for the access check is reused for the data
attributes check, the additional overhead is minimal.

DAMOS-based page level properties monitoring should still be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful for debugging and
tuning.

Plan for Dropping RFC tag
=========================

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in the
future.  I personally think it should be enough for common use cases,
though, and am therefore not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  There is also
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of that work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of the data attributes, there is no
reason to treat data access as special.  There might be some users who
are not interested in access at all but want to know the location of
memory of a specific type.  The data probes interface will allow doing
that.  Further, we could extend the interface to let users set any data
attribute as the 'primary' attribute.  Then, DAMON will split and merge
regions in a way that can best-present the 'primary' attribute.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v3
- rfc v3: https://lore.kernel.org/20260516183712.81393-1-sj@kernel.org
- Wordsmithing documentation.
- Drop RFC tag.
- Rebase to mm-new.
Changes from RFC v2.2
- rfc v2.2: https://lore.kernel.org/20260515004433.128933-1-sj@kernel.org
- Rename damon_aggregated_v2 trace event to damon_region_aggregated.
- Address Sashiko issues.
  - Enclose arguments on damon_for_each_{probe,filter}[_safe]() macros.
  - Fix typos in comments and documents.
  - Update probe_hits for region split and merge.
  - Add more documentation for damon_operation->apply_probes() callback.
  - Reduce unnecessary folio_{get,put}() in damon_pa_apply_probes().
  - Define damon_sysfs_probe_attrs as static.
  - Link scheme tried region sysfs dir and increase the count only after
    all internal dir population success.
  - Commit damon_filter->memcg_id for newly added filters.
Changes from RFC v2.1
- rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
- Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.
Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  46 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  69 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 211 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 224 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1303 insertions(+), 48 deletions(-)


base-commit: b491d3b062a367a23fdc98def7fe3a8cf21bb3b0
-- 
2.47.3

^ permalink raw reply

* Re: [PATCH 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Abhishek Bapat @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Hao Ge
  Cc: Suren Baghdasaryan, Shuah Khan, Jonathan Corbet, linux-doc,
	linux-kernel, linux-mm, Sourav Panda, Kent Overstreet,
	Andrew Morton
In-Reply-To: <2f546525-8ff6-4bbe-86ae-6f474f7cefe3@linux.dev>

On Wed, May 13, 2026 at 9:38 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Suren and Abhishek
>
>
> Thanks for the patch! A couple of minor comments below.
>
>
> On 2026/5/5 07:36, Abhishek Bapat wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> >
> > Add the following ioctl commands for /proc/allocinfo file:
> >
> > ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> > to check whether the file content has changed specifically due to module
> > load/unload. Every time a module is loaded / unloaded, the returned
> > value will be different. By comparing the identifier value at the
> > beginning and at the end of the content retrieval operation, users can
> > validate retrieved information for consistency.
> >
> > ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> > is the position of a record in /proc/allocinfo.
> >
> > ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> > one. If no records were previously retrieved, returns the first
> > record.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> > ---
> >   .../userspace-api/ioctl/ioctl-number.rst      |   2 +
> >   include/linux/codetag.h                       |   1 +
> >   include/uapi/linux/alloc_tag.h                |  54 ++++++
> >   lib/alloc_tag.c                               | 178 +++++++++++++++++-
> >   lib/codetag.c                                 |  11 ++
> >   5 files changed, 244 insertions(+), 2 deletions(-)
> >   create mode 100644 include/uapi/linux/alloc_tag.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 331223761fff..84f6808a8578 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -349,6 +349,8 @@ Code  Seq#    Include File                                             Comments
> >                                                                          <mailto:luzmaximilian@gmail.com>
> >   0xA5  20-2F  linux/surface_aggregator/dtx.h                            Microsoft Surface DTX driver
> >                                                                          <mailto:luzmaximilian@gmail.com>
> > +0xA6  00-0F  uapi/linux/alloc_tag.h                                    Memory allocation profiling
> > +                                                                       <mailto:surenb@google.com>
> >   0xAA  00-3F  linux/uapi/linux/userfaultfd.h
> >   0xAB  00-1F  linux/nbd.h
> >   0xAC  00-1F  linux/raw.h
> > diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> > index 8ea2a5f7c98a..2bcd4e7c809e 100644
> > --- a/include/linux/codetag.h
> > +++ b/include/linux/codetag.h
> > @@ -76,6 +76,7 @@ struct codetag_iterator {
> >
> >   void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
> >   bool codetag_trylock_module_list(struct codetag_type *cttype);
> > +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> >   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> >   struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> >
> > diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> > new file mode 100644
> > index 000000000000..e9a5b55fcc7a
> > --- /dev/null
> > +++ b/include/uapi/linux/alloc_tag.h
> > @@ -0,0 +1,54 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + *  include/linux/alloc_tag.h
> > + */
> > +
> > +#ifndef _UAPI_ALLOC_TAG_H
> > +#define _UAPI_ALLOC_TAG_H
> > +
> > +#include <linux/types.h>
> > +
> > +#define ALLOCINFO_STR_SIZE   64
> > +
> > +struct allocinfo_content_id {
> > +     __u64 id;
> > +};
> > +
> > +struct allocinfo_tag {
> > +     /* Longer names are trimmed */
> > +     char modname[ALLOCINFO_STR_SIZE];
> > +     char function[ALLOCINFO_STR_SIZE];
> > +     char filename[ALLOCINFO_STR_SIZE];
> > +     __u64 lineno;
> > +};
> > +
> > +struct allocinfo_counter {
> > +     __u64 bytes;
> > +     __u64 calls;
> > +     __u8 accurate;
> > +     __u8 pad[7]; /* Add alignment to not break the 32-bit compatible interface */
> > +};
> > +
> > +struct allocinfo_tag_data {
> > +     struct allocinfo_tag tag;
> > +     struct allocinfo_counter counter;
> > +};
> > +
> > +struct allocinfo_get_at {
> > +     __u64 pos;      /* input */
> > +     struct allocinfo_tag_data data;
> > +};
> > +
> > +#define _ALLOCINFO_IOC_CONTENT_ID    0
> > +#define _ALLOCINFO_IOC_GET_AT                1
> > +#define _ALLOCINFO_IOC_GET_NEXT              2
> > +
> > +#define ALLOCINFO_IOC_BASE           0xA6
> > +#define ALLOCINFO_IOC_CONTENT_ID     _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID,     \
> > +                                          struct allocinfo_content_id)
> > +#define ALLOCINFO_IOC_GET_AT         _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT,        \
> > +                                           struct allocinfo_get_at)
> > +#define ALLOCINFO_IOC_GET_NEXT               _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT,       \
> > +                                          struct allocinfo_tag_data)
> > +
> > +#endif /* _UAPI_ALLOC_TAG_H */
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index ed1bdcf1f8ab..5c24d2f954d4 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -14,6 +14,7 @@
> >   #include <linux/string_choices.h>
> >   #include <linux/vmalloc.h>
> >   #include <linux/kmemleak.h>
> > +#include <uapi/linux/alloc_tag.h>
> >
> >   #define ALLOCINFO_FILE_NAME         "allocinfo"
> >   #define MODULE_ALLOC_TAG_VMAP_SIZE  (100000UL * sizeof(struct alloc_tag))
> > @@ -46,6 +47,9 @@ int alloc_tag_ref_offs;
> >   struct allocinfo_private {
> >       struct codetag_iterator iter;
> >       bool print_header;
> > +     /* ioctl uses a separate iterator not to interfere with reads */
> > +     struct codetag_iterator ioctl_iter;
> > +     bool positioned; /* seq_open_private() sets to 0 */
> >   };
> >
> >   static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > @@ -125,6 +129,177 @@ static const struct seq_operations allocinfo_seq_op = {
> >       .show   = allocinfo_show,
> >   };
> >
> > +static int allocinfo_open(struct inode *inode, struct file *file)
> > +{
> > +     return seq_open_private(file, &allocinfo_seq_op,
> > +                             sizeof(struct allocinfo_private));
> > +}
> > +
> > +static int allocinfo_release(struct inode *inode, struct file *file)
> > +{
> > +     return seq_release_private(inode, file);
> > +}
> > +
> > +static const char *allocinfo_str(const char *str)
> > +{
> > +     size_t len = strlen(str);
> > +
> > +     /* Keep an extra space for the trailing NULL. */
> > +     if (len >= ALLOCINFO_STR_SIZE)
> > +             str += (len - ALLOCINFO_STR_SIZE) + 1;
> > +     return str;
> > +}
> > +
> > +/* Copy a string and trim from the beginning if it's too long */
> > +static void allocinfo_copy_str(char *dest, const char *src)
> > +{
> > +     strscpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> > +}
> > +
> > +static void allocinfo_to_params(struct codetag *ct,
> > +                             struct allocinfo_tag_data *data)
> > +{
> > +     struct alloc_tag *tag = ct_to_alloc_tag(ct);
> > +     struct alloc_tag_counters counter = alloc_tag_read(tag);
> > +
> > +     if (ct->modname)
> > +             allocinfo_copy_str(data->tag.modname, ct->modname);
> > +     else
> > +             data->tag.modname[0] = '\0';
>
> Minor nit about allocinfo_to_params():
>
> When modname is NULL (built-in kernel code), the current code sets it
> to an empty string:
>
>      if (ct->modname)
>          allocinfo_copy_str(data->tag.modname, ct->modname);
>      else
>          data->tag.modname[0] = '\0';
>
> This is of course workable in userspace by checking for an empty
> string, but I was wondering if it would be cleaner to use "vmlinux"
> as a default:
>
>      else
>          allocinfo_copy_str(data->tag.modname, "vmlinux");
>
> For some context, in our memory analysis workflow we often group
> allocations by module to get a quick overview of where memory goes,
> for example:
>
> vmlinux:    2.1 GB    (kernel core)
> nvidia:     1.2 GB    (GPU driver)
> iwlwifi:    800 MB    (WiFi driver)
> ext4:       500 MB    (filesystem)
>
> Having a consistent identifier for kernel built-in allocations would
> avoid each userspace tool needing to handle the empty string as a
> special case. Totally fine if this is intentional though.
>
Thanks for bringing this up, I can certainly make this change.
However, the information is not currently exposed this way through
/proc/allocinfo. /proc/allocinfo does not categorize kernel non-module
allocations as vmlinux, so there will be a delta between how IOCTL and
/proc/allocinfo behave. Suren, could you comment on whether this
recommendation is fine by you?
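
For illustration, if the ioctl keeps returning an empty modname for
built-in code, a userspace tool could fold it into a fixed label itself.
This is a minimal sketch, assuming the ALLOCINFO_STR_SIZE of 64 from the
quoted uapi header; the helper name allocinfo_group_name() and the
"vmlinux" label are hypothetical and not part of the patch.

```c
#include <stddef.h>

#define ALLOCINFO_STR_SIZE 64	/* mirrors the value in the quoted uapi header */

/*
 * Hypothetical userspace helper: return the label to group a record under.
 * An empty modname (built-in code in the current patch) maps to "vmlinux";
 * any other modname is returned unchanged.
 */
static const char *allocinfo_group_name(const char modname[ALLOCINFO_STR_SIZE])
{
	return modname[0] != '\0' ? modname : "vmlinux";
}
```

With such a helper the grouping key is the same whichever default the
kernel side ends up using.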

> > +     allocinfo_copy_str(data->tag.function, ct->function);
> > +     allocinfo_copy_str(data->tag.filename, ct->filename);
> > +     data->tag.lineno = ct->lineno;
> > +     data->counter.bytes = counter.bytes;
> > +     data->counter.calls = counter.calls;
> > +     data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> > +}
> > +
> > +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> > +{
> > +     struct allocinfo_content_id params;
> > +
> > +     codetag_lock_module_list(alloc_tag_cttype, true);
> > +     params.id = codetag_get_content_id(alloc_tag_cttype);
> > +     codetag_lock_module_list(alloc_tag_cttype, false);
> > +     if (copy_to_user(arg, &params, sizeof(params)))
> > +             return -EFAULT;
> > +
> > +     return 0;
> > +}
> > +
> > +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> > +{
> > +     struct allocinfo_private *priv;
> > +     struct codetag *ct;
> > +     __u64 pos;
> > +     struct allocinfo_get_at params = {0};
> > +
> > +     if (copy_from_user(&params, arg, sizeof(params)))
> > +             return -EFAULT;
> > +
> > +     priv = (struct allocinfo_private *)m->private;
> > +     pos = params.pos;
> > +
> > +     codetag_lock_module_list(alloc_tag_cttype, true);
> > +
> > +     /* Find the codetag */
> > +     priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> > +     ct = codetag_next_ct(&priv->ioctl_iter);
> > +     while (ct && pos--)
> > +             ct = codetag_next_ct(&priv->ioctl_iter);
>
> I noticed that codetag_next_ct(&priv->ioctl_iter) and
> priv->positioned are accessed without serialization in the ioctl
> path. Concurrent ioctl calls on the same fd could race on these
> fields. Just something I spotted while reading the code.
>
> Thanks
>
> Best Regards
> Hao
>
I believe this should be prevented by `codetag_lock_module_list`; am I
wrong in my understanding?

> > +     if (ct) {
> > +             allocinfo_to_params(ct, &params.data);
> > +             priv->positioned = true;
> > +     }
> > +
> > +     codetag_lock_module_list(alloc_tag_cttype, false);
> > +
> > +     if (!ct)
> > +             return -ENOENT;
> > +
> > +     if (copy_to_user(arg, &params, sizeof(params)))
> > +             return -EFAULT;
> > +
> > +     return 0;
> > +}
> > +
> > +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> > +{
> > +     struct allocinfo_private *priv;
> > +     struct codetag *ct;
> > +     struct allocinfo_tag_data params = {0};
> > +     int ret = 0;
> > +
> > +     priv = (struct allocinfo_private *)m->private;
> > +
> > +     codetag_lock_module_list(alloc_tag_cttype, true);
> > +
> > +     if (!priv->positioned) {
> > +             priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> > +             priv->positioned = true;
> > +     }
> > +
> > +     ct = codetag_next_ct(&priv->ioctl_iter);
> > +     if (ct)
> > +             allocinfo_to_params(ct, &params);
> > +
> > +     if (!ct) {
> > +             priv->positioned = false;
> > +             ret = -ENOENT;
> > +     }
> > +     codetag_lock_module_list(alloc_tag_cttype, false);
> > +
> > +     if (ret == 0) {
> > +             if (copy_to_user(arg, &params, sizeof(params)))
> > +                     return -EFAULT;
> > +     }
> > +     return ret;
> > +}
> > +
> > +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> > +                         unsigned long __arg)
> > +{
> > +     void __user *arg = (void __user *)__arg;
> > +     int ret;
> > +
> > +     switch (cmd) {
> > +     case ALLOCINFO_IOC_CONTENT_ID:
> > +             ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> > +             break;
> > +     case ALLOCINFO_IOC_GET_AT:
> > +             ret = allocinfo_ioctl_get_at(file->private_data, arg);
> > +             break;
> > +     case ALLOCINFO_IOC_GET_NEXT:
> > +             ret = allocinfo_ioctl_get_next(file->private_data, arg);
> > +             break;
> > +     default:
> > +             ret = -ENOIOCTLCMD;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> > +                                unsigned long arg)
> > +{
> > +     return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> > +}
> > +#endif
> > +
> > +static const struct proc_ops allocinfo_proc_ops = {
> > +     .proc_open              = allocinfo_open,
> > +     .proc_read_iter         = seq_read_iter,
> > +     .proc_lseek             = seq_lseek,
> > +     .proc_release           = allocinfo_release,
> > +     .proc_ioctl             = allocinfo_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +     .proc_compat_ioctl      = allocinfo_compat_ioctl,
> > +#endif
> > +
> > +};
> > +
> >   size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
> >   {
> >       struct codetag_iterator iter;
> > @@ -946,8 +1121,7 @@ static int __init alloc_tag_init(void)
> >               return 0;
> >       }
> >
> > -     if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> > -                                  sizeof(struct allocinfo_private), NULL)) {
> > +     if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
> >               pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> >               shutdown_mem_profiling(false);
> >               return -ENOMEM;
> > diff --git a/lib/codetag.c b/lib/codetag.c
> > index 304667897ad4..93aa30991563 100644
> > --- a/lib/codetag.c
> > +++ b/lib/codetag.c
> > @@ -48,6 +48,17 @@ bool codetag_trylock_module_list(struct codetag_type *cttype)
> >       return down_read_trylock(&cttype->mod_lock) != 0;
> >   }
> >
> > +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> > +{
> > +     lockdep_assert_held(&cttype->mod_lock);
> > +
> > +     /*
> > +      * next_mod_seq is updated on every load, so can be used to identify
> > +      * content changes.
> > +      */
> > +     return cttype->next_mod_seq;
> > +}
> > +
> >   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> >   {
> >       struct codetag_iterator iter = {

Note, I will be following up with a v2 patchset with your feedback
included. Please bring up any other points you'd want to clarify so
that I can include all the changes in the v2 patchset. Thanks for
reviewing!
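
For reference, the consistency check described for
ALLOCINFO_IOC_CONTENT_ID can be sketched as a retry loop. This is a minimal
sketch, not code from the patch: struct snapshot_ops, snapshot_consistent()
and the fake_* helpers are hypothetical stand-ins for the ioctl calls, so
that the loop can be exercised without the proposed interface.

```c
#include <stdint.h>

/* Hypothetical callbacks standing in for the proposed ioctls. */
struct snapshot_ops {
	uint64_t (*read_id)(void *ctx);	/* e.g. ALLOCINFO_IOC_CONTENT_ID */
	int (*read_all)(void *ctx);	/* e.g. GET_AT(0), then GET_NEXT until -ENOENT */
};

/*
 * Retry until the content id is the same before and after a full pass.
 * Returns 0 on a consistent pass, -1 if read_all() failed or every one of
 * the max_tries passes saw the id change.
 */
static int snapshot_consistent(const struct snapshot_ops *ops, void *ctx,
			       int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		uint64_t before = ops->read_id(ctx);

		if (ops->read_all(ctx) != 0)
			return -1;
		if (ops->read_id(ctx) == before)
			return 0;
	}
	return -1;
}

/* Fake source for demonstration: the id changes during the first pass only. */
struct fake_src {
	uint64_t id;
	int passes;
};

static uint64_t fake_read_id(void *ctx)
{
	return ((struct fake_src *)ctx)->id;
}

static int fake_read_all(void *ctx)
{
	struct fake_src *s = ctx;

	if (s->passes++ == 0)
		s->id++;	/* simulate a module load in the middle of a read */
	return 0;
}
```

A real caller would plug the ioctl wrappers into snapshot_ops and discard
the records collected during any pass that did not end with a matching id.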

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Christian König,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4xwJ7SAhKPJyRtMTw6psTO7H1EcFFpDw0po1W8PX4FE8g@mail.gmail.com>

On Mon, May 18, 2026 at 3:43 PM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> > > >
> > > > On embedded platforms a central process often allocates dma-buf
> > > > memory on behalf of client applications. Without a way to
> > > > attribute the charge to the requesting client's cgroup, the
> > > > cost lands on the allocator, making per-cgroup memory limits
> > > > ineffective for the actual consumers.
> > > >
> > > > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > > > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > > > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > > > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > > > the mem_accounting module parameter enabled, the buffer is charged
> > > > to the allocator's own cgroup.
> > > >
> > > > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > > > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > > > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > > > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > > > all accounting through a single MEMCG_DMABUF path.
> > > >
> > > [...]
> > >
> > > > -               if (mem_accounting)
> > > > -                       flags |= __GFP_ACCOUNT;
> > >
> > > Hi Albert,
> > >
> > > would it be better to move this and its description to patch 1? It
> > > looks like patch 1 already introduces the double accounting changes,
> > > and patch 2 is mainly just supporting remote charging.
> >
> > Hi Barry,
> >
> > Thanks for looking into this series! Yes, in my head I was trying to
> > keep patch 1, which was taken from a previous, different series, and
> > then diverge from it starting with patch 2. This would clarify the
> > difference between the two. But I can see it just added some confusion
> > (for example, patch 1 charges on dma_buf_export() and then it is moved
> > to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
> > for the next version, including your suggestion.
>
> Yep, I understand the situation now. I also understand
> that you were referring to T.J.'s patch, which caused
> some back-and-forth confusion for readers when reading
> patches 1 and 2.

Albert, please don't feel obligated to keep my patch intact if
integrating it into other patches simplifies the series.

> > > Also, mem_accounting is only used by system_heap.c; has this patchset
> > > also eliminated its need?
> >
> > No, mem_accounting is still handled in this patch for the general case
> > where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
> >
> > +       if (memcg)
> > +               css_get(&memcg->css);
> > +       else if (mem_accounting)
> > +               memcg = get_mem_cgroup_from_mm(current->mm);
>
> I see. What feels a bit odd to me is that mem_accounting
> could either be dropped (with unconditional charging), or
> it should cover both remote and local charge cases.
>
> I don’t have a strong opinion here—it just feels a bit
> strange, since its description is quite generic for memcg:
>
> "Enable cgroup-based memory accounting for dma-buf heap
> allocations (default=false)."
>
> Best Regards
> Barry
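
The asymmetry discussed above can be written down as a small decision
function. This is a sketch of the quoted dma_heap_buffer_alloc() branch
only, under the assumption that no other condition is involved; enum
charge_target and pick_charge_target() are hypothetical names.

```c
#include <stdbool.h>

enum charge_target {
	CHARGE_NONE,		/* buffer is not charged to any memcg */
	CHARGE_ALLOCATOR,	/* charged to the allocating task's own cgroup */
	CHARGE_REMOTE,		/* charged to the memcg resolved from charge_pid_fd */
};

/*
 * Mirrors the quoted branch: an explicit remote memcg wins regardless of
 * mem_accounting; the module parameter only gates the local fallback.
 */
static enum charge_target pick_charge_target(bool have_remote_memcg,
					     bool mem_accounting)
{
	if (have_remote_memcg)
		return CHARGE_REMOTE;
	if (mem_accounting)
		return CHARGE_ALLOCATOR;
	return CHARGE_NONE;
}
```

Written this way, mem_accounting=false still allows remote charging, which
is the part that looks odd next to the parameter's generic description.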

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Christian König, Albert Esteve, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4y=Gsv=FSUjJ5+99Gg6ULUnv0LRexCGOGetzChR3YA44Q@mail.gmail.com>

On Mon, May 18, 2026 at 3:19 PM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 5:17 AM T.J. Mercier <tjmercier@google.com> wrote:
> [...]
> > > > > Yeah I think this might work. I know of 3 cases, and it trivially
> > > > > solves the first two. The third requires some work on our end to
> > > > > extend our userspace interfaces to include the pidfd but it seems
> > > > > doable. I'm checking with our graphics folks.
> > > > >
> > > > > 1) Direct allocation from user (e.g. app -> allocation ioctl on
> > > > > /dev/dma_heap/foo)
> > > > > No changes required to userspace. mem_accounting=1 charges the app.
> > > > >
> > > > > 2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
> > > > > -> gralloc)
> > > > > gralloc has the caller's pid as described in the commit message. Open
> > > > > a pidfd and pass it in the dma_heap_allocation_data.
> > > > >
> > > > > 3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
> > > > > SurfaceFlinger -> gralloc)
> > > > > In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
> > > > > we need to add the app's pidfd to the SurfaceFlinger -> gralloc
> > > > > interface, or transfer the memcg charge from SurfaceFlinger to the app
> > > > > after the allocation.
> > > > > It'd be nice to avoid the charge transfer option entirely, but if we
> > > > > need it that doesn't seem so bad in this case because it's a bulk
> > > > > charge for the entire dmabuf rather than per-page. So the exporter
> > > > > doesn't need to get involved (we wouldn't need a new dma_buf_op) and
> > > > > we wouldn't have to worry about looping and locking for each page.
> > > > >
> > > >
> > > > Hi T.J.,
> > > >
> > > > Your description of the three different cases sounds very interesting.
> > > > It helps me understand how difficult it can be to correctly charge
> > > > dma-buf in the current user scenarios.
> > > >
> > > > I’m wondering where I can find Android userspace code that transfers
> > > > the PID of RPC callers. Do we have any existing sample code in Android
> > > > for this?
> > >
> > > Hi Barry,
> > >
> > > In Java android.os.Binder.getCallingPid() will provide it. Here
> >
> > ... let me try again
> >
> > Here are some examples from the framework code:
> >
> > https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&ss=android%2Fplatform%2Fsuperproject
> >
> > In native code we have AIBinder_getCallingPid and
> > android::IPCThreadState::self()->getCallingPid() (or
> > android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)
> >
> > https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=android%2Fplatform%2Fsuperproject
>
> Thanks very much, T.J. That is very helpful. I guess
> that would require user space to understand the RPC
> procedure, including single-hop and two-hop cases, and
> make the corresponding changes.

Yes, this is solvable by having a policy in allocator services where
the caller is implicitly charged, while also supporting cases where
the RPC includes additional explicit information about who to charge.
This needs security checks to prevent arbitrary remote charges at both
the ioctl() level (selinux charge_to from patch 4), and at the RPC
level (not sure yet but maybe a private interface between system
components and gralloc), so that only privileged components can
initiate remote charges.
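
As a rough sketch of that allocator-side policy (the names below are
hypothetical and not taken from gralloc or from this series): charge the
RPC caller by default, and honor an explicit target only when the caller
is privileged.

```c
#include <sys/types.h>

/*
 * Hypothetical policy helper for an allocator service.
 * Returns the pid to charge, or -1 if the request must be rejected.
 */
static pid_t resolve_charge_pid(pid_t caller, pid_t requested,
				int caller_privileged)
{
	/* No explicit target, or the caller named itself: implicit charge. */
	if (requested <= 0 || requested == caller)
		return caller;

	/* Explicit remote charge: only privileged components may ask. */
	return caller_privileged ? requested : (pid_t)-1;
}
```

The ioctl-level check (selinux charge_to) would still be needed underneath
this, since the kernel cannot rely on the service enforcing the policy.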

> You pointed out the SurfaceFlinger cases, which are
> two hops. It seems that AI models are also using
> dma_heap, at least from what I have observed on MTK
> and Qualcomm phones. Likely, we need to understand
> those RPC relationships in userspace and make the
> corresponding changes.
> I assume AI models are a single-hop case?

It's currently a mix because AI model loading is largely controlled by
vendor code right now. Some implementations use
AHardwareBuffer_allocate, but that comes with unnecessary RPC overhead
for the AI use case. So I think we should be trending towards direct
allocations from dma-buf heaps because model loading time is
important.

^ permalink raw reply

* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-18 23:39 UTC (permalink / raw)
  To: Christian König
  Cc: Albert Esteve, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <88efe10a-8b93-4a81-8279-4a5559d0f17c@amd.com>

On Mon, May 18, 2026 at 7:07 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/18/26 14:50, Albert Esteve wrote:
> > On Mon, May 18, 2026 at 9:20 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/15/26 19:06, T.J. Mercier wrote:
> >>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>
> >>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>>>> On embedded platforms a central process often allocates dma-buf
> >>>>> memory on behalf of client applications. Without a way to
> >>>>> attribute the charge to the requesting client's cgroup, the
> >>>>> cost lands on the allocator, making per-cgroup memory limits
> >>>>> ineffective for the actual consumers.
> >>>>>
> >>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>>>
> >>>> Please be aware that pidfds come in two flavors:
> >>>>
> >>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >>>> doesn't implicitly depend on this distinction not existing.
> >>>
> >>> Hi Christian,
> >>>
> >>> Memcg is not a controller that supports "thread mode" so all threads
> >>> in a group should belong to the same memcg.
> >>
> >> BTW: Exactly that is the requirement automotive has with their native context use case.
> >>
> >> The use case is that you have a daemon which has multiple threads where each one is acting on behalf of some other process.
> >>
> >> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
> >>
> >> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
> >
> > Hi Christian,
> >
> > Thanks for sharing this automotive use case. If I understand correctly,
> > the actual requirement is attributing dma-buf charges to the right
> > client, not putting each daemon thread in a different cgroup?
>
> Nope, exactly that's the difference.
>
> The thread acts as a filtering agent for both memory allocation and command submission for somebody else. The process on whose behalf the daemon acts can even be in a client VM, completely remote over some network, or even something like a microcontroller.
>
> Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc. needs to be accounted to one client which can be different for each thread of the process.
>
> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
>
> The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
>
> So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
>
> > If so,
> > the `charge_pid_fd` approach achieves this directly by passing the
> > client's `pid_fd`, without needing to add per-thread cgroup
> > infrastructure.
>
> Well it's already a massive improvement, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
>
> Doing that automatically for CPU and I/O time would just be nice to have additionally.
>
> Regards,
> Christian.

Hopefully I'm following correctly here.... So you are duplicating the
GPU driver stack to achieve remote accounting on a per-thread basis?
Does this mean for GPU allocations you currently have some GFP_ACCOUNT
magic in your driver to attribute GPU memory to the correct remote
client? So this series would close the gap for dma-buf allocations,
but what about private GPU driver memory allocated on behalf of a
client?

^ permalink raw reply

* Re: [PATCH v6 1/2] usb: xhci-pci: add AMD Promontory 21 PCI glue
From: Guenter Roeck @ 2026-05-18 23:25 UTC (permalink / raw)
  To: Jihong Min, Michal Pecio
  Cc: Greg Kroah-Hartman, Mathias Nyman, Jonathan Corbet, Shuah Khan,
	Mario Limonciello, Basavaraj Natikar, linux-usb, linux-hwmon,
	linux-doc, linux-pci, linux-kernel, Mario Limonciello (AMD),
	Yaroslav Isakov
In-Reply-To: <7ff352be-05d2-4c21-931e-18238172e4d7@gmail.com>

On 5/18/26 16:06, Jihong Min wrote:
> 
> 
> On 5/19/26 06:37, Michal Pecio wrote:
>> That's true.
>> Making this possible is the whole purpose of "if IS_ENABLED" here:
> 
> I re-checked the Kconfig cases, and I think you are right here.
> 
> The two cases I was trying to avoid are:
> 
>    1. the sensor driver is built as a module, or loads only after the
>       initramfs stage, but the PROM21 controller has already been bound by
>       the generic xhci-pci driver, so no auxiliary device exists for the
>       sensor driver to bind to;
> 
>    2. the built-in generic xhci-pci driver rejects the PROM21 controller, but
>       xhci-pci-prom21 is only available as a module and is not present during
>       initramfs, leaving USB behind that controller unavailable at that stage.
> 
> Looking at your proposed Kconfig shape again, it handles both cases.
> 
> If SENSORS_PROM21_XHCI=n, then no sensor support is requested and
> USB_XHCI_PCI_PROM21 can stay disabled. In that case generic xhci-pci binds
> the controller, which is fine because there is no sensor driver that
> needs an auxiliary device.
> 
> If SENSORS_PROM21_XHCI=m or y and USB_XHCI_PCI=y, then
> USB_XHCI_PCI_PROM21 follows USB_XHCI_PCI and becomes y. That means the
> PROM21 glue is available during early boot, creates the auxiliary device,
> and the hwmon driver can still bind later if it is built as a module.
> 
> If USB_XHCI_PCI=m, then xhci-pci itself is modular. In that case needing the
> PROM21 glue module in initramfs is not a PROM21-specific built-in/module
> split problem; it is the normal requirement for a modular xHCI PCI setup.
> 
> So I agree that tying the hidden glue option to whether
> SENSORS_PROM21_XHCI is enabled is reasonable.
> 
>> Currently, you have a weird situation where xhci-pci-prom21 always
>> binds on x86 and xhci-pci on other platforms (with the unofficial PCIe
>> card you mentioned), plus the sensor cannot work on other platforms.
> 
> Agreed. I also agree that the X86 dependency is only a heuristic and is
> not a good restriction for a PCI ID based driver. PROM21 is mainly used on
> AMD x86 desktop platforms today, but the unofficial PCIe card example shows
> that the device can exist outside the normal AMD x86 chipset topology.
> 
> I do not know whether other PROM21-related functionality is supported on
> non-x86 platforms, but this driver does not need to prevent the xHCI
> temperature sensor path from being built there.
> 
>> One could further argue that neither should it care whether some hwmon
>> driver exists at all, or which kernel releases it exists in :)
> 
> Right. I think the cleanest result is:
> 
>    - generic xhci-pci handles PROM21 when no sensor support is requested;
>    - xhci-pci-prom21 handles PROM21 only when the sensor path is enabled;
>    - the hwmon driver remains the user-visible option.
> 
> Unless Guenter or the USB maintainers object, I plan to change the next
> revision in that direction and test the Kconfig combinations locally.
> 

Ok with me if you are sure that it works. When you send a new revision,
you might want to also change the error path as suggested.

Thanks,
Guenter
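
For anyone checking the Kconfig combinations discussed above, the intended
result can be modeled as a tiny tristate function. This is only a model of
the behavior described in the quoted mail, assuming the hidden glue option
simply follows USB_XHCI_PCI whenever the sensor option is not n; it is not
the actual Kconfig text, and the names are hypothetical.

```c
enum tristate { TRI_N, TRI_M, TRI_Y };

/*
 * Model of the hidden USB_XHCI_PCI_PROM21 value:
 *   sensors == n       -> glue is n (generic xhci-pci binds the controller)
 *   sensors == m or y  -> glue follows USB_XHCI_PCI (y stays y, m stays m)
 */
static enum tristate prom21_glue(enum tristate sensors, enum tristate xhci_pci)
{
	return sensors == TRI_N ? TRI_N : xhci_pci;
}
```

The case to watch is sensors=m with xhci_pci=y: the glue must come out as
y so that the auxiliary device exists before the hwmon module loads.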


^ permalink raw reply

* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-18 23:22 UTC (permalink / raw)
  To: Paul Moore
  Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
	linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhS1DJNs9gDB6gD9WKhL08giSVajBskZ+=mY0AWRCAsw7Q@mail.gmail.com>

On Mon, May 18, 2026 at 2:29 PM Paul Moore <paul@paul-moore.com> wrote:
[...]
> In my opinion, making killswitch an LSM is more of a procedural item
> that deals with how we view a capability like killswitch.  I
> personally view killswitch as somewhat similar to Lockdown, which is
> why I made the suggestion.
>
> The use of kprobes, while an interesting idea, presents problems as
> allowing any kernel symbol to be killed introduces the potential for
> security regressions.  As a reminder, some LSMs, as well as other
> kernel subsystems, have mechanisms in place to restrict root and/or
> enforce one-way configuration locks; while many people equate "root"
> with full control, in many cases today that is not strictly correct.
>
> Yes, kprobes have been around for some time, this is not a new
> problem, but killswitch makes it far more convenient and accessible to
> do dangerous things with kprobes.  If killswitch makes it past the RFC
> stage without any significant changes to its kill mechanism, we may
> need to start considering more liberal usage of NOKPROBE_SYMBOL()
> which I think would be an unfortunate casualty.

I don't think we can use NOKPROBE_SYMBOL(). There are functions
that we don't want to killswitch, but still want to trace.

Thanks,
Song

^ permalink raw reply

* Re: [PATCH v6 1/2] usb: xhci-pci: add AMD Promontory 21 PCI glue
From: Jihong Min @ 2026-05-18 23:07 UTC (permalink / raw)
  To: Michal Pecio
  Cc: Greg Kroah-Hartman, Mathias Nyman, Guenter Roeck, Jonathan Corbet,
	Shuah Khan, Mario Limonciello, Basavaraj Natikar, linux-usb,
	linux-hwmon, linux-doc, linux-pci, linux-kernel,
	Mario Limonciello (AMD), Yaroslav Isakov
In-Reply-To: <144ec61c-4cc1-4986-a16c-7c1b99f3a72e@gmail.com>



On 5/19/26 05:30, Jihong Min wrote:
> It seems that these three functions above are everything that you truly
> want to add; the rest is boilerplate required by this two-module scheme
> to work, plus ID tables which must be duplicated and kept in sync.
>
> I wonder if a separate module is really justified, as opposed to simply
> linking this file into xhci_pci.ko when directed by Kconfig.
>
> The downside would be slightly higher memory usage on systems where the
> hwmon driver is enabled but not needed. OTOH, same systems would likely
> see reduced disk waste.

One clarification about this part:

In my previous reply I said that I could rework this either way depending on
the USB maintainer preference. After thinking about it again, I think the
current direction is the better one.

Mathias's earlier review pushed this series away from adding PROM21-specific
hwmon support directly into the common xhci-pci path. I agree with that
direction. The common xhci-pci driver should not grow PROM21-specific sensor
logic.

The current split keeps the PROM21-specific auxiliary-device lifetime handling
in xhci-pci-prom21.c, keeps the hwmon implementation in drivers/hwmon, and
leaves xhci-pci.c with only the PCI ID handoff. That is also the closest match
to the existing Renesas handoff approach.

So, while I previously phrased this as something I would leave entirely to the
USB maintainers, my current preference is to keep the separate PROM21 PCI glue
driver unless Mathias or another USB maintainer specifically asks for a
different structure.


Sincerely,
Jihong Min

^ permalink raw reply

* Re: [PATCH v6 1/2] usb: xhci-pci: add AMD Promontory 21 PCI glue
From: Jihong Min @ 2026-05-18 23:06 UTC (permalink / raw)
  To: Michal Pecio
  Cc: Greg Kroah-Hartman, Mathias Nyman, Guenter Roeck, Jonathan Corbet,
	Shuah Khan, Mario Limonciello, Basavaraj Natikar, linux-usb,
	linux-hwmon, linux-doc, linux-pci, linux-kernel,
	Mario Limonciello (AMD), Yaroslav Isakov
In-Reply-To: <20260518233711.4c99cc72.michal.pecio@gmail.com>



On 5/19/26 06:37, Michal Pecio wrote:
> That's true.
> Making this possible is the whole purpose of "if IS_ENABLED" here:

I re-checked the Kconfig cases, and I think you are right here.

The two cases I was trying to avoid are:

  1. the sensor driver is built as a module, or loads only after the
     initramfs stage, but the PROM21 controller has already been bound by
     the generic xhci-pci driver, so no auxiliary device exists for the
     sensor driver to bind to;

  2. the built-in generic xhci-pci driver rejects the PROM21 controller, but
     xhci-pci-prom21 is only available as a module and is not present during
     initramfs, leaving USB behind that controller unavailable at that
     stage.

Looking at your proposed Kconfig shape again, it handles both cases.

If SENSORS_PROM21_XHCI=n, then no sensor support is requested and
USB_XHCI_PCI_PROM21 can stay disabled. In that case generic xhci-pci binds
the controller, which is fine because there is no sensor driver that needs an
auxiliary device.

If SENSORS_PROM21_XHCI=m or y and USB_XHCI_PCI=y, then
USB_XHCI_PCI_PROM21 follows USB_XHCI_PCI and becomes y. That means the PROM21
glue is available during early boot, creates the auxiliary device, and the
hwmon driver can still bind later if it is built as a module.

If USB_XHCI_PCI=m, then xhci-pci itself is modular. In that case needing the
PROM21 glue module in initramfs is not a PROM21-specific built-in/module split
problem; it is the normal requirement for a modular xHCI PCI setup.

So I agree that tying the hidden glue option to whether
SENSORS_PROM21_XHCI is enabled is reasonable.
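
As a rough illustration of the three cases above, the resolution can be
modelled in plain C. This is only a sketch and not part of the patch: the
enum and the helpers prom21_glue() and prom21_binder() are hypothetical
names, and the rule "the glue option follows USB_XHCI_PCI whenever
SENSORS_PROM21_XHCI is not n" is taken from the description in this mail,
not from the actual Kconfig text.

```c
/* Kconfig tristate values, ordered so that n < m < y. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/*
 * Hypothetical model of the hidden USB_XHCI_PCI_PROM21 option:
 * disabled when no sensor support is requested, otherwise it takes
 * the same value as USB_XHCI_PCI.
 */
enum tristate prom21_glue(enum tristate xhci_pci, enum tristate sensors)
{
	return sensors == TRI_N ? TRI_N : xhci_pci;
}

/*
 * Name of the PCI driver expected to bind the PROM21 controller in
 * this model. "none" means xHCI PCI support is not built at all.
 */
const char *prom21_binder(enum tristate xhci_pci, enum tristate sensors)
{
	if (xhci_pci == TRI_N)
		return "none";
	if (prom21_glue(xhci_pci, sensors) != TRI_N)
		return "xhci-pci-prom21";
	return "xhci-pci";
}
```

In this model, prom21_glue(TRI_Y, TRI_M) evaluates to TRI_Y, which is the
"built-in glue, modular hwmon driver" case, and prom21_binder(TRI_Y, TRI_N)
names the generic xhci-pci driver.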

> Currently, you have a weird situation where xhci-pci-prom21 always
> binds on x86 and xhci-pci on other platforms (with the unofficial PCIe
> card you mentioned), plus the sensor cannot work on other platforms.

Agreed. I also agree that the X86 dependency is only a heuristic and is not a
good restriction for a PCI ID based driver. PROM21 is mainly used on AMD x86
desktop platforms today, but the unofficial PCIe card example shows that the
device can exist outside the normal AMD x86 chipset topology.

I do not know whether other PROM21-related functionality is supported on
non-x86 platforms, but this driver does not need to prevent the xHCI
temperature sensor path from being built there.

> One could further argue that neither should it care whether some hwmon
> driver exists at all, or which kernel releases it exists in :)

Right. I think the cleanest result is:

  - generic xhci-pci handles PROM21 when no sensor support is requested;
  - xhci-pci-prom21 handles PROM21 only when the sensor path is enabled;
  - the hwmon driver remains the user-visible option.

Unless Guenter or the USB maintainers object, I plan to change the next
revision in that direction and test the Kconfig combinations locally.


Sincerely,
Jihong Min

^ permalink raw reply

* Re: [PATCH] nios2: remove the architecture
From: Simon Schuster @ 2026-05-18 17:24 UTC (permalink / raw)
  To: Peter Zijlstra, Arnd Bergmann, Ethan Nelson-Moore, Dinh Nguyen
  Cc: linux-doc, devicetree, workflows, Linux-Arch, dmaengine,
	linux-i2c, linux-iio, Netdev, linux-pci, linux-pwm,
	linux-hardening, linux-kbuild, linux-csky@vger.kernel.org,
	Jonathan Corbet, Shuah Khan, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Daniel Lezcano, Thomas Gleixner, Alex Shi,
	Yanteng Si, Dongliang Mu, Hu Haowen, Kees Cook, Oleg Nesterov,
	Will Deacon, Aneesh Kumar K.V (Arm), Andrew Morton,
	Nicholas Piggin, Vinod Koul, Frank Li, Dave Penkler, Andi Shyti,
	Jonathan Cameron, David Lechner, Nuno Sá, Andy Shevchenko,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Andreas Oetken
In-Reply-To: <20260518105735.GW3126523@noisy.programming.kicks-ass.net>

Hi Ethan, Arnd, Peter and Dinh,

On Mon, May 18, 2026 at 11:29:48AM +0200, Arnd Bergmann wrote:
> We last discussed this a year ago when Simon Schuster mentioned[1]
> that Siemens Energy is still using NIOS-2 in production and would
> prefer to have this still included in Linux for at least another
> few years until the obligation for kernel updates ends.

First off, thank you, Arnd, for remembering us as this patch series came
up and also to Dinh for his maintenance of the architecture!

Regarding our status in relation to nios2, Arnd's response already gives
you the gist:

We are well aware that the architecture was deprecated by Intel and are
therefore phasing it out in favour of more contemporary hardware.
I'm also fully aware of the uncertain future of 32-bit architectures as
a whole [0] and that this fate will come to nios2 sooner or later.
But as of now, the mainline support is still in very good shape.

On Mon, May 18, 2026 at 12:57:35PM +0200, Peter Zijlstra wrote:
> Isn't that what we have LTS branches for?

Unfortunately, as we are an infrastructure provider for civil energy
infrastructure, the refurbishment cycle is a bit slower than for
traditional consumer systems. This implies that the traditional LTS
support duration (max. Dec 2028 as of writing [1]) is rather short, and
we would be glad if we could keep the architecture in mainline for at
least 5 years and only then "decay" to LTS.

On Mon, May 18, 2026 at 11:29:48AM +0200, Arnd Bergmann wrote:
> My feeling is that the maintenance burden of keeping nios2 is
> relatively low. On the other hand, maintaining it out of tree
> as a patch set is also something that should not be all that
> hard if it does get removed.

Judging from the architecture's git history, it seems that it's
currently mainly touched by treewide refactors, which are extremely
helpful as we therefore do not have to piece these changes together 
downstream. In other respects, we try to be good citizens and contribute
bugfixes as well as required cleanups (such as implementing clone3 [2]
and fixing its flag behaviour on 32-bit architectures) as they come up.

If desired, we would also be happy to intensify our support regarding
reviews or testing to share the maintenance burden if it helps to keep
nios2 in mainline a bit longer.

Best regards,
Simon
 
0: https://lwn.net/Articles/1035727/
1: https://www.kernel.org/category/releases.html
2: https://lore.kernel.org/lkml/20250821-nios2-implement-clone3-v1-0-1bb24017376a@siemens-energy.com/

^ permalink raw reply

