* [PATCH 4/5] hwmon: add Stratix 10 SoC FPGA hardware monitor driver
From: tze.yee.ng @ 2026-06-19 9:38 UTC (permalink / raw)
To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
linux-hwmon, devicetree, linux-kernel, Dinh Nguyen, Mahesh Rao,
Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <cover.1781861409.git.tze.yee.ng@altera.com>
From: Tze Yee Ng <tze.yee.ng@altera.com>
Add a hardware monitoring driver for Altera Stratix 10 SoC FPGA devices
that reads temperature and voltage sensors through the Stratix 10 service
layer. Use the asynchronous service layer interface when available, with
a synchronous fallback.
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
---
Documentation/hwmon/index.rst | 1 +
Documentation/hwmon/stratix10-hwmon.rst | 31 ++
MAINTAINERS | 2 +
drivers/hwmon/Kconfig | 10 +
drivers/hwmon/Makefile | 1 +
drivers/hwmon/stratix10-hwmon.c | 575 ++++++++++++++++++++++++
6 files changed, 620 insertions(+)
create mode 100644 Documentation/hwmon/stratix10-hwmon.rst
create mode 100644 drivers/hwmon/stratix10-hwmon.c
diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 8b655e5d6b68..30f533301903 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -244,6 +244,7 @@ Hardware Monitoring Kernel Drivers
sparx5-temp
spd5118
stpddc60
+ stratix10-hwmon
surface_fan
sy7636a-hwmon
tc654
diff --git a/Documentation/hwmon/stratix10-hwmon.rst b/Documentation/hwmon/stratix10-hwmon.rst
new file mode 100644
index 000000000000..61b682fe177a
--- /dev/null
+++ b/Documentation/hwmon/stratix10-hwmon.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel driver stratix10-hwmon
+=============================
+
+Supported chips:
+
+ * Altera Stratix 10 SoC FPGA
+
+Authors:
+ - Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
+ - Tze Yee Ng <tze.yee.ng@altera.com>
+
+Description
+-----------
+
+This driver supports hardware monitoring for Altera Stratix 10 SoC FPGA
+devices through the Secure Device Manager and Stratix 10 service layer.
+
+The following sensor types are supported:
+
+ * temperature
+ * voltage
+
+Usage Notes
+-----------
+
+The driver relies on a device tree node to enumerate sensors present on the
+specific device. See
+Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml for details
+of the device-tree node.
diff --git a/MAINTAINERS b/MAINTAINERS
index 678f6c429627..5afdf286f8f9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -943,6 +943,8 @@ M: Tze Yee Ng <tze.yee.ng@altera.com>
L: linux-hwmon@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml
+F: Documentation/hwmon/stratix10-hwmon.rst
+F: drivers/hwmon/stratix10-hwmon.c
ALTERA MAILBOX DRIVER
M: Tien Sung Ang <tiensung.ang@altera.com>
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index 14e4cea48acc..8eff1c71a226 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -2112,6 +2112,16 @@ config SENSORS_SMSC47M192
This driver can also be built as a module. If so, the module
will be called smsc47m192.
+config SENSORS_ALTERA_SOCFPGA_STRATIX10
+ tristate "Altera SoC FPGA Stratix 10 hardware monitoring features"
+ depends on INTEL_STRATIX10_SERVICE
+ help
+ If you say yes here you get support for the temperature and
+ voltage sensors of Altera SoC FPGA Stratix 10 devices.
+
+ This driver can also be built as a module. If so, the module
+ will be called stratix10-hwmon.
+
config SENSORS_SMSC47B397
tristate "SMSC LPC47B397-NC"
depends on HAS_IOPORT
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index 982ee2c6f9de..7e643de0e7d4 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -217,6 +217,7 @@ obj-$(CONFIG_SENSORS_SMPRO) += smpro-hwmon.o
obj-$(CONFIG_SENSORS_SMSC47B397)+= smsc47b397.o
obj-$(CONFIG_SENSORS_SMSC47M1) += smsc47m1.o
obj-$(CONFIG_SENSORS_SMSC47M192)+= smsc47m192.o
+obj-$(CONFIG_SENSORS_ALTERA_SOCFPGA_STRATIX10) += stratix10-hwmon.o
obj-$(CONFIG_SENSORS_SPARX5) += sparx5-temp.o
obj-$(CONFIG_SENSORS_SPD5118) += spd5118.o
obj-$(CONFIG_SENSORS_STTS751) += stts751.o
diff --git a/drivers/hwmon/stratix10-hwmon.c b/drivers/hwmon/stratix10-hwmon.c
new file mode 100644
index 000000000000..7ed1116e57b8
--- /dev/null
+++ b/drivers/hwmon/stratix10-hwmon.c
@@ -0,0 +1,575 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Altera Stratix 10 SoC FPGA hardware monitoring driver
+ *
+ * Copyright (c) 2026 Altera Corporation
+ *
+ * Authors:
+ * Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
+ * Tze Yee Ng <tze.yee.ng@altera.com>
+ */
+
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
+#include <linux/firmware/intel/stratix10-svc-client.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/jiffies.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/of.h>
+#include <linux/platform_device.h>
+
+#define HWMON_TIMEOUT msecs_to_jiffies(SVC_HWMON_REQUEST_TIMEOUT_MS)
+#define HWMON_RETRY_SLEEP_MS 1U
+#define HWMON_ASYNC_MSG_RETRY 3U
+#define STRATIX10_HWMON_MAXSENSORS 16
+#define STRATIX10_HWMON_TEMPERATURE "temperature"
+#define STRATIX10_HWMON_VOLTAGE "voltage"
+#define STRATIX10_HWMON_CHANNEL_MASK GENMASK(15, 0)
+#define STRATIX10_HWMON_PAGE_SHIFT 16
+#define STRATIX10_HWMON_ATTR_VISIBLE 0444
+/* Temperature from SDM is signed Q8.8 millidegrees Celsius (8 fractional bits). */
+#define STRATIX10_HWMON_TEMP_FRAC_BITS 8
+#define STRATIX10_HWMON_TEMP_FRAC_DIV BIT(STRATIX10_HWMON_TEMP_FRAC_BITS)
+/* Voltage from SDM is unsigned Q16 (millivolts, 16 fractional bits). */
+#define STRATIX10_HWMON_VOLT_FRAC_BITS 16
+#define STRATIX10_HWMON_VOLT_FRAC_DIV BIT(STRATIX10_HWMON_VOLT_FRAC_BITS)
+
+#define ETEMP_INACTIVE 0x80000000U
+#define ETEMP_TOO_OLD 0x80000001U
+#define ETEMP_NOT_PRESENT 0x80000002U
+#define ETEMP_TIMEOUT 0x80000003U
+#define ETEMP_CORRUPT 0x80000004U
+#define ETEMP_BUSY 0x80000005U
+#define ETEMP_NOT_INITIALIZED 0x800000FFU
+
+struct stratix10_hwmon_priv {
+ struct stratix10_svc_chan *chan;
+ struct stratix10_svc_client client;
+ struct completion completion;
+ struct mutex lock; /* protect SVC calls */
+ bool async;
+ u32 temperature;
+ u32 voltage;
+ int temperature_channels;
+ int voltage_channels;
+ const char *temp_chan_names[STRATIX10_HWMON_MAXSENSORS];
+ const char *volt_chan_names[STRATIX10_HWMON_MAXSENSORS];
+ u32 temp_chan[STRATIX10_HWMON_MAXSENSORS];
+ u32 volt_chan[STRATIX10_HWMON_MAXSENSORS];
+};
+
+static umode_t stratix10_hwmon_is_visible(const void *dev,
+ enum hwmon_sensor_types type,
+ u32 attr, int chan)
+{
+ const struct stratix10_hwmon_priv *priv = dev;
+
+ switch (type) {
+ case hwmon_temp:
+ if (chan < priv->temperature_channels)
+ return STRATIX10_HWMON_ATTR_VISIBLE;
+ return 0;
+ case hwmon_in:
+ if (chan < priv->voltage_channels)
+ return STRATIX10_HWMON_ATTR_VISIBLE;
+ return 0;
+ default:
+ return 0;
+ }
+}
+
+static void stratix10_hwmon_readtemp_cb(struct stratix10_svc_client *client,
+ struct stratix10_svc_cb_data *data)
+{
+ struct stratix10_hwmon_priv *priv = client->priv;
+
+ if (data->status == BIT(SVC_STATUS_OK)) {
+ priv->temperature = (u32)*(unsigned long *)data->kaddr1;
+ } else if (data->kaddr1) {
+ dev_err(client->dev, "%s failed with status 0x%x, value 0x%lx\n",
+ __func__, data->status,
+ *(unsigned long *)data->kaddr1);
+ } else {
+ dev_err(client->dev, "%s failed with status 0x%x\n",
+ __func__, data->status);
+ }
+
+ complete(&priv->completion);
+}
+
+static void stratix10_hwmon_readvolt_cb(struct stratix10_svc_client *client,
+ struct stratix10_svc_cb_data *data)
+{
+ struct stratix10_hwmon_priv *priv = client->priv;
+
+ if (data->status == BIT(SVC_STATUS_OK)) {
+ priv->voltage = (u32)*(unsigned long *)data->kaddr1;
+ } else if (data->kaddr1) {
+ dev_err(client->dev, "%s failed with status 0x%x, value 0x%lx\n",
+ __func__, data->status,
+ *(unsigned long *)data->kaddr1);
+ } else {
+ dev_err(client->dev, "%s failed with status 0x%x\n",
+ __func__, data->status);
+ }
+
+ complete(&priv->completion);
+}
+
+static void stratix10_hwmon_async_callback(void *ptr)
+{
+ if (ptr)
+ complete(ptr);
+}
+
+static int stratix10_hwmon_parse_temp(long *val, u32 temperature)
+{
+ switch (temperature) {
+ case ETEMP_INACTIVE:
+ case ETEMP_NOT_PRESENT:
+ case ETEMP_CORRUPT:
+ case ETEMP_NOT_INITIALIZED:
+ return -EOPNOTSUPP;
+ case ETEMP_TIMEOUT:
+ case ETEMP_BUSY:
+ case ETEMP_TOO_OLD:
+ return -EAGAIN;
+ default:
+ /* Convert Q8.8 millidegrees Celsius to millidegrees for hwmon. */
+ *val = (long)(s32)temperature / STRATIX10_HWMON_TEMP_FRAC_DIV;
+ return 0;
+ }
+}
+
+static int stratix10_hwmon_encode_temp_arg(u32 reg, u64 *arg)
+{
+ u32 page = (reg >> STRATIX10_HWMON_PAGE_SHIFT) & STRATIX10_HWMON_CHANNEL_MASK;
+ u32 channel = reg & STRATIX10_HWMON_CHANNEL_MASK;
+
+ if (channel >= STRATIX10_HWMON_MAXSENSORS)
+ return -EINVAL;
+
+ *arg = (1ULL << channel) | ((u64)page << STRATIX10_HWMON_PAGE_SHIFT);
+ return 0;
+}
+
+static int stratix10_hwmon_encode_volt_arg(u32 reg, u64 *arg)
+{
+ u32 channel = reg & STRATIX10_HWMON_CHANNEL_MASK;
+
+ if (channel >= STRATIX10_HWMON_MAXSENSORS)
+ return -EINVAL;
+
+ *arg = 1ULL << channel;
+ return 0;
+}
+
+static int stratix10_hwmon_async_read(struct device *dev,
+ enum hwmon_sensor_types type,
+ struct stratix10_svc_client_msg *msg)
+{
+ struct stratix10_hwmon_priv *priv = dev_get_drvdata(dev);
+ struct stratix10_svc_cb_data data = {};
+ struct completion completion;
+ unsigned long wait_ret;
+ void *handle = NULL;
+ int status, index, ret;
+
+ init_completion(&completion);
+
+ for (index = 0; index < HWMON_ASYNC_MSG_RETRY; index++) {
+ status = stratix10_svc_async_send(priv->chan, msg, &handle,
+ stratix10_hwmon_async_callback,
+ &completion);
+ if (status == 0)
+ break;
+ dev_warn(dev, "Failed to send async message\n");
+ msleep(HWMON_RETRY_SLEEP_MS);
+ }
+
+ if (status && !handle)
+ return status;
+
+ wait_ret = wait_for_completion_io_timeout(&completion, HWMON_TIMEOUT);
+ if (wait_ret > 0)
+ dev_dbg(dev, "Received async interrupt\n");
+ else if (wait_ret == 0)
+ dev_dbg(dev, "Timeout occurred, trying to poll the response\n");
+
+ ret = -ETIMEDOUT;
+ for (index = 0; index < HWMON_ASYNC_MSG_RETRY; index++) {
+ status = stratix10_svc_async_poll(priv->chan, handle, &data);
+ if (status == -EAGAIN) {
+ dev_dbg(dev, "Async message is still in progress\n");
+ } else if (status < 0) {
+ dev_alert(dev, "Failed to poll async message: %d\n", status);
+ ret = status;
+ break;
+ } else if (status == 0) {
+ ret = 0;
+ break;
+ }
+ msleep(HWMON_RETRY_SLEEP_MS);
+ }
+
+ if (ret) {
+ dev_err(dev, "Failed to get async response\n");
+ goto done;
+ }
+
+ if (data.status) {
+ dev_err(dev, "%s returned 0x%x from SDM\n", __func__,
+ data.status);
+ ret = -EFAULT;
+ goto done;
+ }
+
+ if (type == hwmon_temp)
+ priv->temperature = (u32)*(unsigned long *)data.kaddr1;
+ else
+ priv->voltage = (u32)*(unsigned long *)data.kaddr1;
+
+ ret = 0;
+
+done:
+ stratix10_svc_async_done(priv->chan, handle);
+ return ret;
+}
+
+static int stratix10_hwmon_sync_read(struct device *dev,
+ enum hwmon_sensor_types type,
+ struct stratix10_svc_client_msg *msg)
+{
+ struct stratix10_hwmon_priv *priv = dev_get_drvdata(dev);
+ int ret;
+
+ reinit_completion(&priv->completion);
+
+ if (type == hwmon_temp)
+ priv->client.receive_cb = stratix10_hwmon_readtemp_cb;
+ else
+ priv->client.receive_cb = stratix10_hwmon_readvolt_cb;
+
+ ret = stratix10_svc_send(priv->chan, msg);
+ if (ret < 0)
+ goto status_done;
+
+ ret = wait_for_completion_interruptible_timeout(&priv->completion,
+ HWMON_TIMEOUT);
+ if (!ret) {
+ dev_err(priv->client.dev, "timeout waiting for SMC call\n");
+ ret = -ETIMEDOUT;
+ goto status_done;
+ }
+ if (ret < 0) {
+ dev_err(priv->client.dev, "error %d waiting for SMC call\n", ret);
+ goto status_done;
+ }
+
+ ret = 0;
+
+status_done:
+ stratix10_svc_done(priv->chan);
+ return ret;
+}
+
+static int stratix10_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int chan, long *val)
+{
+ struct stratix10_hwmon_priv *priv = dev_get_drvdata(dev);
+ struct stratix10_svc_client_msg msg = {0};
+ int ret;
+
+ if (chan >= STRATIX10_HWMON_MAXSENSORS)
+ return -EOPNOTSUPP;
+
+ switch (type) {
+ case hwmon_temp:
+ ret = stratix10_hwmon_encode_temp_arg(priv->temp_chan[chan],
+ &msg.arg[0]);
+ if (ret)
+ return ret;
+ msg.command = COMMAND_HWMON_READTEMP;
+ break;
+ case hwmon_in:
+ ret = stratix10_hwmon_encode_volt_arg(priv->volt_chan[chan],
+ &msg.arg[0]);
+ if (ret)
+ return ret;
+ msg.command = COMMAND_HWMON_READVOLT;
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ guard(mutex)(&priv->lock);
+ if (priv->async)
+ ret = stratix10_hwmon_async_read(dev, type, &msg);
+ else
+ ret = stratix10_hwmon_sync_read(dev, type, &msg);
+ if (ret)
+ return ret;
+
+ if (type == hwmon_temp)
+ ret = stratix10_hwmon_parse_temp(val, priv->temperature);
+ else
+ /* Convert Q16 millivolts to millivolts for hwmon. */
+ *val = (long)priv->voltage / STRATIX10_HWMON_VOLT_FRAC_DIV;
+ return ret;
+}
+
+static int stratix10_hwmon_read_string(struct device *dev,
+ enum hwmon_sensor_types type, u32 attr,
+ int chan, const char **str)
+{
+ struct stratix10_hwmon_priv *priv = dev_get_drvdata(dev);
+
+ switch (type) {
+ case hwmon_in:
+ *str = priv->volt_chan_names[chan];
+ return 0;
+ case hwmon_temp:
+ *str = priv->temp_chan_names[chan];
+ return 0;
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+static const struct hwmon_ops stratix10_hwmon_ops = {
+ .is_visible = stratix10_hwmon_is_visible,
+ .read = stratix10_hwmon_read,
+ .read_string = stratix10_hwmon_read_string,
+};
+
+static const struct hwmon_channel_info *stratix10_hwmon_info[] = {
+ HWMON_CHANNEL_INFO(temp,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL,
+ HWMON_T_INPUT | HWMON_T_LABEL),
+ HWMON_CHANNEL_INFO(in,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL,
+ HWMON_I_INPUT | HWMON_I_LABEL),
+ NULL
+};
+
+static const struct hwmon_chip_info stratix10_hwmon_chip_info = {
+ .ops = &stratix10_hwmon_ops,
+ .info = stratix10_hwmon_info,
+};
+
+static int stratix10_hwmon_add_channel(struct device *dev, const char *type,
+ u32 val, const char *label,
+ struct stratix10_hwmon_priv *priv)
+{
+ if (!strcmp(type, STRATIX10_HWMON_TEMPERATURE)) {
+ if (priv->temperature_channels >= STRATIX10_HWMON_MAXSENSORS) {
+ dev_warn(dev, "Can't add temp node %s, too many channels\n",
+ label);
+ return 0;
+ }
+
+ priv->temp_chan_names[priv->temperature_channels] = label;
+ priv->temp_chan[priv->temperature_channels] = val;
+ priv->temperature_channels++;
+ return 0;
+ }
+
+ if (!strcmp(type, STRATIX10_HWMON_VOLTAGE)) {
+ if (priv->voltage_channels >= STRATIX10_HWMON_MAXSENSORS) {
+ dev_warn(dev, "Can't add voltage node %s, too many channels\n",
+ label);
+ return 0;
+ }
+
+ priv->volt_chan_names[priv->voltage_channels] = label;
+ priv->volt_chan[priv->voltage_channels] = val;
+ priv->voltage_channels++;
+ return 0;
+ }
+
+ dev_warn(dev, "unsupported sensor type %s\n", type);
+ return 0;
+}
+
+static int stratix10_hwmon_probe_child_from_dt(struct device *dev,
+ struct device_node *child,
+ struct stratix10_hwmon_priv *priv)
+{
+ struct device_node *grandchild;
+ const char *label;
+ u32 val;
+ int ret;
+
+ for_each_child_of_node(child, grandchild) {
+ ret = of_property_read_u32(grandchild, "reg", &val);
+ if (ret) {
+ dev_err(dev, "missing reg property of %pOFn\n",
+ grandchild);
+ of_node_put(grandchild);
+ return ret;
+ }
+
+ ret = of_property_read_string(grandchild, "label", &label);
+ if (ret)
+ label = grandchild->name;
+
+ stratix10_hwmon_add_channel(dev, child->name, val, label, priv);
+ }
+
+ return 0;
+}
+
+static int stratix10_hwmon_probe_from_dt(struct device *dev,
+ struct stratix10_hwmon_priv *priv)
+{
+ struct device_node *child;
+ int ret;
+
+ if (!dev->of_node)
+ return 0;
+
+ for_each_child_of_node(dev->of_node, child) {
+ ret = stratix10_hwmon_probe_child_from_dt(dev, child, priv);
+ if (ret) {
+ of_node_put(child);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static int stratix10_hwmon_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct stratix10_hwmon_priv *priv;
+ struct device *hwmon_dev;
+ int ret;
+
+ priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->client.dev = dev;
+ priv->client.priv = priv;
+ init_completion(&priv->completion);
+ mutex_init(&priv->lock);
+
+ ret = stratix10_hwmon_probe_from_dt(dev, priv);
+ if (ret) {
+ dev_err(dev, "Unable to probe from device tree\n");
+ return ret;
+ }
+
+ if (!priv->temperature_channels && !priv->voltage_channels) {
+ dev_err(dev, "no temperature or voltage channels in device tree\n");
+ return -ENODEV;
+ }
+
+ priv->chan = stratix10_svc_request_channel_byname(&priv->client,
+ SVC_CLIENT_HWMON);
+ if (IS_ERR(priv->chan)) {
+ ret = PTR_ERR(priv->chan);
+ if (ret == -EPROBE_DEFER)
+ dev_dbg(dev, "service channel %s not ready, deferring probe\n",
+ SVC_CLIENT_HWMON);
+ else
+ dev_err(dev, "couldn't get service channel %s: %d\n",
+ SVC_CLIENT_HWMON, ret);
+ return ret;
+ }
+
+ ret = stratix10_svc_add_async_client(priv->chan, false);
+ switch (ret) {
+ case 0:
+ priv->async = true;
+ break;
+ case -EINVAL:
+ dev_dbg(dev, "async operations not supported, using sync mode\n");
+ priv->async = false;
+ break;
+ default:
+ dev_err(dev, "failed to add async client: %d\n", ret);
+ stratix10_svc_free_channel(priv->chan);
+ return ret;
+ }
+
+ dev_info(dev, "Initialized %d temperature and %d voltage channels\n",
+ priv->temperature_channels, priv->voltage_channels);
+
+ hwmon_dev = devm_hwmon_device_register_with_info(dev, "stratix10_hwmon",
+ priv,
+ &stratix10_hwmon_chip_info,
+ NULL);
+ if (IS_ERR(hwmon_dev)) {
+ if (priv->async)
+ stratix10_svc_remove_async_client(priv->chan);
+ stratix10_svc_free_channel(priv->chan);
+ return PTR_ERR(hwmon_dev);
+ }
+
+ platform_set_drvdata(pdev, priv);
+ return 0;
+}
+
+static void stratix10_hwmon_remove(struct platform_device *pdev)
+{
+ struct stratix10_hwmon_priv *priv = platform_get_drvdata(pdev);
+
+ if (priv->async)
+ stratix10_svc_remove_async_client(priv->chan);
+ stratix10_svc_free_channel(priv->chan);
+}
+
+static const struct of_device_id stratix10_hwmon_of_match[] = {
+ { .compatible = "altr,stratix10-hwmon" },
+ {}
+};
+MODULE_DEVICE_TABLE(of, stratix10_hwmon_of_match);
+
+static struct platform_driver stratix10_hwmon_driver = {
+ .driver = {
+ .name = "stratix10-hwmon",
+ .of_match_table = stratix10_hwmon_of_match,
+ },
+ .probe = stratix10_hwmon_probe,
+ .remove = stratix10_hwmon_remove,
+};
+module_platform_driver(stratix10_hwmon_driver);
+
+MODULE_AUTHOR("Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>");
+MODULE_AUTHOR("Tze Yee Ng <tze.yee.ng@altera.com>");
+MODULE_DESCRIPTION("Altera Stratix 10 SoC FPGA hardware monitoring driver");
+MODULE_LICENSE("GPL");
--
2.43.7
^ permalink raw reply related
* [PATCH 3/5] firmware: stratix10-svc: add async HWMON read commands
From: tze.yee.ng @ 2026-06-19 9:38 UTC (permalink / raw)
To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
linux-hwmon, devicetree, linux-kernel, Dinh Nguyen, Mahesh Rao,
Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <cover.1781861409.git.tze.yee.ng@altera.com>
From: Tze Yee Ng <tze.yee.ng@altera.com>
Add asynchronous Stratix 10 service layer support for hardware monitor
temperature and voltage read commands in stratix10_svc_async_send() and
stratix10_svc_async_prepare_response().
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
---
drivers/firmware/stratix10-svc.c | 12 +++++++
include/linux/firmware/intel/stratix10-smc.h | 38 ++++++++++++++++++++
2 files changed, 50 insertions(+)
diff --git a/drivers/firmware/stratix10-svc.c b/drivers/firmware/stratix10-svc.c
index e9e35d67ef96..2cfdac31402c 100644
--- a/drivers/firmware/stratix10-svc.c
+++ b/drivers/firmware/stratix10-svc.c
@@ -1311,6 +1311,14 @@ int stratix10_svc_async_send(struct stratix10_svc_chan *chan, void *msg,
args.a0 = INTEL_SIP_SMC_ASYNC_RSU_NOTIFY;
args.a2 = p_msg->arg[0];
break;
+ case COMMAND_HWMON_READTEMP:
+ args.a0 = INTEL_SIP_SMC_ASYNC_HWMON_READTEMP;
+ args.a2 = p_msg->arg[0];
+ break;
+ case COMMAND_HWMON_READVOLT:
+ args.a0 = INTEL_SIP_SMC_ASYNC_HWMON_READVOLT;
+ args.a2 = p_msg->arg[0];
+ break;
default:
dev_err(ctrl->dev, "Invalid command ,%d\n", p_msg->command);
ret = -EINVAL;
@@ -1404,6 +1412,10 @@ static int stratix10_svc_async_prepare_response(struct stratix10_svc_chan *chan,
*/
data->kaddr1 = (void *)&handle->res;
break;
+ case COMMAND_HWMON_READTEMP:
+ case COMMAND_HWMON_READVOLT:
+ data->kaddr1 = (void *)&handle->res.a2;
+ break;
default:
dev_alert(ctrl->dev, "Invalid command\n ,%d", p_msg->command);
diff --git a/include/linux/firmware/intel/stratix10-smc.h b/include/linux/firmware/intel/stratix10-smc.h
index 935dba3633b5..4eb3a6e9659d 100644
--- a/include/linux/firmware/intel/stratix10-smc.h
+++ b/include/linux/firmware/intel/stratix10-smc.h
@@ -680,6 +680,44 @@ INTEL_SIP_SMC_FAST_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_COMPLETED_WRITE)
#define INTEL_SIP_SMC_ASYNC_POLL \
INTEL_SIP_SMC_ASYNC_VAL(INTEL_SIP_SMC_ASYNC_FUNC_ID_POLL)
+/**
+ * Request INTEL_SIP_SMC_ASYNC_HWMON_READTEMP
+ * Async call to request temperature
+ *
+ * Call register usage:
+ * a0 INTEL_SIP_SMC_ASYNC_HWMON_READTEMP
+ * a1 transaction job id
+ * a2 Temperature Channel
+ * a3-a17 not used
+ *
+ * Return status
+ * a0 INTEL_SIP_SMC_STATUS_OK, INTEL_SIP_SMC_STATUS_REJECTED
+ * or INTEL_SIP_SMC_STATUS_BUSY
+ * a1-a17 not used
+ */
+#define INTEL_SIP_SMC_ASYNC_FUNC_ID_HWMON_READTEMP 0xE8
+#define INTEL_SIP_SMC_ASYNC_HWMON_READTEMP \
+ INTEL_SIP_SMC_ASYNC_VAL(INTEL_SIP_SMC_ASYNC_FUNC_ID_HWMON_READTEMP)
+
+/**
+ * Request INTEL_SIP_SMC_ASYNC_HWMON_READVOLT
+ * Async call to request voltage
+ *
+ * Call register usage:
+ * a0 INTEL_SIP_SMC_ASYNC_HWMON_READVOLT
+ * a1 transaction job id
+ * a2 Voltage Channel
+ * a3-a17 not used
+ *
+ * Return status
+ * a0 INTEL_SIP_SMC_STATUS_OK, INTEL_SIP_SMC_STATUS_REJECTED
+ * or INTEL_SIP_SMC_STATUS_BUSY
+ * a1-a17 not used
+ */
+#define INTEL_SIP_SMC_ASYNC_FUNC_ID_HWMON_READVOLT 0xE9
+#define INTEL_SIP_SMC_ASYNC_HWMON_READVOLT \
+ INTEL_SIP_SMC_ASYNC_VAL(INTEL_SIP_SMC_ASYNC_FUNC_ID_HWMON_READVOLT)
+
/**
* Request INTEL_SIP_SMC_ASYNC_RSU_GET_SPT
* Async call to get RSU SPT from SDM.
--
2.43.7
^ permalink raw reply related
* [PATCH 2/5] dt-bindings: firmware: svc: add hwmon property
From: tze.yee.ng @ 2026-06-19 9:38 UTC (permalink / raw)
To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
linux-hwmon, devicetree, linux-kernel, Dinh Nguyen, Mahesh Rao,
Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <cover.1781861409.git.tze.yee.ng@altera.com>
From: Tze Yee Ng <tze.yee.ng@altera.com>
Altera Stratix 10 SoCFPGA supports hardware monitor access through the
service layer mailbox. Add an optional hwmon child node to the service
layer binding so device trees can describe the hardware monitor.
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
---
.../devicetree/bindings/firmware/intel,stratix10-svc.yaml | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.yaml b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.yaml
index b42cfa78b28b..86ffdb10132f 100644
--- a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.yaml
+++ b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.yaml
@@ -62,6 +62,10 @@ properties:
$ref: /schemas/fpga/intel,stratix10-soc-fpga-mgr.yaml
description: Optional child node for fpga manager to perform fabric configuration.
+ hwmon:
+ $ref: /schemas/hwmon/altr,stratix10-hwmon.yaml
+ description: Optional child node for Stratix 10 hardware monitor.
+
required:
- compatible
- method
--
2.43.7
^ permalink raw reply related
* [PATCH 1/5] dt-bindings: hwmon: add Altera Stratix 10 hardware monitor binding
From: tze.yee.ng @ 2026-06-19 9:38 UTC (permalink / raw)
To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
linux-hwmon, devicetree, linux-kernel, Dinh Nguyen, Mahesh Rao,
Jonathan Corbet, Shuah Khan, linux-doc
In-Reply-To: <cover.1781861409.git.tze.yee.ng@altera.com>
From: Tze Yee Ng <tze.yee.ng@altera.com>
Document the device tree binding for the Altera Stratix 10 SoC FPGA
hardware monitor, including temperature and voltage sensor nodes.
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com>
---
.../bindings/hwmon/altr,stratix10-hwmon.yaml | 164 ++++++++++++++++++
MAINTAINERS | 7 +
2 files changed, 171 insertions(+)
create mode 100644 Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml
diff --git a/Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml b/Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml
new file mode 100644
index 000000000000..5bd98660ee7b
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml
@@ -0,0 +1,164 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/altr,stratix10-hwmon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Altera Hardware Monitor for Stratix 10 SoC FPGA
+
+maintainers:
+ - Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
+ - Tze Yee Ng <tze.yee.ng@altera.com>
+
+description: |
+ The Altera Stratix 10 SoC FPGA hardware monitor unit provides on-chip
+ voltage and temperature sensors. These sensors can be used to monitor
+ external voltages and on-chip operating conditions such as internal
+ power rails and on-chip junction temperatures.
+
+ The specific sensor configuration varies by device. Check the device
+ documentation to verify which sensors are available.
+
+ Stratix 10 voltage sensors:
+
+ page 0, channel 2 = 0.8V VCC
+ page 0, channel 3 = 1.8V VCCIO_SDM
+ page 0, channel 6 = 0.9V VCCERAM
+
+ Stratix 10 temperature sensors:
+
+ page 0, channel 0 = main die
+ page 0, channel 1 = tile bottom left
+ page 0, channel 2 = tile middle left
+ page 0, channel 3 = tile top left
+ page 0, channel 4 = tile bottom right
+ page 0, channel 5 = tile middle right
+ page 0, channel 6 = tile top right
+ page 0, channel 7 = hbm2 bottom
+ page 0, channel 8 = hbm2 top
+
+properties:
+ compatible:
+ const: altr,stratix10-hwmon
+
+ temperature:
+ description:
+ The temperature node specifies mappings of temperature sensor diodes on
+ the Stratix 10 SoC FPGA main die and tile die.
+ type: object
+ properties:
+ '#address-cells':
+ const: 1
+ '#size-cells':
+ const: 0
+ patternProperties:
+ "^input(@[0-9a-f]+)?$":
+ description:
+ The input node specifies each individual temperature sensor.
+ type: object
+ properties:
+ reg:
+ description:
+ Sensor channel index in the lower 16-bits (0-15). For temperature
+ sensors, the page number is encoded in the upper 16-bits.
+ The driver encodes the SMC request argument as a channel
+ bitmask (1 << channel) in bits 0..15, with the page number
+ placed in bits 16..31. Channel values >= 16 are rejected to
+ avoid overlap with the page field. For example, reg = <2>
+ selects channel 2 and the driver passes 0x4 to the service layer.
+ label:
+ description:
+ A descriptive name for this channel (e.g. "Main Die" or
+ "Tile Bottom Left").
+ required:
+ - reg
+ additionalProperties: false
+ required:
+ - '#address-cells'
+ - '#size-cells'
+ additionalProperties: false
+
+ voltage:
+ description:
+ The voltage node specifies mappings of voltage sensors on the Stratix 10
+ SoC FPGA analog to digital converter of the Secure Device Manager (SDM).
+ type: object
+ properties:
+ '#address-cells':
+ const: 1
+ '#size-cells':
+ const: 0
+ patternProperties:
+ "^input(@[0-9a-f]+)?$":
+ description:
+ The input node specifies each individual voltage sensor.
+ type: object
+ properties:
+ reg:
+ description:
+ Sensor channel index in the lower 16-bits (0-15). The driver
+ encodes the SMC request argument as a channel bitmask
+ (1 << channel). For example, reg = <2> selects channel 2 and
+ the driver passes 0x4 to the service layer.
+ label:
+ description:
+ A descriptive name for this channel (e.g. "0.8V VCC" or
+ "1.8V VCCIO_SDM").
+ required:
+ - reg
+ additionalProperties: false
+ required:
+ - '#address-cells'
+ - '#size-cells'
+ additionalProperties: false
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ hwmon {
+ compatible = "altr,stratix10-hwmon";
+
+ voltage {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ input@2 {
+ label = "0.8V VCC";
+ reg = <2>;
+ };
+
+ input@3 {
+ label = "1.8V VCCIO_SDM";
+ reg = <3>;
+ };
+
+ input@6 {
+ label = "0.9V VCCERAM";
+ reg = <6>;
+ };
+ };
+
+ temperature {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ input@0 {
+ label = "Main Die";
+ reg = <0>;
+ };
+
+ input@1 {
+ label = "Tile Bottom Left";
+ reg = <1>;
+ };
+
+ input@2 {
+ label = "Tile Middle Left";
+ reg = <2>;
+ };
+ };
+ };
diff --git a/MAINTAINERS b/MAINTAINERS
index 6aa3fe2ee1bb..678f6c429627 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -937,6 +937,13 @@ ALPS PS/2 TOUCHPAD DRIVER
R: Pali Rohár <pali@kernel.org>
F: drivers/input/mouse/alps.*
+ALTERA STRATIX 10 SoC FPGA HWMON DRIVER
+M: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
+M: Tze Yee Ng <tze.yee.ng@altera.com>
+L: linux-hwmon@vger.kernel.org
+S: Maintained
+F: Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml
+
ALTERA MAILBOX DRIVER
M: Tien Sung Ang <tiensung.ang@altera.com>
S: Maintained
--
2.43.7
^ permalink raw reply related
* [PATCH 0/5] hwmon: add Altera Stratix 10 SoC FPGA hardware monitor support
From: tze.yee.ng @ 2026-06-19 9:38 UTC (permalink / raw)
To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
linux-hwmon, devicetree, linux-kernel, Dinh Nguyen, Mahesh Rao,
Jonathan Corbet, Shuah Khan, linux-doc
From: Tze Yee Ng <tze.yee.ng@altera.com>
This series adds hardware monitor support for Altera Stratix 10 SoC FPGA
devices. Temperature and voltage sensors are accessed through the
Stratix 10 service layer and Secure Device Manager.
Patch 1 adds the device tree binding for the hwmon node and sensor layout.
Patch 2 extends the Stratix 10 service layer binding with an optional
hwmon child node. Patch 3 adds async HWMON read commands to the service
firmware driver. Patch 4 adds the hwmon driver, using the async service
interface when available and falling back to synchronous reads otherwise.
Patch 5 enables the hwmon node on the Stratix 10 SoCDK.
Tze Yee Ng (5):
dt-bindings: hwmon: add Altera Stratix 10 hardware monitor binding
dt-bindings: firmware: svc: add hwmon property
firmware: stratix10-svc: add async HWMON read commands
hwmon: add Stratix 10 SoC FPGA hardware monitor driver
arm64: dts: socfpga: stratix10: add hwmon node
.../firmware/intel,stratix10-svc.yaml | 4 +
.../bindings/hwmon/altr,stratix10-hwmon.yaml | 164 +++++
Documentation/hwmon/index.rst | 1 +
Documentation/hwmon/stratix10-hwmon.rst | 31 +
MAINTAINERS | 9 +
.../boot/dts/altera/socfpga_stratix10.dtsi | 5 +
.../dts/altera/socfpga_stratix10_socdk.dts | 33 +
drivers/firmware/stratix10-svc.c | 12 +
drivers/hwmon/Kconfig | 10 +
drivers/hwmon/Makefile | 1 +
drivers/hwmon/stratix10-hwmon.c | 575 ++++++++++++++++++
include/linux/firmware/intel/stratix10-smc.h | 38 ++
12 files changed, 883 insertions(+)
create mode 100644 Documentation/devicetree/bindings/hwmon/altr,stratix10-hwmon.yaml
create mode 100644 Documentation/hwmon/stratix10-hwmon.rst
create mode 100644 drivers/hwmon/stratix10-hwmon.c
--
2.43.7
^ permalink raw reply
* Re: [PATCH v4 0/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Pierre Gondois @ 2026-06-19 9:29 UTC (permalink / raw)
To: Viresh Kumar, Sumit Gupta
Cc: rafael, ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan,
rdunlap, mario.limonciello, linux-pm, linux-doc, linux-kernel,
linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu
In-Reply-To: <oxw5k2wad4vorehgmrduoxblequy3ynqufwy4sruclnh5d5wrb@awzmfafoucnn>
On 6/18/26 07:28, Viresh Kumar wrote:
> On 16-06-26, 18:22, Sumit Gupta wrote:
>> The dependency it was waiting on, the "cpufreq: Set policy->min and
>> max as real QoS constraints" series, is now in linux-pm (linux-next).
>> I rebased on top and verified autonomous mode works as expected, and
>> it applies cleanly on the current linux-next.
>>
>> The [1] reference in patch 2/2 points to v2 of that series; the merged
>> version is v3 [2].
>>
>> If there are no further comments, please consider acking and queuing
>> this for the next cycle.
> I was waiting for CPPC reviewers to provide some feedback.i
>
> Jie / Lifeng / Pierre ?
>
I think the patchset has the same issue described at:
https://lore.kernel.org/all/86780f97-29ee-4a72-b311-38c89434b707@arm.com/
I don't know if this is important to other persons,
but IMO it would be preferable to have a solution to this issue
before adding more functionalities relying on registers that are left
in an unknown state.
If there are any other opinion ?
^ permalink raw reply
* Re: [PATCH v8 13/46] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-06-19 9:25 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-13-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
>
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
>
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
>
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
>
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Note sure if it's user error on my part, if I'm applying this to the
wrong base, but I found a build break here on patch 13:
kvm_gmem_invalidate_start() doesn't exist in the base tree. The
function is kvm_gmem_invalidate_begin() here. The rename
(190cc5370a8b6) landed via a different merge path and isn't an
ancestor of the stated base.
Patches 19 and 20 have the same mismatch. Fix for all three is
s/kvm_gmem_invalidate_start/kvm_gmem_invalidate_begin/.
Cheers,
/fuad
> ---
> include/uapi/linux/kvm.h | 13 ++++++
> virt/kvm/Kconfig | 1 +
> virt/kvm/guest_memfd.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 12 +++++
> 4 files changed, 142 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 419011097fa8e..956877a6aab05 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1649,6 +1649,19 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +#define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
> +
> +struct kvm_memory_attributes2 {
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> +};
> +
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 297e4399fbd49..cfa2c78ba5fb9 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -102,6 +102,7 @@ config KVM_MMU_LOCKLESS_AGING
>
> config KVM_GUEST_MEMFD
> select XARRAY_MULTI
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 65ce795c090d9..0d14548c1ed22 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -541,11 +541,127 @@ bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_is_private);
>
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas. Adjacent ranges with attributes identical to the new attributes
> + * will be merged. Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> + pgoff_t start, size_t nr_pages)
> +{
> + pgoff_t end = start + nr_pages;
> + pgoff_t last = end - 1;
> + void *entry;
> +
> + /* Try extending range. entry is NULL on overflow/wrap-around. */
> + mas_set(mas, end);
> + entry = mas_find(mas, end);
> + if (entry && xa_to_value(entry) == attributes)
> + last = mas->last;
> +
> + if (start > 0) {
> + mas_set(mas, start - 1);
> + entry = mas_find(mas, start - 1);
> + if (entry && xa_to_value(entry) == attributes)
> + start = mas->index;
> + }
> +
> + mas_set_range(mas, start, last);
> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> + size_t nr_pages, uint64_t attrs)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct gmem_inode *gi = GMEM_I(inode);
> + pgoff_t end = start + nr_pages;
> + struct maple_tree *mt;
> + struct ma_state mas;
> + int r;
> +
> + mt = &gi->attributes;
> +
> + filemap_invalidate_lock(mapping);
> +
> + mas_init(&mas, mt, start);
> + r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> + if (r)
> + goto out;
> +
> + /*
> + * From this point on guest_memfd has performed necessary
> + * checks and can proceed to do guest-breaking changes.
> + */
> +
> + kvm_gmem_invalidate_start(inode, start, end);
> + mas_store_prealloc(&mas, xa_mk_value(attrs));
> + kvm_gmem_invalidate_end(inode, start, end);
> +out:
> + filemap_invalidate_unlock(mapping);
> + return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> + struct gmem_file *f = file->private_data;
> + struct inode *inode = file_inode(file);
> + struct kvm_memory_attributes2 attrs;
> + size_t nr_pages;
> + pgoff_t index;
> + int i;
> +
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + return -EFAULT;
> +
> + if (attrs.flags)
> + return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> + if (attrs.reserved[i])
> + return -EINVAL;
> + }
> + if (!kvm_arch_has_private_mem(f->kvm))
> + return -EINVAL;
> + if (attrs.attributes & ~KVM_MEMORY_ATTRIBUTE_PRIVATE)
> + return -EINVAL;
> + if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> + return -EINVAL;
> +
> + if (attrs.offset >= i_size_read(inode) ||
> + attrs.offset + attrs.size > i_size_read(inode))
> + return -EINVAL;
> +
> + nr_pages = attrs.size >> PAGE_SHIFT;
> + index = attrs.offset >> PAGE_SHIFT;
> + return __kvm_gmem_set_attributes(inode, index, nr_pages,
> + attrs.attributes);
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> + unsigned long arg)
> +{
> + switch (ioctl) {
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> + if (!gmem_in_place_conversion)
> + return -ENOTTY;
> +
> + return kvm_gmem_set_attributes(file, (void __user *)arg);
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> + .unlocked_ioctl = kvm_gmem_ioctl,
> };
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 01761f6e25d25..a08b518cdb175 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -105,6 +105,18 @@ module_param(allow_unsafe_mappings, bool, 0444);
> bool __ro_after_init gmem_in_place_conversion = false;
> #endif
>
> +#define MEMORY_ATTRIBUTES_MATCH(one, two) \
> + static_assert(offsetof(struct kvm_memory_attributes, one) == \
> + offsetof(struct kvm_memory_attributes2, two)); \
> + static_assert(sizeof_field(struct kvm_memory_attributes, one) ==\
> + sizeof_field(struct kvm_memory_attributes2, two))
> +
> +/* Ensure the common parts of the two structs are identical. */
> +MEMORY_ATTRIBUTES_MATCH(address, address);
> +MEMORY_ATTRIBUTES_MATCH(size, size);
> +MEMORY_ATTRIBUTES_MATCH(attributes, attributes);
> +MEMORY_ATTRIBUTES_MATCH(flags, flags);
> +
> /*
> * Ordering of locks:
> *
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 06/16] iio: core: create local __iio_chan_prefix_emit() for reuse
From: Nuno Sá @ 2026-06-19 9:20 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Andy Shevchenko, rodrigo.alencar, linux-iio, devicetree,
linux-kernel, linux-doc, linux-hardening, Lars-Peter Clausen,
Michael Hennerich, Jonathan Cameron, David Lechner,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
Gustavo A. R. Silva
In-Reply-To: <dlisetsssjoyodmv5ubl4rzhxtla3g46mrrzv2f65nqecel5fu@dqiqsayr4aip>
On Fri, Jun 19, 2026 at 08:43:24AM +0100, Rodrigo Alencar wrote:
> On 18/06/26 21:14, Andy Shevchenko wrote:
> > On Thu, Jun 18, 2026 at 05:14:19PM +0100, Rodrigo Alencar wrote:
> > > On 18/06/26 16:06, Nuno Sá wrote:
> > > > On Thu, Jun 18, 2026 at 02:27:22PM +0100, Rodrigo Alencar via B4 Relay wrote:
> >
> > ...
> >
> > > > > + dev_attr->attr.name = kasprintf(GFP_KERNEL, "%s%s", prefix, postfix);
> > > > > + if (!dev_attr->attr.name)
> > > > > return -ENOMEM;
> > > >
> > > > I don't oppose the change. Looks like a nice cleanup.
> >
> > May I oppose it? I found use scnprintf() is harder to follow in comparison to
> > nice kasprintf() that takes care for the dynamically allocated buffer.
>
> In the next patch the function is reused in a sysfs attribute read handler,
> a context wich would not be nice to have dynamic allocation. vscnprintf() is
> the main building block of sysfs_emit() which limits the buffer length to
> a page size, so I used scnprintf() trying not to deviate much from that.
>
> kasprintf() it is still used in the caller, where the logic was a bit confusing
> as it tried to avoid multiple allocations.
>
> > Also there is a chance to get a name silently cut due to insufficient space.
> > Besides that this function can't be used (again due to 'c') in kasprintf()-like
> > wrapper. I do not consider this as a good approach. Have you looked at seq_buf
> > instead?
>
> NAME_MAX is not the maximum length a filename can have? I suppose there should be
> enough space for the channel-prefix. Indeed, seq_buf can be used and it cleans up
> things a bit as it tracks the the position in the buffer.
>
> >
> > > > But bear in mind this very sensible as any subtle mistake means ABI breakage.
> >
> > Which immediately raises a question of test coverage. Do we have one? If not,
> > this code must be accompanied with one.
>
> Agreed. Will see to have tests for v7.
libiio now has an emulator backend. Maybe something that can be used to
test this. But ideally we can have some kunit for validation.
- Nuno Sá
>
> > > Yes! I tried to be careful... this is dangerous stuff!
>
> --
> Kind regards,
>
> Rodrigo Alencar
^ permalink raw reply
* Re: [PATCH v6 06/16] iio: core: create local __iio_chan_prefix_emit() for reuse
From: Nuno Sá @ 2026-06-19 9:19 UTC (permalink / raw)
To: Andy Shevchenko
Cc: Rodrigo Alencar, rodrigo.alencar, linux-iio, devicetree,
linux-kernel, linux-doc, linux-hardening, Lars-Peter Clausen,
Michael Hennerich, Jonathan Cameron, David Lechner,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
Gustavo A. R. Silva
In-Reply-To: <ajQ1bZSNHQ96pyJx@ashevche-desk.local>
On Thu, Jun 18, 2026 at 09:14:05PM +0300, Andy Shevchenko wrote:
> On Thu, Jun 18, 2026 at 05:14:19PM +0100, Rodrigo Alencar wrote:
> > On 18/06/26 16:06, Nuno Sá wrote:
> > > On Thu, Jun 18, 2026 at 02:27:22PM +0100, Rodrigo Alencar via B4 Relay wrote:
>
> ...
>
> > > > + dev_attr->attr.name = kasprintf(GFP_KERNEL, "%s%s", prefix, postfix);
> > > > + if (!dev_attr->attr.name)
> > > > return -ENOMEM;
> > >
> > > I don't oppose the change. Looks like a nice cleanup.
>
> May I oppose it? I found use scnprintf() is harder to follow in comparison to
> nice kasprintf() that takes care for the dynamically allocated buffer.
Tend to agree a bit given I was used to the older code. So matching the
old logic with the new one is an exercise, yes.
>
> Also there is a chance to get a name silently cut due to insufficient space.
> Besides that this function can't be used (again due to 'c') in kasprintf()-like
> wrapper. I do not consider this as a good approach. Have you looked at seq_buf
> instead?
Not so sure the above bothers me that much.
>
> > > But bear in mind this very sensible as any subtle mistake means ABI breakage.
>
> Which immediately raises a question of test coverage. Do we have one? If not,
> this code must be accompanied with one.
The above is the more concerning part to me.
- Nuno Sá
>
> > Yes! I tried to be careful... this is dangerous stuff!
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
^ permalink raw reply
* Re: [PATCH v6 06/16] iio: core: create local __iio_chan_prefix_emit() for reuse
From: Nuno Sá @ 2026-06-19 9:16 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: rodrigo.alencar, linux-iio, devicetree, linux-kernel, linux-doc,
linux-hardening, Lars-Peter Clausen, Michael Hennerich,
Jonathan Cameron, David Lechner, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
Shuah Khan, Kees Cook, Gustavo A. R. Silva
In-Reply-To: <x3aijvc4buo7aqbchikuoyyrgiq3afidtkla37h2rg4tvfdbc3@h42qp3estg2s>
On Thu, Jun 18, 2026 at 05:14:19PM +0100, Rodrigo Alencar wrote:
> On 18/06/26 16:06, Nuno Sá wrote:
> > On Thu, Jun 18, 2026 at 02:27:22PM +0100, Rodrigo Alencar via B4 Relay wrote:
> > > From: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > >
> > > Move logic to create a channel prefix for naming attribute files into a
> > > separate __iio_chan_prefix_emit() function for reuse.
>
> ...
>
> > > +static int __iio_chan_prefix_emit(const struct iio_chan_spec *chan,
> > > + enum iio_shared_by shared_by,
> > > + char *buf, size_t len)
> > > +{
> > > + const char *dir = iio_direction[chan->output];
> > > + const char *type = iio_chan_type_name_spec[chan->type];
> > > + int n = 0;
> > > +
> > > + switch (shared_by) {
> > > + case IIO_SHARED_BY_ALL:
> > > + buf[0] = '\0'; /* empty channel prefix */
> > > + break;
> > > + case IIO_SHARED_BY_DIR:
> > > + n = scnprintf(buf, len, "%s", dir);
> > > + break;
> > > + case IIO_SHARED_BY_TYPE:
> > > + n = scnprintf(buf, len, "%s_%s", dir, type);
> > > + if (chan->differential)
> > > + n += scnprintf(buf + n, len - n, "-%s", type);
> > > + break;
> > > + case IIO_SEPARATE:
> > > + if (chan->indexed) {
> > > + n = scnprintf(buf, len, "%s_%s%d", dir, type,
> > > + chan->channel);
> > > + if (chan->differential)
> > > + n += scnprintf(buf + n, len - n, "-%s%d", type,
> > > + chan->channel2);
> > > + } else {
> > > + if (chan->differential) {
> > > + WARN(1, "Differential channels must be indexed\n");
> > > + return -EINVAL;
> > > + }
> > > + n = scnprintf(buf, len, "%s_%s", dir, type);
> > > + }
> > > +
> > > + if (chan->modified) {
> > > + if (chan->differential) {
> > > + WARN(1, "Differential channels can not have modifier\n");
> > > + return -EINVAL;
> >
> > WARN() looks too much to me. dev_error() as we're treating it as such. I
> > guess you don't want to pass struct device but not really an issue IMHO.
>
> __iio_device_attr_init() also used WARN(), probably because it didnt have
> access to a dev pointer. It would not be a problem to add an extra param.
Hmm, fair enough. Maybe a chance to change it. Not sure how others feel
about it.
>
> >
> > > + }
> > > + n += scnprintf(buf + n, len - n, "_%s",
> > > + iio_modifier_names[chan->channel2]);
> > > + }
> > > +
> > > + if (chan->extend_name)
> > > + n += scnprintf(buf + n, len - n, "_%s", chan->extend_name);
> > > + break;
> > > + }
> > > +
> > > + if (n > 0 && n < len - 1) { /* prefix termination if not empty */
> > > + buf[n++] = '_';
> > > + buf[n] = '\0';
> > > + }
> > > +
> >
> > Can't we handle the above in the caller on kasprintf()? Then we could
> > simplify and return in place.
>
> I felt like doing this here would get a cleaner logic in the caller, which
> would have to add the '_' conditionally.
>
I think it makes things more clear in the caller given we return n
anyways but I don't feel strong about it.
- Nuno Sá
> >
> > > + return n;
> > > +}
> > > +
> > > /**
> > > * iio_device_id() - query the unique ID for the device
> > > * @indio_dev: Device structure whose ID is being queried
> > > @@ -1100,106 +1159,19 @@ int __iio_device_attr_init(struct device_attribute *dev_attr,
> > > size_t len),
> > > enum iio_shared_by shared_by)
> > > {
> > > - int ret = 0;
> > > - char *name = NULL;
> > > - char *full_postfix;
> > > + char prefix[NAME_MAX + 1];
> > > + int ret;
> > >
> > > sysfs_attr_init(&dev_attr->attr);
> > >
> > > - /* Build up postfix of <extend_name>_<modifier>_postfix */
> > > - if (chan->modified && (shared_by == IIO_SEPARATE)) {
> > > - if (chan->extend_name)
> > > - full_postfix = kasprintf(GFP_KERNEL, "%s_%s_%s",
> > > - iio_modifier_names[chan->channel2],
> > > - chan->extend_name,
> > > - postfix);
> > > - else
> > > - full_postfix = kasprintf(GFP_KERNEL, "%s_%s",
> > > - iio_modifier_names[chan->channel2],
> > > - postfix);
> > > - } else {
> > > - if (chan->extend_name == NULL || shared_by != IIO_SEPARATE)
> > > - full_postfix = kstrdup(postfix, GFP_KERNEL);
> > > - else
> > > - full_postfix = kasprintf(GFP_KERNEL,
> > > - "%s_%s",
> > > - chan->extend_name,
> > > - postfix);
> > > - }
> > > - if (full_postfix == NULL)
> > > + ret = __iio_chan_prefix_emit(chan, shared_by, prefix, sizeof(prefix));
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + dev_attr->attr.name = kasprintf(GFP_KERNEL, "%s%s", prefix, postfix);
> > > + if (!dev_attr->attr.name)
> > > return -ENOMEM;
> >
> > I don't oppose the change. Looks like a nice cleanup. But bear in mind
> > this very sensible as any subtle mistake means ABI breakage.
>
> Yes! I tried to be careful... this is dangerous stuff!
>
> --
> Kind regards,
>
> Rodrigo Alencar
^ permalink raw reply
* Re: [PATCH v8 10/46] KVM: guest_memfd: Wire up core private/shared attribute interfaces
From: Fuad Tabba @ 2026-06-19 8:34 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-10-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> With in-place conversion, guest_memfd is able to track the private/shared
> status of memory. Use a global flag to toggle between tracking
> private/shared status per-vm or within guest_memfd.
>
> When queried for supported vm memory attributes, return 0 if attributes are
> tracked in guest_memfd.
>
> When querying for memory attributes over a range, look up memory attributes
> based on the flag's state at query time.
>
> For per-GFN memory attribute queries, choosing an implementation (VM or
> guest_memfd lookup) at KVM load time.
>
> The flag is always false for now and will be made toggle-able after all
> in-place conversion features are added in subsequent patches.
>
> If/since the flag is false, if CONFIG_KVM_VM_MEMORY_ATTRIBUTES is also not
> selected, the per-GFN memory attribute query defaults to returning
> 0 (false/not private).
>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/linux/kvm_host.h | 4 ++++
> virt/kvm/guest_memfd.c | 22 +++++++++++++++++++---
> virt/kvm/kvm_main.c | 12 +++++++++++-
> 3 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 27687fb9d5201..acb552745b428 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2560,6 +2560,8 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> #ifdef kvm_arch_has_private_mem
> +extern bool gmem_in_place_conversion;
> +
> typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
> DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
>
> @@ -2568,6 +2570,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> return static_call(__kvm_mem_is_private)(kvm, gfn);
> }
> #else
> +#define gmem_in_place_conversion false
> +
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return false;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index bca912db5be6e..e0e544ef47d69 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -926,6 +926,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
> +static bool kvm_gmem_range_is_private(struct file *file, pgoff_t index,
> + size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> +{
> + struct maple_tree *mt = &GMEM_I(file_inode(file))->attributes;
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + if (!gmem_in_place_conversion)
> + return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +
> + mt_for_each(mt, entry, index, end) {
> + if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> + return false;
> + }
> + return true;
> +}
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> struct file *file, gfn_t gfn, struct page *src_page,
> @@ -946,9 +964,7 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> folio_unlock(folio);
>
> - if (!kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + 1,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> + if (!kvm_gmem_range_is_private(file, index, 1, kvm, gfn)) {
> ret = -EINVAL;
> goto out_put_folio;
> }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8b238e461b854..01761f6e25d25 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,10 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
> static bool __ro_after_init allow_unsafe_mappings;
> module_param(allow_unsafe_mappings, bool, 0444);
>
> +#ifdef kvm_arch_has_private_mem
> +bool __ro_after_init gmem_in_place_conversion = false;
> +#endif
> +
> /*
> * Ordering of locks:
> *
> @@ -2422,6 +2426,9 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> static u64 kvm_supported_vm_mem_attributes(struct kvm *kvm)
> {
> #ifdef kvm_arch_has_private_mem
> + if (gmem_in_place_conversion)
> + return 0;
> +
> if (!kvm || kvm_arch_has_private_mem(kvm))
> return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> #endif
> @@ -2633,8 +2640,11 @@ EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
>
> static void kvm_init_memory_attributes(void)
> {
> + if (gmem_in_place_conversion)
> + static_call_update(__kvm_mem_is_private, kvm_gmem_is_private);
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> - static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
> + else
> + static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
> #endif
> }
> #else
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v5 1/7] tracing/events: Fix to check the simple_tsk_fn creation
From: Masami Hiramatsu @ 2026-06-19 8:33 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165817322.269421.3992299509400184196.stgit@devnote2>
Let me pick this fix to probes/core.
Thanks,
On Wed, 17 Jun 2026 10:02:53 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> Sashiko pointed that this sample code does not correctly handle the
> failure of thread creation because kthread_run() can return -errno.
>
> Check the simple_tsk_fn is correctly initialized (created) or not.
>
> Link: https://sashiko.dev/#/patchset/178092865666.163648.10457567771536160909.stgit%40devnote2
>
> Fixes: 9cfe06f8cd5c ("tracing/events: add trace-events-sample")
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
> Changes in v4:
> - Fix to remove decrementing counter in error path, since foo_bar_reg() always returns 0.
> - Add a newline to error message.
> Changes in v3:
> - Recover the usage counter.
> ---
> samples/trace_events/trace-events-sample.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
> index ecc7db237f2e..0b7a6efdb247 100644
> --- a/samples/trace_events/trace-events-sample.c
> +++ b/samples/trace_events/trace-events-sample.c
> @@ -107,6 +107,10 @@ int foo_bar_reg(void)
> * for consistency sake, we still take the thread_mutex.
> */
> simple_tsk_fn = kthread_run(simple_thread_fn, NULL, "event-sample-fn");
> + if (IS_ERR_OR_NULL(simple_tsk_fn)) {
> + pr_err("Failed to create simple_thread_fn\n");
> + simple_tsk_fn = NULL;
> + }
> out:
> mutex_unlock(&thread_mutex);
> return 0;
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Fuad Tabba @ 2026-06-19 8:25 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-9-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce function for KVM to check the private/shared status of guest
> memory at a given GFN.
>
> This will be used in a later patch.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/linux/kvm_host.h | 2 ++
> virt/kvm/guest_memfd.c | 31 +++++++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3915da2a61778..27687fb9d5201 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2575,6 +2575,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> #ifdef CONFIG_KVM_GUEST_MEMFD
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn);
> +
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
> int *max_order);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8101f64e0366f..bca912db5be6e 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -510,6 +510,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> return 0;
> }
>
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> + struct inode *inode;
> +
> + /*
> + * If this gfn has no associated memslot, there's no chance of the gfn
> + * being backed by private memory, since guest_memfd must be used for
> + * private memory, and guest_memfd must be associated with some memslot.
> + */
> + if (!slot)
> + return 0;
> +
> + CLASS(gmem_get_file, file)(slot);
> + if (!file)
> + return 0;
> +
> + inode = file_inode(file);
> +
> + /*
> + * Rely on the maple tree's internal RCU lock to ensure a
> + * stable result. This result can become stale as soon as the
> + * lock is dropped, so the caller _must_ still protect
> + * consumption of private vs. shared by checking
> + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> + * against ongoing attribute updates.
> + */
> + return kvm_gmem_is_private_mem(inode, kvm_gmem_get_index(slot, gfn));
> +}
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_is_private);
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Fuad Tabba @ 2026-06-19 8:21 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CA+EHjTw6x-mxDnJjnhE-6SV73tMrb0paKDTtOC2j6zJ1fXZDLA@mail.gmail.com>
On Fri, 19 Jun 2026 at 09:19, Fuad Tabba <tabba@google.com> wrote:
>
> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Introduce a generic kvm_mem_is_private() interface using a static call to
> > determine if a GFN is private. This allows the implementation for checking
> > a GFN's private/shared status to be set at runtime.
> >
> > In preparation for choosing implementations between a guest_memfd lookup
> > and the existing VM attribute lookup, rename the existing
> > VM-attribute-based check to kvm_vm_mem_is_private to emphasize that it
> > looks up VM attributes.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> (SoB fix plz)
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
> > ---
> > include/linux/kvm_host.h | 12 +++++++++++-
> > virt/kvm/kvm_main.c | 15 +++++++++++++++
> > 2 files changed, 26 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index eb26d4ea8945a..3915da2a61778 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> > bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > struct kvm_gfn_range *range);
> >
> > -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
Should have read the Sashiko review first, but where is this used?
It's not used at all in this series...
/fuad
> > {
> > return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > }
> > @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> > KVM_MEMORY_ATTRIBUTE_PRIVATE,
> > KVM_MEMORY_ATTRIBUTE_PRIVATE);
> > }
> > +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef kvm_arch_has_private_mem
> > +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
> > +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
> > +
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return static_call(__kvm_mem_is_private)(kvm, gfn);
> > +}
> > #else
> > static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > {
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6669f1477013c..8b238e461b854 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > }
> > #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> >
> > +#ifdef kvm_arch_has_private_mem
> > +DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
> > +EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
> > +
> > +static void kvm_init_memory_attributes(void)
> > +{
> > +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> > + static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
> > +#endif
> > +}
> > +#else
> > +static void kvm_init_memory_attributes(void) { }
> > +#endif
> > +
> > struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> > {
> > return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -6528,6 +6542,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
> > kvm_preempt_ops.sched_in = kvm_sched_in;
> > kvm_preempt_ops.sched_out = kvm_sched_out;
> >
> > + kvm_init_memory_attributes();
> > kvm_init_debug();
> >
> > r = kvm_vfio_ops_init();
> >
> > --
> > 2.55.0.rc0.738.g0c8ab3ebcc-goog
> >
> >
^ permalink raw reply
* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Fuad Tabba @ 2026-06-19 8:19 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-8-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce a generic kvm_mem_is_private() interface using a static call to
> determine if a GFN is private. This allows the implementation for checking
> a GFN's private/shared status to be set at runtime.
>
> In preparation for choosing implementations between a guest_memfd lookup
> and the existing VM attribute lookup, rename the existing
> VM-attribute-based check to kvm_vm_mem_is_private to emphasize that it
> looks up VM attributes.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
(SoB fix plz)
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/linux/kvm_host.h | 12 +++++++++++-
> virt/kvm/kvm_main.c | 15 +++++++++++++++
> 2 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index eb26d4ea8945a..3915da2a61778 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
>
> -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> }
> @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> KVM_MEMORY_ATTRIBUTE_PRIVATE,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
> }
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +
> +#ifdef kvm_arch_has_private_mem
> +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
> +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
> +
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return static_call(__kvm_mem_is_private)(kvm, gfn);
> +}
> #else
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6669f1477013c..8b238e461b854 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> +#ifdef kvm_arch_has_private_mem
> +DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
> +EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
> +
> +static void kvm_init_memory_attributes(void)
> +{
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> + static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
> +#endif
> +}
> +#else
> +static void kvm_init_memory_attributes(void) { }
> +#endif
> +
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -6528,6 +6542,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
> kvm_preempt_ops.sched_in = kvm_sched_in;
> kvm_preempt_ops.sched_out = kvm_sched_out;
>
> + kvm_init_memory_attributes();
> kvm_init_debug();
>
> r = kvm_vfio_ops_init();
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Fuad Tabba @ 2026-06-19 8:16 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-7-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Rename memory attribute APIs to add a "vm_" in the name in anticipation of
> moving PRIVATE tracking into guest_memfd, to allow in-place conversion
> between SHARED and PRIVATE. At that point, there will effectively be two
> (potential) sources of memory attributes: the VM and guest_memfd.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Missing SoB (other patches as well, I won't mention it again). But for
this (and other patches I review with a missing SoB fixed):
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/kvm/mmu/mmu.c | 6 +++---
> include/linux/kvm_host.h | 15 +++++++++++----
> virt/kvm/guest_memfd.c | 6 +++---
> virt/kvm/kvm_main.c | 16 ++++++++--------
> 4 files changed, 25 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e0005a21b6e22..cbc50aef801fb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -8087,11 +8087,11 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
> const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
>
> if (level == PG_LEVEL_2M)
> - return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
> + return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
>
> for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> if (hugepage_test_mixed(slot, gfn, level - 1) ||
> - attrs != kvm_get_memory_attributes(kvm, gfn))
> + attrs != kvm_get_vm_memory_attributes(kvm, gfn))
> return false;
> }
> return true;
> @@ -8191,7 +8191,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
> * be manually checked as the attributes may already be mixed.
> */
> for (gfn = start; gfn < end; gfn += nr_pages) {
> - unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
> + unsigned long attrs = kvm_get_vm_memory_attributes(kvm, gfn);
>
> if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
> hugepage_clear_mixed(slot, gfn, level);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d370e834d619e..eb26d4ea8945a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2534,13 +2534,13 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> }
>
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +static inline unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
> return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> }
>
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> - unsigned long mask, unsigned long attrs);
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> + unsigned long mask, unsigned long attrs);
> bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> @@ -2548,7 +2548,14 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> - return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> + return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> + gfn_t end)
> +{
> + return kvm_range_has_vm_memory_attributes(kvm, start, end,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> }
> #else
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b4c24fdf159f6..8101f64e0366f 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -915,9 +915,9 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> folio_unlock(folio);
>
> - if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> + if (!kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + 1,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> ret = -EINVAL;
> goto out_put_folio;
> }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7b989b659cf82..6669f1477013c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2419,7 +2419,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +static u64 kvm_supported_vm_mem_attributes(struct kvm *kvm)
> {
> #ifdef kvm_arch_has_private_mem
> if (!kvm || kvm_arch_has_private_mem(kvm))
> @@ -2433,19 +2433,19 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> * Returns true if _all_ gfns in the range [@start, @end) have attributes
> * such that the bits in @mask match @attrs.
> */
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> - unsigned long mask, unsigned long attrs)
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> + unsigned long mask, unsigned long attrs)
> {
> XA_STATE(xas, &kvm->mem_attr_array, start);
> unsigned long index;
> void *entry;
>
> - mask &= kvm_supported_mem_attributes(kvm);
> + mask &= kvm_supported_vm_mem_attributes(kvm);
> if (attrs & ~mask)
> return false;
>
> if (end == start + 1)
> - return (kvm_get_memory_attributes(kvm, start) & mask) == attrs;
> + return (kvm_get_vm_memory_attributes(kvm, start) & mask) == attrs;
>
> guard(rcu)();
> if (!attrs)
> @@ -2567,7 +2567,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> mutex_lock(&kvm->slots_lock);
>
> /* Nothing to do if the entire range has the desired attributes. */
> - if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
> + if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
> goto out_unlock;
>
> /*
> @@ -2606,7 +2606,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> /* flags is currently not used. */
> if (attrs->flags)
> return -EINVAL;
> - if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
> + if (attrs->attributes & ~kvm_supported_vm_mem_attributes(kvm))
> return -EINVAL;
> if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> return -EINVAL;
> @@ -4926,7 +4926,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> case KVM_CAP_MEMORY_ATTRIBUTES:
> - return kvm_supported_mem_attributes(kvm);
> + return kvm_supported_vm_mem_attributes(kvm);
> #endif
> #ifdef CONFIG_KVM_GUEST_MEMFD
> case KVM_CAP_GUEST_MEMFD:
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Fuad Tabba @ 2026-06-19 8:12 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-5-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable, only for (CoCo) VM types
> that might use vm_memory_attributes.
>
> Also document CONFIG_KVM_VM_MEMORY_ATTRIBUTES to specifically be about the
> private/shared attribute.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
You're missing a SoB, but with that fixed:
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/kvm/Kconfig | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 24f96396cfa1c..c28393dc664eb 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -81,13 +81,16 @@ config KVM_WERROR
> If in doubt, say "N".
>
> config KVM_VM_MEMORY_ATTRIBUTES
> - bool
> + depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
> + bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
> + help
> + Enable support for tracking PRIVATE vs. SHARED memory using per-VM
> + memory attributes.
>
> config KVM_SW_PROTECTED_VM
> bool "Enable support for KVM software-protected VMs"
> depends on EXPERT
> depends on KVM_X86 && X86_64
> - select KVM_VM_MEMORY_ATTRIBUTES
> help
> Enable support for KVM software-protected VMs. Currently, software-
> protected VMs are purely a development and testing vehicle for
> @@ -138,7 +141,6 @@ config KVM_INTEL_TDX
> bool "Intel Trust Domain Extensions (TDX) support"
> default y
> depends on INTEL_TDX_HOST
> - select KVM_VM_MEMORY_ATTRIBUTES
> select HAVE_KVM_ARCH_GMEM_POPULATE
> help
> Provides support for launching Intel Trust Domain Extensions (TDX)
> @@ -162,7 +164,6 @@ config KVM_AMD_SEV
> depends on KVM_AMD && X86_64
> depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
> select ARCH_HAS_CC_PLATFORM
> - select KVM_VM_MEMORY_ATTRIBUTES
> select HAVE_KVM_ARCH_GMEM_PREPARE
> select HAVE_KVM_ARCH_GMEM_INVALIDATE
> select HAVE_KVM_ARCH_GMEM_POPULATE
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Fuad Tabba @ 2026-06-19 8:10 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-4-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> When memory attributes become trackable in guest_memfd, the concept of
> having private memory is no longer dependent on
> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
>
> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
> in.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/include/asm/kvm_host.h | 4 +++-
> include/linux/kvm_host.h | 2 +-
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) || \
> + defined(CONFIG_KVM_INTEL_TDX) || \
> + defined(CONFIG_KVM_AMD_SEV)
> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 201d0f2143976..d370e834d619e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> }
> #endif
>
> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifndef kvm_arch_has_private_mem
> static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> {
> return false;
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v4 3/5] rpmsg: virtio_rpmsg_bus: get buffer size from config space
From: Arnaud POULIQUEN @ 2026-06-19 7:45 UTC (permalink / raw)
To: tanmay.shah, andersson, mathieu.poirier, corbet, skhan
Cc: linux-remoteproc, linux-doc, linux-kernel
In-Reply-To: <1251e3e1-80fe-4146-a1f8-5eb251a323be@amd.com>
On 6/18/26 18:31, Shah, Tanmay wrote:
>
>
> On 6/18/2026 3:32 AM, Arnaud POULIQUEN wrote:
>>
>>
>> On 6/17/26 19:41, Shah, Tanmay wrote:
>>>
>>>
>>> On 6/17/2026 4:15 AM, Arnaud POULIQUEN wrote:
>>>> Hi Tanmay,
>>>>
>>>> On 6/15/26 22:20, Tanmay Shah wrote:
>>>>> 512 bytes isn't always suitable for all case, let firmware
>>>>> maker decide the best value from resource table.
>>>>> enable by VIRTIO_RPMSG_F_BUFSZ feature bit.
>>>>>
>>>>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>>>>> ---
>>>>>
>>>>> Changes in v4: squash to virtio rpmsg config patch
>>>>> - Introduce new patch to modify rpmsg.rst documentation
>>>>> - check version is always 1.
>>>>> - check size field is same as size of struct virtio_rpmsg_config
>>>>> - introduce alignment field
>>>>> - check alignment field is power of 2
>>>>> - check tx and rx buf size is aligned with alignment passed in the
>>>>> structure
>>>>>
>>>>> Changes in v3:
>>>>> - change version field from u16 to u8
>>>>> - introduce size field in the rpmsg_virtio_config structure
>>>>> - check version field is set to any non-zero value.
>>>>> - check size field is not 0.
>>>>> - Remove field for private config, as not needed for now.
>>>>> - add documentation of rpmsg_virtio_config structure
>>>>>
>>>>> drivers/rpmsg/virtio_rpmsg_bus.c | 129 +++++++++++++++++++++++
>>>>> +-----
>>>>> include/linux/rpmsg/virtio_rpmsg.h | 50 +++++++++++
>>>>> 2 files changed, 160 insertions(+), 19 deletions(-)
>>>>> create mode 100644 include/linux/rpmsg/virtio_rpmsg.h
>>>>>
>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/
>>>>> virtio_rpmsg_bus.c
>>>>> index 99df1ae07055..a59925f870a4 100644
>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>> @@ -15,11 +15,13 @@
>>>>> #include <linux/idr.h>
>>>>> #include <linux/jiffies.h>
>>>>> #include <linux/kernel.h>
>>>>> +#include <linux/log2.h>
>>>>> #include <linux/module.h>
>>>>> #include <linux/mutex.h>
>>>>> #include <linux/rpmsg.h>
>>>>> #include <linux/rpmsg/byteorder.h>
>>>>> #include <linux/rpmsg/ns.h>
>>>>> +#include <linux/rpmsg/virtio_rpmsg.h>
>>>>> #include <linux/scatterlist.h>
>>>>> #include <linux/slab.h>
>>>>> #include <linux/sched.h>
>>>>> @@ -39,7 +41,8 @@
>>>>> * @tx_bufs: kernel address of tx buffers
>>>>> * @num_rx_buf: total number of rx buffers
>>>>> * @num_tx_buf: total number of tx buffers
>>>>> - * @buf_size: size of one rx or tx buffer
>>>>> + * @rx_buf_size: size of one rx buffer
>>>>> + * @tx_buf_size: size of one tx buffer
>>>>> * @last_tx_buf: index of last tx buffer used
>>>>> * @bufs_dma: dma base addr of the buffers
>>>>> * @tx_lock: protects svq and tx_bufs, to allow concurrent
>>>>> senders.
>>>>> @@ -59,7 +62,8 @@ struct virtproc_info {
>>>>> void *rx_bufs, *tx_bufs;
>>>>> unsigned int num_rx_buf;
>>>>> unsigned int num_tx_buf;
>>>>> - unsigned int buf_size;
>>>>> + unsigned int rx_buf_size;
>>>>> + unsigned int tx_buf_size;
>>>>> int last_tx_buf;
>>>>> dma_addr_t bufs_dma;
>>>>> struct mutex tx_lock;
>>>>> @@ -68,9 +72,6 @@ struct virtproc_info {
>>>>> wait_queue_head_t sendq;
>>>>> };
>>>>> -/* The feature bitmap for virtio rpmsg */
>>>>> -#define VIRTIO_RPMSG_F_NS 0 /* RP supports name service
>>>>> notifications */
>>>>> -
>>>>> /**
>>>>> * struct rpmsg_hdr - common header for all rpmsg messages
>>>>> * @src: source address
>>>>> @@ -128,7 +129,7 @@ struct virtio_rpmsg_channel {
>>>>> * processor.
>>>>> */
>>>>> #define MAX_RPMSG_NUM_BUFS (256)
>>>>> -#define MAX_RPMSG_BUF_SIZE (512)
>>>>> +#define DEFAULT_RPMSG_BUF_SIZE (512)
>>>>> /*
>>>>> * Local addresses are dynamically allocated on-demand.
>>>>> @@ -444,7 +445,7 @@ static void *get_a_tx_buf(struct virtproc_info
>>>>> *vrp)
>>>>> /* either pick the next unused tx buffer */
>>>>> if (vrp->last_tx_buf < vrp->num_tx_buf)
>>>>> - ret = vrp->tx_bufs + vrp->buf_size * vrp->last_tx_buf++;
>>>>> + ret = vrp->tx_bufs + vrp->tx_buf_size * vrp->last_tx_buf++;
>>>>> /* or recycle a used one */
>>>>> else
>>>>> ret = virtqueue_get_buf(vrp->svq, &len);
>>>>> @@ -514,7 +515,7 @@ static int rpmsg_send_offchannel_raw(struct
>>>>> rpmsg_device *rpdev,
>>>>> * messaging), or to improve the buffer allocator, to support
>>>>> * variable-length buffer sizes.
>>>>> */
>>>>> - if (len > vrp->buf_size - sizeof(struct rpmsg_hdr)) {
>>>>> + if (len > vrp->tx_buf_size - sizeof(struct rpmsg_hdr)) {
>>>>> dev_err(dev, "message is too big (%d)\n", len);
>>>>> return -EMSGSIZE;
>>>>> }
>>>>> @@ -647,7 +648,7 @@ static ssize_t virtio_rpmsg_get_mtu(struct
>>>>> rpmsg_endpoint *ept)
>>>>> struct rpmsg_device *rpdev = ept->rpdev;
>>>>> struct virtio_rpmsg_channel *vch =
>>>>> to_virtio_rpmsg_channel(rpdev);
>>>>> - return vch->vrp->buf_size - sizeof(struct rpmsg_hdr);
>>>>> + return vch->vrp->tx_buf_size - sizeof(struct rpmsg_hdr);
>>>>> }
>>>>> static int rpmsg_recv_single(struct virtproc_info *vrp, struct
>>>>> device *dev,
>>>>> @@ -673,7 +674,7 @@ static int rpmsg_recv_single(struct virtproc_info
>>>>> *vrp, struct device *dev,
>>>>> * We currently use fixed-sized buffers, so trivially sanitize
>>>>> * the reported payload length.
>>>>> */
>>>>> - if (len > vrp->buf_size ||
>>>>> + if (len > vrp->rx_buf_size ||
>>>>> msg_len > (len - sizeof(struct rpmsg_hdr))) {
>>>>> dev_warn(dev, "inbound msg too big: (%d, %d)\n", len,
>>>>> msg_len);
>>>>> return -EINVAL;
>>>>> @@ -706,7 +707,7 @@ static int rpmsg_recv_single(struct virtproc_info
>>>>> *vrp, struct device *dev,
>>>>> dev_warn_ratelimited(dev, "msg received with no
>>>>> recipient\n");
>>>>> /* publish the real size of the buffer */
>>>>> - rpmsg_sg_init(&sg, msg, vrp->buf_size);
>>>>> + rpmsg_sg_init(&sg, msg, vrp->rx_buf_size);
>>>>> /* add the buffer back to the remote processor's virtqueue */
>>>>> err = virtqueue_add_inbuf(vrp->rvq, &sg, 1, msg, GFP_KERNEL);
>>>>> @@ -820,10 +821,13 @@ static int rpmsg_probe(struct virtio_device
>>>>> *vdev)
>>>>> struct virtproc_info *vrp;
>>>>> struct virtio_rpmsg_channel *vch = NULL;
>>>>> struct rpmsg_device *rpdev_ns, *rpdev_ctrl;
>>>>> + u16 rpmsg_buf_align = 0;
>>>>> void *bufs_va;
>>>>> int err = 0, i;
>>>>> size_t total_buf_space;
>>>>> bool notify;
>>>>> + u8 version;
>>>>> + u16 size;
>>>>> vrp = kzalloc_obj(*vrp);
>>>>> if (!vrp)
>>>>> @@ -855,9 +859,90 @@ static int rpmsg_probe(struct virtio_device *vdev)
>>>>> else
>>>>> vrp->num_tx_buf = MAX_RPMSG_NUM_BUFS;
>>>>> - vrp->buf_size = MAX_RPMSG_BUF_SIZE;
>>>>> + /*
>>>>> + * If VIRTIO_RPMSG_F_BUFSZ feature is supported, then configure
>>>>> buf
>>>>> + * size from virtio device config space from the resource table.
>>>>> + * If the feature is not supported, then assign default buf size.
>>>>> + */
>>>>> + if (virtio_has_feature(vdev, VIRTIO_RPMSG_F_BUFSZ)) {
>>>>> + virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> + version, &version);
>>>>> +
>>>>> + /* for now we support only v1 */
>>>>> + if (version != RPMSG_VDEV_CONFIG_V1) {
>>>>> + dev_err(&vdev->dev,
>>>>> + "unsupported vdev config version %u\n", version);
>>>>> + err = -EINVAL;
>>>>> + goto vqs_del;
>>>>> + }
>>>>> +
>>>>> + /* size of the config space must match */
>>>>> + virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> + size, &size);
>>>>> + if (size != sizeof(struct virtio_rpmsg_config)) {
>>>>> + dev_err(&vdev->dev, "invalid size of vdev config %u\n",
>>>>> + size);
>>>>> + err = -EINVAL;
>>>>> + goto vqs_del;
>>>>> + }
>>>>> - total_buf_space = (vrp->num_rx_buf + vrp->num_tx_buf) * vrp-
>>>>>> buf_size;
>>>>> + /*
>>>>> + * Optional alignment applied to each buffer size and to
>>>>> the TX
>>>>> + * buffer base address (e.g. to align buffers on a cache
>>>>> line).
>>>>> + * It must be a power of two; zero means no extra alignment.
>>>>> + */
>>>>> + virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> + rpmsg_buf_align, &rpmsg_buf_align);
>>>>> + if (rpmsg_buf_align && !is_power_of_2(rpmsg_buf_align)) {
>>>>> + dev_err(&vdev->dev,
>>>>> + "bad vdev config: rpmsg_buf_align %u is not a power
>>>>> of two\n",
>>>>> + rpmsg_buf_align);
>>>>> + err = -EINVAL;
>>>>> + goto vqs_del;
>>>>> + }
>>>>> +
>>>>> + /* note: tx and rx are defined from remote view */
>>>>> + virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> + txbuf_size, &vrp->rx_buf_size);
>>>>> + virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> + rxbuf_size, &vrp->tx_buf_size);
>>>>> +
>>>>> + /* The buffers must hold at least the rpmsg header */
>>>>> + if (vrp->rx_buf_size < sizeof(struct rpmsg_hdr) ||
>>>>> + vrp->tx_buf_size < sizeof(struct rpmsg_hdr)) {
>>>>> + dev_err(&vdev->dev,
>>>>> + "bad vdev config: rx buf sz = %u, tx buf sz = %u\n",
>>>>> + vrp->rx_buf_size, vrp->tx_buf_size);
>>>>> + err = -EINVAL;
>>>>> + goto vqs_del;
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> + * The buffer size must be aligned to the provided
>>>>> alignment for
>>>>> + * so that the start address of tx bufs can be aligned.
>>>>> + */
>>>>
>>>> 'tx' to remove as it also concerns Rx buffers
>>>>
>>>
>>> Ack.
>>>
>>>>
>>>> What about removing this check to manage alignment during buffer
>>>> allocation?
>>>>
>>>> For example, if the alignment is on a 64-bit address and the tx_buffer
>>>> and rx_buffer sizes are 40 bytes, 48 bytes can be allocated in memory
>>>> for each buffer, and the virtio descriptor can be filled with aligned
>>>> addresses.
>>>>
>>>> In other words, the rpmsg_buf_align field contains the alignment
>>>> constraint from the remote processor. If the Linux kernel wants to
>>>> impose another alignment constraint, it must test or update
>>>> rpmsg_buf_align, but it must not impose alignment on the buffer size.
>>>>
>>>>
>>>
>>> This part I don't understand. `rpmsg_buf_align` is alignment for only
>>> single buffer size. The linux kernel is checking that single rx buf size
>>> and tx buf size is aligned with `rpmsg_buf_align` as firmware has
>>> claimed.
>>>
>>> For reference the openamp-system-reference PR:
>>> https://github.com/OpenAMP/openamp-system-reference/pull/106/changes
>>>
>>> .vdev_config = {
>>> .version = 1,
>>> .reserved = 0,
>>> .size = (uint16_t)(sizeof(struct rpmsg_virtio_config) -
>>> sizeof(bool)),
>>> .alignment = RPMSG_BUF_ALIGN,
>>> .reserved1 = 0,
>>> /* Tx for host */
>>> .h2r_buf_size = metal_align_up(4096, RPMSG_BUF_ALIGN),
>>> /* Rx for host */
>>> .r2h_buf_size = metal_align_up(4096, RPMSG_BUF_ALIGN),
>>> },
>>>
>>> IIUC, The linux kernel is not really supposed to modify
>>> `rpmsg_buf_align`. It only uses it to check that firmware has assigned
>>> correct size of single rx and tx buffer.
>>>
>>>
>>> When the linux kernel uses dma_alloc_coherent() API it aligns total
>>> buffer size with page size. That is different than single tx buf size
>>> and single rx buf size. The total buf size alignment to page size is
>>> irrelevant to `rpmsg_buf_align` field.
>>>
>>> Please let me know if I am missing something or didn't understand your
>>> comment. I prefer that `rpmsg_buf_align` should be only modified by the
>>> firmware and not the linux kernel.
>>
>>
>> Sorry it was unclear, let try to reexplain my suggestion:
>>
>> Two alignment constraints can apply:
>> - The remote processor can require an alignment through
>> vdev_config::alignment.
>> - The main processor, which runs Linux or another operating system (OS),
>> can require a different alignment, for example, for cache alignment.
>> In current Linux implementation no constraint in Linux.
>> nevertheless I would be in favor of taking into account such future
>> constraint without imposing constraint on the buffer sizes.
>
> Is this ever going to be ture? Is it ever possible that Linux and remote
> has different cache alignment? IIUC, both will be using same cache and
> so same alignment will be applicable. That is why only signle alignment
> is required.
Some remote processors, for example, some Arm Cortex-M33, do not
integrate cache. Even if cache exists, cache can be enabled on one
processor, but not on the other.
>
>> Based on that in short term the local 'rpmsg_buf_align' would still
>> computed
>> only from vdev_config::alignment (not update of vdev_config::alignment).
>>
>> virtio_cread(vdev, struct virtio_rpmsg_config,
>> rpmsg_buf_align, &rpmsg_buf_align);
>>
>> Then you could use use ALIGN() helper:
>>
>> unsigned int rx_buf_align_size = ALIGN(vrp->rx_buf_size,
>> rpmsg_buf_align);
>> unsigned int tx_buf_align_size = ALIGN(vrp->tx_buf_size,
>> rpmsg_buf_align);
>>
>
> This is where I have different opinion. Instead of Linux using ALIGN()
> macro, can we expect that firmware must assign the aligned buffer size
> with vdev_config::rpmsg_buf_align? And so Linux will fail if the buffer
> size is not aligned already from the firmware side. That is why I had
> introduced checks instead of doing alignment by linux.
>
>> total_buf_space = (vrp->num_rx_buf * rx_buf_align_size) +
>> (vrp->num_tx_buf * tx_buf_align_size);
>>
>> vrp->tx_bufs = bufs_va + vrp->num_rx_buf * rx_buf_align_size;
>>
>> Apply the same rule to cpu_addr in the vring descriptor:
>>
>> void *cpu_addr = vrp->rx_bufs + i * rx_buf_align_size;
>>
>> rpmsg_sg_init(&sg, cpu_addr, vrp->rx_buf_size);
>>
>> With this approach, the buffer addresses remain aligned
>> independently of vdev_config::Rxbuf_size and vdev_config::txbuf_size.
>> Don't hesitate if it is still not clear!
>
> How they remain aligned independent of tx/rx_buf_size? tx_bufs address
> is still calculated based on rx_buf_align_size, so its alignment still
> depends on rx_buf_align_size which is derived using
> vdev_config::rpmsg_buf_align.>
> I think we are trying to achive the same thing, but implementation is
> differnt. We just need to decide where the alignment should be done?
>
> Either on the linux side? Or in the firmware resource table?
>
> I prefer that the firmware should already provide aligned buffer size,
> and Linux should only check it. If alignment is not done, then simply
> fail with error. That way, firmware also knows the correct size of the
> buffer. If Linux does the alignment, then the firmware is not aware of
> the correct size that is used by the linux.
>
> I am open to move the alignment operation to the linux side with the
> reasonable justification.
That remains a suggestion. My main concern with the implementation is
that RPMsg size should depend only on the max playlod size needed, not
also on the memory alignment.
If this constraint is kept, it must be imposed on all other non-Linux
solutions. Otherwise, the remote implementation depends on the main
processor implementation.
From my POV, It would be preferable not to impose such constraint when
possible.
Thanks,
Arnaud
>
> Thank You,
> Tanmay
>
>>>
>>>
>>>>> + if (rpmsg_buf_align &&
>>>>> + (!IS_ALIGNED(vrp->rx_buf_size, rpmsg_buf_align) ||
>>>>> + !IS_ALIGNED(vrp->tx_buf_size, rpmsg_buf_align))) {
>>>>> + dev_err(&vdev->dev,
>>>>> + "bad vdev config: buf sizes (rx %u, tx %u) not
>>>>> aligned to %u\n",
>>>>> + vrp->rx_buf_size, vrp->tx_buf_size,
>>>>> + rpmsg_buf_align);
>>>>> + err = -EINVAL;
>>>>> + goto vqs_del;
>>>>> + }
>>>>> +
>>>>> + dev_dbg(&vdev->dev,
>>>>> + "vdev config: ver=%u, align=0x%x, rx sz = 0x%x, tx sz =
>>>>> 0x%x\n",
>>>>> + version, rpmsg_buf_align, vrp->rx_buf_size,
>>>>> + vrp->tx_buf_size);
>>>>> + } else {
>>>>> + vrp->rx_buf_size = DEFAULT_RPMSG_BUF_SIZE;
>>>>> + vrp->tx_buf_size = DEFAULT_RPMSG_BUF_SIZE;
>>>>> + }
>>>>> +
>>>>> + total_buf_space = (vrp->num_rx_buf * vrp->rx_buf_size) +
>>>>> + (vrp->num_tx_buf * vrp->tx_buf_size);
>>>>> /* allocate coherent memory for the buffers */
>>>>> bufs_va = dma_alloc_coherent(vdev->dev.parent,
>>>>> @@ -874,15 +959,20 @@ static int rpmsg_probe(struct virtio_device
>>>>> *vdev)
>>>>> /* first part of the buffers is dedicated for RX */
>>>>> vrp->rx_bufs = bufs_va;
>>>>> - /* and second part is dedicated for TX */
>>>>> - vrp->tx_bufs = bufs_va + vrp->num_rx_buf * vrp->buf_size;
>>>>> + /*
>>>>> + * Here buf_va is aligned to a page. Also rx buf size is aligned
>>>>> with
>>>>> + * cache line alignment provided by the firmware, so tx buf's
>>>>> start
>>>>> + * address is guranteed to be aligned with the alignment
>>>>> provided by
>>>>> + * the firmware.
>>>>> + */
>>>>> + vrp->tx_bufs = bufs_va + (vrp->num_rx_buf * vrp->rx_buf_size);
>>>>> /* set up the receive buffers */
>>>>> for (i = 0; i < vrp->num_rx_buf; i++) {
>>>>> struct scatterlist sg;
>>>>> - void *cpu_addr = vrp->rx_bufs + i * vrp->buf_size;
>>>>> + void *cpu_addr = vrp->rx_bufs + i * vrp->rx_buf_size;
>>>>> - rpmsg_sg_init(&sg, cpu_addr, vrp->buf_size);
>>>>> + rpmsg_sg_init(&sg, cpu_addr, vrp->rx_buf_size);
>>>>> err = virtqueue_add_inbuf(vrp->rvq, &sg, 1, cpu_addr,
>>>>> GFP_KERNEL);
>>>>> @@ -965,8 +1055,8 @@ static int rpmsg_remove_device(struct device
>>>>> *dev, void *data)
>>>>> static void rpmsg_remove(struct virtio_device *vdev)
>>>>> {
>>>>> struct virtproc_info *vrp = vdev->priv;
>>>>> - unsigned int num_bufs = vrp->num_rx_buf + vrp->num_tx_buf;
>>>>> - size_t total_buf_space = num_bufs * vrp->buf_size;
>>>>> + size_t total_buf_space = (vrp->num_rx_buf * vrp->rx_buf_size) +
>>>>> + (vrp->num_tx_buf * vrp->tx_buf_size);
>>>>> int ret;
>>>>> virtio_reset_device(vdev);
>>>>> @@ -992,6 +1082,7 @@ static struct virtio_device_id id_table[] = {
>>>>> static unsigned int features[] = {
>>>>> VIRTIO_RPMSG_F_NS,
>>>>> + VIRTIO_RPMSG_F_BUFSZ,
>>>>> };
>>>>> static struct virtio_driver virtio_ipc_driver = {
>>>>> diff --git a/include/linux/rpmsg/virtio_rpmsg.h b/include/linux/rpmsg/
>>>>> virtio_rpmsg.h
>>>>> new file mode 100644
>>>>> index 000000000000..7e14da68fd17
>>>>> --- /dev/null
>>>>> +++ b/include/linux/rpmsg/virtio_rpmsg.h
>>>>> @@ -0,0 +1,50 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +/*
>>>>> + * Copyright (C) Pinecone Inc. 2019
>>>>> + * Copyright (C) Xiang Xiao <xiaoxiang@pinecone.net>
>>>>> + * Copyright (C) Advanced Micro Devices, Inc. 2026
>>>>> + */
>>>>> +
>>>>> +#ifndef _LINUX_VIRTIO_RPMSG_H
>>>>> +#define _LINUX_VIRTIO_RPMSG_H
>>>>> +
>>>>> +#include <linux/types.h>
>>>>> +#include <linux/virtio_types.h>
>>>>> +
>>>>> +/* The feature bitmap for virtio rpmsg */
>>>>> +#define VIRTIO_RPMSG_F_NS 0 /* RP supports name service
>>>>> notifications */
>>>>> +#define VIRTIO_RPMSG_F_BUFSZ 1 /* RP get buffer size from config
>>>>> space */
>>>>> +
>>>>> +/* Version of struct virtio_rpmsg_config understood by this driver */
>>>>> +#define RPMSG_VDEV_CONFIG_V1 1
>>>>> +
>>>>> +/**
>>>>> + * struct virtio_rpmsg_config - config space for rpmsg virtio device
>>>>> + *
>>>>> + * @version: version of this structure, currently
>>>>> %RPMSG_VDEV_CONFIG_V1.
>>>>> + * @reserved: reserved for padding, must be zero.
>>>>> + * @size: size of this structure in bytes.
>>>>> + * @rpmsg_buf_align: required alignment in bytes for each buffer.
>>>>> Must be a
>>>>> + * power of two so that both the buffer sizes and the TX buffer
>>>>> + * base address can be aligned (e.g. to a cache line).
>>>>> + * @reserved1: reserved for padding, must be zero. Keeps the
>>>>> following 32-bit
>>>>> + * fields naturally aligned.
>>>>> + * @txbuf_size: Tx buf size from remote's view. For Linux this is
>>>>> rx buf size.
>>>>> + * @rxbuf_size: Rx buf size from remote's view. For Linux this is
>>>>> tx buf size.
>>>>> + *
>>>>> + * This is the configuration structure shared by the device and the
>>>>> driver,
>>>>> + * read when %VIRTIO_RPMSG_F_BUFSZ is negotiated. The fields are laid
>>>>> out so
>>>>> + * the structure is naturally 32-bit aligned.
>>>>> + */
>>>>> +struct virtio_rpmsg_config {
>>>>> + u8 version;
>>>>> + u8 reserved;
>>>>
>>>> Why about defining the version type to u16 to avoid the reserved field?
>>>>
>>>>> + __virtio16 size;
>>>>> + __virtio16 rpmsg_buf_align;
>>>>> + __virtio16 reserved1;
>>>>
>>>> Seems useless if __packed prevents the compiler from inserting extra
>>>> padding
>>>> bytes between fields,
>>>>
>>>>> + /* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
>>>>> + __virtio32 txbuf_size;
>>>>> + __virtio32 rxbuf_size;
>>>>> +} __packed;
>>>>
>>>> proposal
>>>>
>>>> +struct virtio_rpmsg_config {
>>>> + __virtio16 version;
>>>> + __virtio16 size;
>>>> + /* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
>>>> + __virtio32 txbuf_size;
>>>> + __virtio32 rxbuf_size;
>>>> + __virtio16 rpmsg_buf_align;
>>>> +} __packed;
>>>> +
>>>>
>>>
>>> I am okay with the above proposal with minor difference:
>>>
>>> My proposal:
>>>
>>> +struct virtio_rpmsg_config {
>>> + u8 version;
>>> + __virtio16 size;
>>> + __virtio16 rpmsg_buf_align;
>>> + /* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
>>> + __virtio32 txbuf_size;
>>> + __virtio32 rxbuf_size;
>>> +} __packed;
>>>
>>> I just want to keep version field 8-bit, as we will probably never use
>>> upper byte of that field if we use 16-bit. Rest is okay. If the
>>> strucutre is packed then reserved bytes are not needed.
>>>
>>> Please let me know your view.
>>
>> No strong opinion on that. In the end, this structure is read only one
>> time.
>> If it is acceptable to Mathieu, it is acceptable to me.
>>
>> Thanks,
>> Arnaud
>>
>>>
>>> Thanks,
>>> Tanmay
>>>
>>>
>>>> Regards,
>>>> Arnaud
>>>>
>>>>> +
>>>>> +#endif /* _LINUX_VIRTIO_RPMSG_H */
>>>>
>>>
>>
>
^ permalink raw reply
* Re: [PATCH v6 06/16] iio: core: create local __iio_chan_prefix_emit() for reuse
From: Rodrigo Alencar @ 2026-06-19 7:43 UTC (permalink / raw)
To: Andy Shevchenko, Rodrigo Alencar
Cc: Nuno Sá, rodrigo.alencar, linux-iio, devicetree,
linux-kernel, linux-doc, linux-hardening, Lars-Peter Clausen,
Michael Hennerich, Jonathan Cameron, David Lechner,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
Gustavo A. R. Silva
In-Reply-To: <ajQ1bZSNHQ96pyJx@ashevche-desk.local>
On 18/06/26 21:14, Andy Shevchenko wrote:
> On Thu, Jun 18, 2026 at 05:14:19PM +0100, Rodrigo Alencar wrote:
> > On 18/06/26 16:06, Nuno Sá wrote:
> > > On Thu, Jun 18, 2026 at 02:27:22PM +0100, Rodrigo Alencar via B4 Relay wrote:
>
> ...
>
> > > > + dev_attr->attr.name = kasprintf(GFP_KERNEL, "%s%s", prefix, postfix);
> > > > + if (!dev_attr->attr.name)
> > > > return -ENOMEM;
> > >
> > > I don't oppose the change. Looks like a nice cleanup.
>
> May I oppose it? I found use scnprintf() is harder to follow in comparison to
> nice kasprintf() that takes care for the dynamically allocated buffer.
In the next patch the function is reused in a sysfs attribute read handler,
a context wich would not be nice to have dynamic allocation. vscnprintf() is
the main building block of sysfs_emit() which limits the buffer length to
a page size, so I used scnprintf() trying not to deviate much from that.
kasprintf() it is still used in the caller, where the logic was a bit confusing
as it tried to avoid multiple allocations.
> Also there is a chance to get a name silently cut due to insufficient space.
> Besides that this function can't be used (again due to 'c') in kasprintf()-like
> wrapper. I do not consider this as a good approach. Have you looked at seq_buf
> instead?
NAME_MAX is not the maximum length a filename can have? I suppose there should be
enough space for the channel-prefix. Indeed, seq_buf can be used and it cleans up
things a bit as it tracks the the position in the buffer.
>
> > > But bear in mind this very sensible as any subtle mistake means ABI breakage.
>
> Which immediately raises a question of test coverage. Do we have one? If not,
> this code must be accompanied with one.
Agreed. Will see to have tests for v7.
> > Yes! I tried to be careful... this is dangerous stuff!
--
Kind regards,
Rodrigo Alencar
^ permalink raw reply
* [PATCH v2 4/4] rv/rtapp: Add wakeup monitor
From: Nam Cao @ 2026-06-19 7:21 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>
Add a wakeup monitor to detect a lower-priority task waking up a
higher-priority task.
The rtapp/sleep monitor already detects this. However, that monitor
triggers an error in the context of the wakee task and user only gets
the stacktrace of that task. It is also extremely useful to get the
stacktrace of the waker task, which this monitor offers. In other
words, this monitor complements the rtapp/sleep monitor.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Documentation/trace/rv/monitor_rtapp.rst | 20 +++
kernel/trace/rv/Kconfig | 1 +
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/monitors/rtapp/Kconfig | 2 +-
kernel/trace/rv/monitors/wakeup/Kconfig | 16 ++
kernel/trace/rv/monitors/wakeup/wakeup.c | 153 ++++++++++++++++++
kernel/trace/rv/monitors/wakeup/wakeup.h | 92 +++++++++++
.../trace/rv/monitors/wakeup/wakeup_trace.h | 14 ++
kernel/trace/rv/rv_trace.h | 1 +
tools/verification/models/rtapp/wakeup.ltl | 5 +
10 files changed, 304 insertions(+), 1 deletion(-)
create mode 100644 kernel/trace/rv/monitors/wakeup/Kconfig
create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.c
create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.h
create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup_trace.h
create mode 100644 tools/verification/models/rtapp/wakeup.ltl
diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index 502d3ea412eb..238b59395ff5 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -124,3 +124,23 @@ to handle some special cases:
real-time-safe because preemption is disabled for the duration.
- `FUTEX_LOCK_PI` is included in the allowlist for the same reason as
`BLOCK_ON_RT_MUTEX`.
+
+Monitor wakeup
+++++++++++++++
+
+The `wakeup` monitor reports real-time threads being woken by lower-priority threads,
+which is a hint of priority inversion. Its specification is::
+
+ RULE = always (((RT and USER_THREAD) imply
+ (not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or ALLOWLIST))
+
+ ALLOWLIST = BLOCK_ON_RT_MUTEX
+ or FUTEX_LOCK_PI
+
+The `sleep` monitor already reports this type of problem. The difference is the
+context in which the problem is reported. While the `sleep` monitor reports the problem
+in the context of the wakee, this `wakeup` monitor reports the problem in the context of
+the waker. This monitor complement the `sleep` monitor, giving user better
+understanding of the issue. For instance, to debug a lower-priority task waking a
+higher-priority task scenario, user can enable both `wakeup` monitor and `sleep`
+monitor to get the stack traces of both tasks.
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 3884b14df375..4d3a14a0bac2 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -76,6 +76,7 @@ source "kernel/trace/rv/monitors/opid/Kconfig"
source "kernel/trace/rv/monitors/rtapp/Kconfig"
source "kernel/trace/rv/monitors/pagefault/Kconfig"
source "kernel/trace/rv/monitors/sleep/Kconfig"
+source "kernel/trace/rv/monitors/wakeup/Kconfig"
# Add new rtapp monitors here
source "kernel/trace/rv/monitors/stall/Kconfig"
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 94498da35b37..c2c0e4142eb4 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
+obj-$(CONFIG_RV_MON_WAKEUP) += monitors/wakeup/wakeup.o
# Add new monitors here
obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/rtapp/Kconfig b/kernel/trace/rv/monitors/rtapp/Kconfig
index 1ce9370a9ba8..1fcd7a400ded 100644
--- a/kernel/trace/rv/monitors/rtapp/Kconfig
+++ b/kernel/trace/rv/monitors/rtapp/Kconfig
@@ -1,6 +1,6 @@
config RV_MON_RTAPP
depends on RV
- depends on RV_PER_TASK_MONITORS >= 2
+ depends on RV_PER_TASK_MONITORS >= 3
bool "rtapp monitor"
help
Collection of monitors to check for common problems with real-time
diff --git a/kernel/trace/rv/monitors/wakeup/Kconfig b/kernel/trace/rv/monitors/wakeup/Kconfig
new file mode 100644
index 000000000000..ec3a5c06a8c4
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_WAKEUP
+ depends on RV
+ depends on RV_MON_RTAPP
+ depends on HAVE_SYSCALL_TRACEPOINTS
+ default y
+ select LTL_MON_EVENTS_ID
+ bool "wakeup monitor"
+ help
+ This monitor detects a lower-priority task waking up a
+ higher-priority task. The RV_MON_SLEEP monitor already
+ detects this case, but this monitor detects in the context
+ of the waker task instead. This and RV_MON_SLEEP can be
+ enabled together to get the stacktrace of both the waker
+ task and the wakee task.
diff --git a/kernel/trace/rv/monitors/wakeup/wakeup.c b/kernel/trace/rv/monitors/wakeup/wakeup.c
new file mode 100644
index 000000000000..01b47416f24e
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/wakeup.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ftrace.h>
+#include <linux/tracepoint.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <rv/instrumentation.h>
+
+#define MODULE_NAME "wakeup"
+
+#include <trace/events/syscalls.h>
+#include <trace/events/sched.h>
+#include <trace/events/lock.h>
+#include <uapi/linux/futex.h>
+
+#include <rv_trace.h>
+#include <monitors/rtapp/rtapp.h>
+
+
+#ifndef __NR_futex
+#define __NR_futex (-__COUNTER__)
+#endif
+#ifndef __NR_futex_time64
+#define __NR_futex_time64 (-__COUNTER__)
+#endif
+
+#include "wakeup.h"
+#include <rv/ltl_monitor.h>
+
+static void ltl_atoms_fetch(struct task_struct *task, struct ltl_monitor *mon)
+{
+ /*
+ * This includes "actual" real-time tasks and also PI-boosted
+ * tasks. A task being PI-boosted means it is blocking an "actual"
+ * real-task, therefore it should also obey the monitor's rule,
+ * otherwise the "actual" real-task may be delayed.
+ */
+ ltl_atom_set(mon, LTL_RT, rt_or_dl_task(task));
+}
+
+static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
+{
+ ltl_atom_set(mon, LTL_WOKEN_BY_LOWER_PRIO, false);
+ ltl_atom_set(mon, LTL_WOKEN_BY_SOFTIRQ, false);
+
+ if (task_creation) {
+ ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
+ ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
+ }
+
+ ltl_atom_set(mon, LTL_USER_THREAD, !(task->flags & PF_KTHREAD));
+}
+
+static void handle_sched_waking(void *data, struct task_struct *task)
+{
+ if (in_task()) {
+ if (current->prio > task->prio)
+ ltl_atom_pulse(task, LTL_WOKEN_BY_LOWER_PRIO, true);
+ } else if (in_serving_softirq()) {
+ ltl_atom_pulse(task, LTL_WOKEN_BY_SOFTIRQ, true);
+ }
+}
+
+static void handle_contention_begin(void *data, void *lock, unsigned int flags)
+{
+ if (flags & LCB_F_RT)
+ ltl_atom_update(current, LTL_BLOCK_ON_RT_MUTEX, true);
+}
+
+static void handle_contention_end(void *data, void *lock, int ret)
+{
+ ltl_atom_update(current, LTL_BLOCK_ON_RT_MUTEX, false);
+}
+
+static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
+{
+ unsigned long args[6];
+ int op, cmd;
+
+ switch (id) {
+ case __NR_futex:
+ case __NR_futex_time64:
+ syscall_get_arguments(current, regs, args);
+ op = args[1];
+ cmd = op & FUTEX_CMD_MASK;
+
+ switch (cmd) {
+ case FUTEX_LOCK_PI:
+ case FUTEX_LOCK_PI2:
+ ltl_atom_update(current, LTL_FUTEX_LOCK_PI, true);
+ break;
+ }
+ break;
+ }
+}
+
+static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
+{
+ ltl_atom_update(current, LTL_FUTEX_LOCK_PI, false);
+}
+
+static int enable_wakeup(void)
+{
+ int retval;
+
+ retval = ltl_monitor_init();
+ if (retval)
+ return retval;
+
+ rv_attach_trace_probe("rtapp_wakeup", sched_waking, handle_sched_waking);
+ rv_attach_trace_probe("rtapp_wakeup", contention_begin, handle_contention_begin);
+ rv_attach_trace_probe("rtapp_wakeup", contention_end, handle_contention_end);
+ rv_attach_trace_probe("rtapp_wakeup", sys_enter, handle_sys_enter);
+ rv_attach_trace_probe("rtapp_wakeup", sys_exit, handle_sys_exit);
+
+ return 0;
+}
+
+static void disable_wakeup(void)
+{
+ rv_detach_trace_probe("rtapp_wakeup", sched_waking, handle_sched_waking);
+ rv_detach_trace_probe("rtapp_wakeup", contention_begin, handle_contention_begin);
+ rv_detach_trace_probe("rtapp_wakeup", contention_end, handle_contention_end);
+ rv_detach_trace_probe("rtapp_wakeup", sys_enter, handle_sys_enter);
+ rv_detach_trace_probe("rtapp_wakeup", sys_exit, handle_sys_exit);
+
+ ltl_monitor_destroy();
+}
+
+static struct rv_monitor rv_wakeup = {
+ .name = "wakeup",
+ .description = "Monitor that real-time tasks are not woken by lower-priority tasks",
+ .enable = enable_wakeup,
+ .disable = disable_wakeup,
+};
+
+static int __init register_wakeup(void)
+{
+ return rv_register_monitor(&rv_wakeup, &rv_rtapp);
+}
+
+static void __exit unregister_wakeup(void)
+{
+ rv_unregister_monitor(&rv_wakeup);
+}
+
+module_init(register_wakeup);
+module_exit(unregister_wakeup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nam Cao <namcao@linutronix.de>");
+MODULE_DESCRIPTION("Monitor that real-time tasks are not woken by lower-priority tasks");
diff --git a/kernel/trace/rv/monitors/wakeup/wakeup.h b/kernel/trace/rv/monitors/wakeup/wakeup.h
new file mode 100644
index 000000000000..6f80da64e0e1
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/wakeup.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * C implementation of Buchi automaton, automatically generated by
+ * tools/verification/rvgen from the linear temporal logic specification.
+ * For further information, see kernel documentation:
+ * Documentation/trace/rv/linear_temporal_logic.rst
+ */
+
+#include <linux/rv.h>
+
+#define MONITOR_NAME wakeup
+
+enum ltl_atom {
+ LTL_BLOCK_ON_RT_MUTEX,
+ LTL_FUTEX_LOCK_PI,
+ LTL_RT,
+ LTL_USER_THREAD,
+ LTL_WOKEN_BY_LOWER_PRIO,
+ LTL_WOKEN_BY_SOFTIRQ,
+ LTL_NUM_ATOM
+};
+static_assert(LTL_NUM_ATOM <= RV_MAX_LTL_ATOM);
+
+static const char *ltl_atom_str(enum ltl_atom atom)
+{
+ static const char *const names[] = {
+ "bl_on_rt_mu",
+ "fu_lo_pi",
+ "rt",
+ "us_th",
+ "wo_lo_pr",
+ "wo_so",
+ };
+
+ return names[atom];
+}
+
+enum ltl_buchi_state {
+ S0,
+ RV_NUM_BA_STATES
+};
+static_assert(RV_NUM_BA_STATES <= RV_MAX_BA_STATES);
+
+static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
+{
+ bool woken_by_softirq = test_bit(LTL_WOKEN_BY_SOFTIRQ, mon->atoms);
+ bool woken_by_lower_prio = test_bit(LTL_WOKEN_BY_LOWER_PRIO, mon->atoms);
+ bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
+ bool rt = test_bit(LTL_RT, mon->atoms);
+ bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+ bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+ bool val9 = block_on_rt_mutex || futex_lock_pi;
+ bool val6 = !woken_by_softirq;
+ bool val5 = !woken_by_lower_prio;
+ bool val8 = val5 && val6;
+ bool val10 = val8 || val9;
+ bool val3 = !user_thread;
+ bool val2 = !rt;
+ bool val4 = val2 || val3;
+ bool val11 = val4 || val10;
+
+ if (val11)
+ __set_bit(S0, mon->states);
+}
+
+static void
+ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned long *next)
+{
+ bool woken_by_softirq = test_bit(LTL_WOKEN_BY_SOFTIRQ, mon->atoms);
+ bool woken_by_lower_prio = test_bit(LTL_WOKEN_BY_LOWER_PRIO, mon->atoms);
+ bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
+ bool rt = test_bit(LTL_RT, mon->atoms);
+ bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+ bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+ bool val9 = block_on_rt_mutex || futex_lock_pi;
+ bool val6 = !woken_by_softirq;
+ bool val5 = !woken_by_lower_prio;
+ bool val8 = val5 && val6;
+ bool val10 = val8 || val9;
+ bool val3 = !user_thread;
+ bool val2 = !rt;
+ bool val4 = val2 || val3;
+ bool val11 = val4 || val10;
+
+ switch (state) {
+ case S0:
+ if (val11)
+ __set_bit(S0, next);
+ break;
+ }
+}
diff --git a/kernel/trace/rv/monitors/wakeup/wakeup_trace.h b/kernel/trace/rv/monitors/wakeup/wakeup_trace.h
new file mode 100644
index 000000000000..7e056183f920
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/wakeup_trace.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_WAKEUP
+DEFINE_EVENT(event_ltl_monitor_id, event_wakeup,
+ TP_PROTO(struct task_struct *task, char *states, char *atoms, char *next),
+ TP_ARGS(task, states, atoms, next));
+DEFINE_EVENT(error_ltl_monitor_id, error_wakeup,
+ TP_PROTO(struct task_struct *task),
+ TP_ARGS(task));
+#endif /* CONFIG_RV_MON_WAKEUP */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 9622c269789c..2f8a932432c9 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -241,6 +241,7 @@ DECLARE_EVENT_CLASS(error_ltl_monitor_id,
);
#include <monitors/pagefault/pagefault_trace.h>
#include <monitors/sleep/sleep_trace.h>
+#include <monitors/wakeup/wakeup_trace.h>
// Add new monitors based on CONFIG_LTL_MON_EVENTS_ID here
#endif /* CONFIG_LTL_MON_EVENTS_ID */
diff --git a/tools/verification/models/rtapp/wakeup.ltl b/tools/verification/models/rtapp/wakeup.ltl
new file mode 100644
index 000000000000..a5d63ca0811a
--- /dev/null
+++ b/tools/verification/models/rtapp/wakeup.ltl
@@ -0,0 +1,5 @@
+RULE = always (((RT and USER_THREAD) imply
+ (not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or ALLOWLIST))
+
+ALLOWLIST = BLOCK_ON_RT_MUTEX
+ or FUTEX_LOCK_PI
--
2.47.3
^ permalink raw reply related
* [PATCH v2 3/4] rv/rtapp/sleep: Stop monitoring kernel threads
From: Nam Cao @ 2026-06-19 7:21 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
linux-kernel
Cc: Nam Cao, Sebastian Andrzej Siewior
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>
The rtapp/sleep monitor's primary purpose is detecting common mistakes
with user-space real-time design. Monitoring real-time issues with
kernel threads is a bonus.
However, accomodating kernel threads complicates the monitor due to
the edge cases which is seen by the monitor as lower-priority task
waking higher-priority task:
- kthread_stop() wakes up the task in order to stop it.
- The rcu thread and migration thread can be woken by any task.
- The ktimerd thread is woken near the end of irq_exit_rcu(), where
the preempt counter is "broken" and falsely says this is task
context. This requires the monitor to use the hardirq_context flag
instead of the preempt counter.
Beside complicating the monitor, the final case also requires enabling
CONFIG_TRACE_IRQFLAGS (so that "hardirq_context" can be used). This
adds overhead to the kernel even when the monitor is not active. This
may be an obstacle to enabling this monitor in distros' kernels.
Furthermore, kernel threads usually are started before the monitor is
enabled. Consequently, the threads' states (i.o.w. the monitor's
atomic propositions for the threads) are not fully known to the
monitor. As a result, the kernel threads mostly cannot be monitored.
Overall, the downsides of accomodating kernel threads outweights the
benefits. Thus, exclude kernel threads to simplify the monitor.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Documentation/trace/rv/monitor_rtapp.rst | 22 ++---
kernel/trace/rv/monitors/sleep/Kconfig | 1 -
kernel/trace/rv/monitors/sleep/sleep.c | 39 +-------
kernel/trace/rv/monitors/sleep/sleep.h | 104 +++++++++-------------
tools/verification/models/rtapp/sleep.ltl | 7 +-
5 files changed, 54 insertions(+), 119 deletions(-)
diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index 570be67a8f3b..502d3ea412eb 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -93,9 +93,9 @@ assessment.
The monitor's specification is::
- RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
+ RULE = always ((RT and SLEEP and USER_THREAD) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
- RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
+ RT_FRIENDLY_SLEEP = RT_VALID_SLEEP_REASON
and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
RT_VALID_SLEEP_REASON = FUTEX_WAIT
@@ -110,23 +110,13 @@ The monitor's specification is::
or WOKEN_BY_HARDIRQ
or WOKEN_BY_NMI
or ABORT_SLEEP
- or KTHREAD_SHOULD_STOP
ALLOWLIST = BLOCK_ON_RT_MUTEX
or FUTEX_LOCK_PI
- or TASK_IS_RCU
- or TASK_IS_MIGRATION
-
-Beside the scenarios described above, this specification also handle some
-special cases:
-
- - `KERNEL_THREAD`: kernel tasks do not have any pattern that can be recognized
- as valid real-time sleeping reasons. Therefore sleeping reason is not
- checked for kernel tasks.
- - `KTHREAD_SHOULD_STOP`: a non-real-time thread may stop a real-time kernel
- thread by waking it and waiting for it to exit (`kthread_stop()`). This
- wakeup is safe for real-time.
- - `ALLOWLIST`: to handle known false positives with the kernel.
+
+Beside the scenarios described above, this specification also defines an allow list
+to handle some special cases:
+
- `BLOCK_ON_RT_MUTEX` is included in the allowlist due to its implementation.
In the release path of rt_mutex, a boosted task is de-boosted before waking
the rt_mutex's waiter. Consequently, the monitor may see a real-time-unsafe
diff --git a/kernel/trace/rv/monitors/sleep/Kconfig b/kernel/trace/rv/monitors/sleep/Kconfig
index 6b7a122e7b47..d6ec3e9a91b6 100644
--- a/kernel/trace/rv/monitors/sleep/Kconfig
+++ b/kernel/trace/rv/monitors/sleep/Kconfig
@@ -5,7 +5,6 @@ config RV_MON_SLEEP
select RV_LTL_MONITOR
depends on HAVE_SYSCALL_TRACEPOINTS
depends on RV_MON_RTAPP
- select TRACE_IRQFLAGS
default y
select LTL_MON_EVENTS_ID
bool "sleep monitor"
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index 638be7d8747f..aa5a984853b5 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -43,7 +43,6 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
ltl_atom_set(mon, LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO, false);
if (task_creation) {
- ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
@@ -53,33 +52,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
}
- if (task->flags & PF_KTHREAD) {
- ltl_atom_set(mon, LTL_KERNEL_THREAD, true);
-
- /* kernel tasks do not do syscall */
- ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
- ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
- ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
- ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
-
- if (strstarts(task->comm, "migration/"))
- ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, true);
- else
- ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, false);
-
- if (strstarts(task->comm, "rcu"))
- ltl_atom_set(mon, LTL_TASK_IS_RCU, true);
- else
- ltl_atom_set(mon, LTL_TASK_IS_RCU, false);
- } else {
- ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);
- ltl_atom_set(mon, LTL_KERNEL_THREAD, false);
- ltl_atom_set(mon, LTL_TASK_IS_RCU, false);
- ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, false);
- }
-
+ ltl_atom_set(mon, LTL_USER_THREAD, !(task->flags & PF_KTHREAD));
}
static void handle_sched_set_state(void *data, struct task_struct *task, int state)
@@ -97,7 +70,7 @@ static void handle_sched_exit(void *data, bool is_switch)
static void handle_sched_waking(void *data, struct task_struct *task)
{
- if (this_cpu_read(hardirq_context)) {
+ if (in_hardirq()) {
ltl_atom_pulse(task, LTL_WOKEN_BY_HARDIRQ, true);
} else if (in_task()) {
if (current->prio <= task->prio)
@@ -181,12 +154,6 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
}
-static void handle_kthread_stop(void *data, struct task_struct *task)
-{
- /* FIXME: this could race with other tracepoint handlers */
- ltl_atom_update(task, LTL_KTHREAD_SHOULD_STOP, true);
-}
-
static int enable_sleep(void)
{
int retval;
@@ -200,7 +167,6 @@ static int enable_sleep(void)
rv_attach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
rv_attach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
rv_attach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
- rv_attach_trace_probe("rtapp_sleep", sched_kthread_stop, handle_kthread_stop);
rv_attach_trace_probe("rtapp_sleep", sys_enter, handle_sys_enter);
rv_attach_trace_probe("rtapp_sleep", sys_exit, handle_sys_exit);
return 0;
@@ -213,7 +179,6 @@ static void disable_sleep(void)
rv_detach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
rv_detach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
rv_detach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
- rv_detach_trace_probe("rtapp_sleep", sched_kthread_stop, handle_kthread_stop);
rv_detach_trace_probe("rtapp_sleep", sys_enter, handle_sys_enter);
rv_detach_trace_probe("rtapp_sleep", sys_exit, handle_sys_exit);
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 2fe2ec7edae8..44e593f41e6a 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -18,15 +18,12 @@ enum ltl_atom {
LTL_EPOLL_WAIT,
LTL_FUTEX_LOCK_PI,
LTL_FUTEX_WAIT,
- LTL_KERNEL_THREAD,
- LTL_KTHREAD_SHOULD_STOP,
LTL_NANOSLEEP_CLOCK_REALTIME,
LTL_NANOSLEEP_TIMER_ABSTIME,
LTL_RT,
LTL_SCHEDULE_IN,
LTL_SLEEP,
- LTL_TASK_IS_MIGRATION,
- LTL_TASK_IS_RCU,
+ LTL_USER_THREAD,
LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
LTL_WOKEN_BY_HARDIRQ,
LTL_WOKEN_BY_NMI,
@@ -43,15 +40,12 @@ static const char *ltl_atom_str(enum ltl_atom atom)
"ep_wa",
"fu_lo_pi",
"fu_wa",
- "ker_th",
- "kth_sh_st",
"na_cl_re",
"na_ti_ab",
"rt",
"sch_in",
"sle",
- "ta_mi",
- "ta_rc",
+ "us_th",
"wo_eq_hi_pr",
"wo_ha",
"wo_nm",
@@ -79,46 +73,41 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
- bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+ bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
bool sleep = test_bit(LTL_SLEEP, mon->atoms);
bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
bool rt = test_bit(LTL_RT, mon->atoms);
bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
- bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
- bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val41 = task_is_rcu || task_is_migration;
- bool val42 = futex_lock_pi || val41;
- bool val5 = block_on_rt_mutex || val42;
- bool val33 = abort_sleep || kthread_should_stop;
- bool val34 = woken_by_nmi || val33;
- bool val35 = woken_by_hardirq || val34;
- bool val14 = woken_by_equal_or_higher_prio || val35;
+ bool val7 = block_on_rt_mutex || futex_lock_pi;
+ bool val32 = woken_by_nmi || abort_sleep;
+ bool val33 = woken_by_hardirq || val32;
+ bool val14 = woken_by_equal_or_higher_prio || val33;
bool val13 = !schedule_in;
bool val25 = !nanosleep_clock_realtime;
bool val26 = nanosleep_timer_abstime && val25;
bool val18 = clock_nanosleep && val26;
bool val20 = val18 || epoll_wait;
- bool val9 = futex_wait || val20;
- bool val11 = val9 || kernel_thread;
+ bool val11 = futex_wait || val20;
+ bool val3 = !user_thread;
bool val2 = !sleep;
+ bool val4 = val2 || val3;
bool val1 = !rt;
- bool val3 = val1 || val2;
+ bool val5 = val1 || val4;
- if (val3)
+ if (val5)
__set_bit(S0, mon->states);
if (val11 && val13)
__set_bit(S1, mon->states);
if (val11 && val14)
__set_bit(S4, mon->states);
- if (val5)
+ if (val7)
__set_bit(S5, mon->states);
}
@@ -129,130 +118,125 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
- bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+ bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
bool sleep = test_bit(LTL_SLEEP, mon->atoms);
bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
bool rt = test_bit(LTL_RT, mon->atoms);
bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
- bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
- bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val41 = task_is_rcu || task_is_migration;
- bool val42 = futex_lock_pi || val41;
- bool val5 = block_on_rt_mutex || val42;
- bool val33 = abort_sleep || kthread_should_stop;
- bool val34 = woken_by_nmi || val33;
- bool val35 = woken_by_hardirq || val34;
- bool val14 = woken_by_equal_or_higher_prio || val35;
+ bool val7 = block_on_rt_mutex || futex_lock_pi;
+ bool val32 = woken_by_nmi || abort_sleep;
+ bool val33 = woken_by_hardirq || val32;
+ bool val14 = woken_by_equal_or_higher_prio || val33;
bool val13 = !schedule_in;
bool val25 = !nanosleep_clock_realtime;
bool val26 = nanosleep_timer_abstime && val25;
bool val18 = clock_nanosleep && val26;
bool val20 = val18 || epoll_wait;
- bool val9 = futex_wait || val20;
- bool val11 = val9 || kernel_thread;
+ bool val11 = futex_wait || val20;
+ bool val3 = !user_thread;
bool val2 = !sleep;
+ bool val4 = val2 || val3;
bool val1 = !rt;
- bool val3 = val1 || val2;
+ bool val5 = val1 || val4;
switch (state) {
case S0:
- if (val3)
+ if (val5)
__set_bit(S0, next);
if (val11 && val13)
__set_bit(S1, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val5)
+ if (val7)
__set_bit(S5, next);
break;
case S1:
if (val11 && val13)
__set_bit(S1, next);
- if (val13 && val3)
+ if (val13 && val5)
__set_bit(S2, next);
- if (val14 && val3)
+ if (val14 && val5)
__set_bit(S3, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val13 && val5)
+ if (val13 && val7)
__set_bit(S6, next);
- if (val14 && val5)
+ if (val14 && val7)
__set_bit(S7, next);
break;
case S2:
if (val11 && val13)
__set_bit(S1, next);
- if (val13 && val3)
+ if (val13 && val5)
__set_bit(S2, next);
- if (val14 && val3)
+ if (val14 && val5)
__set_bit(S3, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val13 && val5)
+ if (val13 && val7)
__set_bit(S6, next);
- if (val14 && val5)
+ if (val14 && val7)
__set_bit(S7, next);
break;
case S3:
- if (val3)
+ if (val5)
__set_bit(S0, next);
if (val11 && val13)
__set_bit(S1, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val5)
+ if (val7)
__set_bit(S5, next);
break;
case S4:
- if (val3)
+ if (val5)
__set_bit(S0, next);
if (val11 && val13)
__set_bit(S1, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val5)
+ if (val7)
__set_bit(S5, next);
break;
case S5:
- if (val3)
+ if (val5)
__set_bit(S0, next);
if (val11 && val13)
__set_bit(S1, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val5)
+ if (val7)
__set_bit(S5, next);
break;
case S6:
if (val11 && val13)
__set_bit(S1, next);
- if (val13 && val3)
+ if (val13 && val5)
__set_bit(S2, next);
- if (val14 && val3)
+ if (val14 && val5)
__set_bit(S3, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val13 && val5)
+ if (val13 && val7)
__set_bit(S6, next);
- if (val14 && val5)
+ if (val14 && val7)
__set_bit(S7, next);
break;
case S7:
- if (val3)
+ if (val5)
__set_bit(S0, next);
if (val11 && val13)
__set_bit(S1, next);
if (val11 && val14)
__set_bit(S4, next);
- if (val5)
+ if (val7)
__set_bit(S5, next);
break;
}
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 5923e58d7810..4d78fdd204c0 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -1,6 +1,6 @@
-RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
+RULE = always ((RT and SLEEP and USER_THREAD) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
-RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
+RT_FRIENDLY_SLEEP = RT_VALID_SLEEP_REASON
and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
RT_VALID_SLEEP_REASON = FUTEX_WAIT
@@ -15,9 +15,6 @@ RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
or WOKEN_BY_HARDIRQ
or WOKEN_BY_NMI
or ABORT_SLEEP
- or KTHREAD_SHOULD_STOP
ALLOWLIST = BLOCK_ON_RT_MUTEX
or FUTEX_LOCK_PI
- or TASK_IS_RCU
- or TASK_IS_MIGRATION
--
2.47.3
^ permalink raw reply related
* [PATCH v2 2/4] rv/rtapp/sleep: Update nanosleep rule
From: Nam Cao @ 2026-06-19 7:21 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>
CLOCK_REALTIME is the only clock that often is misused in real-time
applications. The other clocks either are safe for real-time uses
(CLOCK_TAI, CLOCK_MONOTONIC, CLOCK_BOOTTIME) or are unlikely to be misused
(CLOCK_AUX, CLOCK_PROCESS_CPUTIME_ID).
Update the monitor to only warn about CLOCK_REALTIME.
While at it, update the out-of-sync documentation.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Documentation/trace/rv/monitor_rtapp.rst | 17 +++++---
kernel/trace/rv/monitors/sleep/sleep.c | 12 ++----
kernel/trace/rv/monitors/sleep/sleep.h | 52 +++++++++++------------
tools/verification/models/rtapp/sleep.ltl | 2 +-
4 files changed, 39 insertions(+), 44 deletions(-)
diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index 01656bf7080a..570be67a8f3b 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -51,12 +51,13 @@ The `sleep` monitor reports real-time threads sleeping in a manner that may
cause undesirable latency. Real-time applications should only put a real-time
thread to sleep for one of the following reasons:
- - Cyclic work: real-time thread sleeps waiting for the next cycle. For this
- case, only the `clock_nanosleep` syscall should be used with `TIMER_ABSTIME`
- (to avoid time drift) and `CLOCK_MONOTONIC` (to avoid the clock being
- changed). No other method is safe for real-time. For example, threads
- waiting for timerfd can be woken by softirq which provides no real-time
- guarantee.
+ - Cyclic work: real-time thread sleeps waiting for the next
+ cycle. For this case, only the `clock_nanosleep` syscall should be
+ used with `TIMER_ABSTIME` (to avoid time drift). Additionally,
+ `CLOCK_REALTIME` should not be used (to avoid the clock being
+ changed). No other method is safe for real-time. For example,
+ threads waiting for timerfd can be woken by softirq which provides
+ no real-time guarantee.
- Real-time thread waiting for something to happen (e.g. another thread
releasing shared resources, or a completion signal from another thread). In
this case, only futexes (FUTEX_LOCK_PI, FUTEX_LOCK_PI2 or one of
@@ -99,14 +100,16 @@ The monitor's specification is::
RT_VALID_SLEEP_REASON = FUTEX_WAIT
or RT_FRIENDLY_NANOSLEEP
+ or EPOLL_WAIT
RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
and NANOSLEEP_TIMER_ABSTIME
- and NANOSLEEP_CLOCK_MONOTONIC
+ and not NANOSLEEP_CLOCK_REALTIME
RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
or WOKEN_BY_HARDIRQ
or WOKEN_BY_NMI
+ or ABORT_SLEEP
or KTHREAD_SHOULD_STOP
ALLOWLIST = BLOCK_ON_RT_MUTEX
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index d6b677fab8f8..638be7d8747f 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -44,8 +44,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
if (task_creation) {
ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
+ ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
@@ -60,8 +59,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
/* kernel tasks do not do syscall */
ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
+ ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
@@ -136,8 +134,7 @@ static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
case __NR_clock_nanosleep_time64:
#endif
syscall_get_arguments(current, regs, args);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, args[0] == CLOCK_MONOTONIC);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, args[0] == CLOCK_TAI);
+ ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, args[0] == CLOCK_REALTIME);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, args[1] == TIMER_ABSTIME);
ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, true);
break;
@@ -178,8 +175,7 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
- ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
+ ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 403dc2852c52..2fe2ec7edae8 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -20,8 +20,7 @@ enum ltl_atom {
LTL_FUTEX_WAIT,
LTL_KERNEL_THREAD,
LTL_KTHREAD_SHOULD_STOP,
- LTL_NANOSLEEP_CLOCK_MONOTONIC,
- LTL_NANOSLEEP_CLOCK_TAI,
+ LTL_NANOSLEEP_CLOCK_REALTIME,
LTL_NANOSLEEP_TIMER_ABSTIME,
LTL_RT,
LTL_SCHEDULE_IN,
@@ -46,8 +45,7 @@ static const char *ltl_atom_str(enum ltl_atom atom)
"fu_wa",
"ker_th",
"kth_sh_st",
- "na_cl_mo",
- "na_cl_ta",
+ "na_cl_re",
"na_ti_ab",
"rt",
"sch_in",
@@ -87,8 +85,7 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
bool rt = test_bit(LTL_RT, mon->atoms);
bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
- bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
- bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
+ bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
@@ -97,17 +94,17 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val42 = task_is_rcu || task_is_migration;
- bool val43 = futex_lock_pi || val42;
- bool val5 = block_on_rt_mutex || val43;
- bool val34 = abort_sleep || kthread_should_stop;
- bool val35 = woken_by_nmi || val34;
- bool val36 = woken_by_hardirq || val35;
- bool val14 = woken_by_equal_or_higher_prio || val36;
+ bool val41 = task_is_rcu || task_is_migration;
+ bool val42 = futex_lock_pi || val41;
+ bool val5 = block_on_rt_mutex || val42;
+ bool val33 = abort_sleep || kthread_should_stop;
+ bool val34 = woken_by_nmi || val33;
+ bool val35 = woken_by_hardirq || val34;
+ bool val14 = woken_by_equal_or_higher_prio || val35;
bool val13 = !schedule_in;
- bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
- bool val27 = nanosleep_timer_abstime && val26;
- bool val18 = clock_nanosleep && val27;
+ bool val25 = !nanosleep_clock_realtime;
+ bool val26 = nanosleep_timer_abstime && val25;
+ bool val18 = clock_nanosleep && val26;
bool val20 = val18 || epoll_wait;
bool val9 = futex_wait || val20;
bool val11 = val9 || kernel_thread;
@@ -138,8 +135,7 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
bool rt = test_bit(LTL_RT, mon->atoms);
bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
- bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
- bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
+ bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
@@ -148,17 +144,17 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val42 = task_is_rcu || task_is_migration;
- bool val43 = futex_lock_pi || val42;
- bool val5 = block_on_rt_mutex || val43;
- bool val34 = abort_sleep || kthread_should_stop;
- bool val35 = woken_by_nmi || val34;
- bool val36 = woken_by_hardirq || val35;
- bool val14 = woken_by_equal_or_higher_prio || val36;
+ bool val41 = task_is_rcu || task_is_migration;
+ bool val42 = futex_lock_pi || val41;
+ bool val5 = block_on_rt_mutex || val42;
+ bool val33 = abort_sleep || kthread_should_stop;
+ bool val34 = woken_by_nmi || val33;
+ bool val35 = woken_by_hardirq || val34;
+ bool val14 = woken_by_equal_or_higher_prio || val35;
bool val13 = !schedule_in;
- bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
- bool val27 = nanosleep_timer_abstime && val26;
- bool val18 = clock_nanosleep && val27;
+ bool val25 = !nanosleep_clock_realtime;
+ bool val26 = nanosleep_timer_abstime && val25;
+ bool val18 = clock_nanosleep && val26;
bool val20 = val18 || epoll_wait;
bool val9 = futex_wait || val20;
bool val11 = val9 || kernel_thread;
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 464c84b9df87..5923e58d7810 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -9,7 +9,7 @@ RT_VALID_SLEEP_REASON = FUTEX_WAIT
RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
and NANOSLEEP_TIMER_ABSTIME
- and (NANOSLEEP_CLOCK_MONOTONIC or NANOSLEEP_CLOCK_TAI)
+ and not NANOSLEEP_CLOCK_REALTIME
RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
or WOKEN_BY_HARDIRQ
--
2.47.3
^ permalink raw reply related
* [PATCH v2 1/4] rv/rtapp/sleep: Make the error more informative for user
From: Nam Cao @ 2026-06-19 7:21 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>
The rtapp/sleep monitor detects real-time tasks which go to sleep in an
real-time-unsafe manner. If this happen, the monitor triggers a trace event
in the sched_wakeup tracepoint's handler.
However, the invoking context of that trace event is not the most
informative, because of the stack trace of that event is the wakeup's code
path which is not very helpful:
74.669317: rv:error_sleep: condvar[254]: violation detected
ltl_validate+0x345 ([kernel.kallsyms])
handle_sched_wakeup+0x34 ([kernel.kallsyms])
ttwu_do_activate+0xff ([kernel.kallsyms])
sched_ttwu_pending+0x104 ([kernel.kallsyms])
__flush_smp_call_function_queue+0x15b ([kernel.kallsyms])
__sysvec_call_function_single+0x18 ([kernel.kallsyms])
sysvec_call_function_single+0x66 ([kernel.kallsyms])
asm_sysvec_call_function_single+0x1a ([kernel.kallsyms])
pv_native_safe_halt+0xf ([kernel.kallsyms])
default_idle+0x9 ([kernel.kallsyms])
default_idle_call+0x33 ([kernel.kallsyms])
do_idle+0x234 ([kernel.kallsyms])
cpu_startup_entry+0x24 ([kernel.kallsyms])
start_secondary+0xf8 ([kernel.kallsyms])
common_startup_64+0x13e ([kernel.kallsyms])
What would be much more valuable is the stack trace of the task itself.
Instead of using the sched_wakeup tracepoint, use the sched_exit
tracepoint. This makes the event happen in the task's context, making
the stack trace far more informative for user:
rv:error_sleep: condvar[254]: violation detected
ltl_validate+0x345 ([kernel.kallsyms])
handle_sched_exit+0x39 ([kernel.kallsyms])
__schedule+0x80f ([kernel.kallsyms])
schedule+0x22 ([kernel.kallsyms])
futex_do_wait+0x33 ([kernel.kallsyms])
__futex_wait+0x8c ([kernel.kallsyms])
futex_wait+0x73 ([kernel.kallsyms])
do_futex+0xc6 ([kernel.kallsyms])
__x64_sys_futex+0x121 ([kernel.kallsyms])
do_syscall_64+0xf3 ([kernel.kallsyms])
entry_SYSCALL_64_after_hwframe+0x77 ([kernel.kallsyms])
__futex_abstimed_wait_common64+0xc6 (inlined)
__futex_abstimed_wait_common+0xc6 (/usr/lib/x86_64-linux-gnu/libc.so.6)
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Documentation/trace/rv/monitor_rtapp.rst | 2 +-
kernel/trace/rv/monitors/sleep/sleep.c | 10 +++++-----
kernel/trace/rv/monitors/sleep/sleep.h | 14 +++++++-------
tools/verification/models/rtapp/sleep.ltl | 2 +-
4 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index c8104eda924a..01656bf7080a 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -95,7 +95,7 @@ The monitor's specification is::
RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
- and ((not WAKE) until RT_FRIENDLY_WAKE)
+ and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
RT_VALID_SLEEP_REASON = FUTEX_WAIT
or RT_FRIENDLY_NANOSLEEP
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index 8dfe5ec13e19..d6b677fab8f8 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -36,7 +36,7 @@ static void ltl_atoms_fetch(struct task_struct *task, struct ltl_monitor *mon)
static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
{
ltl_atom_set(mon, LTL_SLEEP, false);
- ltl_atom_set(mon, LTL_WAKE, false);
+ ltl_atom_set(mon, LTL_SCHEDULE_IN, false);
ltl_atom_set(mon, LTL_ABORT_SLEEP, false);
ltl_atom_set(mon, LTL_WOKEN_BY_HARDIRQ, false);
ltl_atom_set(mon, LTL_WOKEN_BY_NMI, false);
@@ -92,9 +92,9 @@ static void handle_sched_set_state(void *data, struct task_struct *task, int sta
ltl_atom_pulse(task, LTL_ABORT_SLEEP, true);
}
-static void handle_sched_wakeup(void *data, struct task_struct *task)
+static void handle_sched_exit(void *data, bool is_switch)
{
- ltl_atom_pulse(task, LTL_WAKE, true);
+ ltl_atom_pulse(current, LTL_SCHEDULE_IN, true);
}
static void handle_sched_waking(void *data, struct task_struct *task)
@@ -200,7 +200,7 @@ static int enable_sleep(void)
return retval;
rv_attach_trace_probe("rtapp_sleep", sched_waking, handle_sched_waking);
- rv_attach_trace_probe("rtapp_sleep", sched_wakeup, handle_sched_wakeup);
+ rv_attach_trace_probe("rtapp_sleep", sched_exit_tp, handle_sched_exit);
rv_attach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
rv_attach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
rv_attach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
@@ -213,7 +213,7 @@ static int enable_sleep(void)
static void disable_sleep(void)
{
rv_detach_trace_probe("rtapp_sleep", sched_waking, handle_sched_waking);
- rv_detach_trace_probe("rtapp_sleep", sched_wakeup, handle_sched_wakeup);
+ rv_detach_trace_probe("rtapp_sleep", sched_exit_tp, handle_sched_exit);
rv_detach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
rv_detach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
rv_detach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 95dc2727c059..403dc2852c52 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -24,10 +24,10 @@ enum ltl_atom {
LTL_NANOSLEEP_CLOCK_TAI,
LTL_NANOSLEEP_TIMER_ABSTIME,
LTL_RT,
+ LTL_SCHEDULE_IN,
LTL_SLEEP,
LTL_TASK_IS_MIGRATION,
LTL_TASK_IS_RCU,
- LTL_WAKE,
LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
LTL_WOKEN_BY_HARDIRQ,
LTL_WOKEN_BY_NMI,
@@ -50,10 +50,10 @@ static const char *ltl_atom_str(enum ltl_atom atom)
"na_cl_ta",
"na_ti_ab",
"rt",
- "sl",
+ "sch_in",
+ "sle",
"ta_mi",
"ta_rc",
- "wak",
"wo_eq_hi_pr",
"wo_ha",
"wo_nm",
@@ -81,10 +81,10 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool wake = test_bit(LTL_WAKE, mon->atoms);
bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+ bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
bool rt = test_bit(LTL_RT, mon->atoms);
bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
@@ -104,7 +104,7 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
bool val35 = woken_by_nmi || val34;
bool val36 = woken_by_hardirq || val35;
bool val14 = woken_by_equal_or_higher_prio || val36;
- bool val13 = !wake;
+ bool val13 = !schedule_in;
bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
bool val27 = nanosleep_timer_abstime && val26;
bool val18 = clock_nanosleep && val27;
@@ -132,10 +132,10 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool wake = test_bit(LTL_WAKE, mon->atoms);
bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+ bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
bool rt = test_bit(LTL_RT, mon->atoms);
bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
@@ -155,7 +155,7 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
bool val35 = woken_by_nmi || val34;
bool val36 = woken_by_hardirq || val35;
bool val14 = woken_by_equal_or_higher_prio || val36;
- bool val13 = !wake;
+ bool val13 = !schedule_in;
bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
bool val27 = nanosleep_timer_abstime && val26;
bool val18 = clock_nanosleep && val27;
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 6f26c4810f78..464c84b9df87 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -1,7 +1,7 @@
RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
- and ((not WAKE) until RT_FRIENDLY_WAKE)
+ and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
RT_VALID_SLEEP_REASON = FUTEX_WAIT
or RT_FRIENDLY_NANOSLEEP
--
2.47.3
^ permalink raw reply related
* [PATCH v2 0/4] rv: rtapp monitor update
From: Nam Cao @ 2026-06-19 7:21 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
linux-kernel
Cc: Nam Cao
A couple of minor improvements to the rtapp monitor:
- Making the monitor more informative to user by changing the
context of the tracepoint into the monitored task itself, not the
IPI wakeup path.
- and update the allow list regarding clock_nanosleep syscall.
- Stop monitoring the kernel threads to simplify the monitors.
- Add a new rtapp/wakeup monitor to give complement the rtapp/sleep
monitor.
v2..v1 https://lore.kernel.org/lkml/cover.1779176466.git.namcao@linutronix.de/
- Use clearer waker/wakee terminologies
- Fix build issue
- Add new patch "rv/rtapp/sleep: Stop monitoring kernel threads"
- Require RV_PER_TASK_MONITORS >= 3
Nam Cao (4):
rv/rtapp/sleep: Make the error more informative for user
rv/rtapp/sleep: Update nanosleep rule
rv/rtapp/sleep: Stop monitoring kernel threads
rv/rtapp: Add wakeup monitor
Documentation/trace/rv/monitor_rtapp.rst | 61 ++++---
kernel/trace/rv/Kconfig | 1 +
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/monitors/rtapp/Kconfig | 2 +-
kernel/trace/rv/monitors/sleep/Kconfig | 1 -
kernel/trace/rv/monitors/sleep/sleep.c | 59 ++-----
kernel/trace/rv/monitors/sleep/sleep.h | 142 +++++++---------
kernel/trace/rv/monitors/wakeup/Kconfig | 16 ++
kernel/trace/rv/monitors/wakeup/wakeup.c | 153 ++++++++++++++++++
kernel/trace/rv/monitors/wakeup/wakeup.h | 92 +++++++++++
.../trace/rv/monitors/wakeup/wakeup_trace.h | 14 ++
kernel/trace/rv/rv_trace.h | 1 +
tools/verification/models/rtapp/sleep.ltl | 11 +-
tools/verification/models/rtapp/wakeup.ltl | 5 +
14 files changed, 396 insertions(+), 163 deletions(-)
create mode 100644 kernel/trace/rv/monitors/wakeup/Kconfig
create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.c
create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.h
create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup_trace.h
create mode 100644 tools/verification/models/rtapp/wakeup.ltl
--
2.47.3
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox