* [PATCH RESEND 3/3] Docs/{ABI,admin-guide}/damon: fix various typoes
From: SeongJae Park @ 2026-05-20 1:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Zenghui Yu, Liam R. Howlett, David Hildenbrand, Jonathan Corbet,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, SeongJae Park,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260520012104.93602-1-sj@kernel.org>
From: Zenghui Yu <zenghui.yu@linux.dev>
``damon_target_idx`` was wrongly written as ``target_idx`` in the docs. Fix
it all over the place, as well as the wrong directory count, grammar, etc.
Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Changes from v1
- v1: https://lore.kernel.org/20260517182624.7167-1-zenghui.yu@linux.dev
- Collect Reviewed-by: tag.
- Rebase to latest mm-new.
.../ABI/testing/sysfs-kernel-mm-damon | 2 +-
Documentation/admin-guide/mm/damon/usage.rst | 18 +++++++++---------
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index ee29d4e204ffa..b73e6bc28ea5f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -452,7 +452,7 @@ Description: If 'hugepage_size' is written to the 'type' file, writing to
or reading from this file sets or gets the maximum size of the
hugepage for the filter.
-What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/core_filters/<F>/target_idx
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/core_filters/<F>/damon_target_idx
Date: Feb 2025
Contact: SeongJae Park <sj@kernel.org>
Description: If 'target' is written to the 'type' file, writing to or
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 0d6a27dc97b0a..d46875e603d86 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -97,7 +97,7 @@ comma (",").
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
- │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
+ │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,damon_target_idx,min,max
│ │ │ │ │ │ │ :ref:`dests <damon_sysfs_dests>`/nr_dests
│ │ │ │ │ │ │ │ 0/id,weight
│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds,nr_snapshots,max_nr_snapshots
@@ -374,7 +374,7 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
schemes/<N>/
------------
-In each scheme directory, eight directories (``access_pattern``, ``quotas``,
+In each scheme directory, nine directories (``access_pattern``, ``quotas``,
``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``,
``stats``, and ``tried_regions``) and three files (``action``, ``target_nid``
and ``apply_interval``) exist.
@@ -492,7 +492,7 @@ given DAMON-based operation scheme.
Under the watermarks directory, five files (``metric``, ``interval_us``,
``high``, ``mid``, and ``low``) for setting the metric, the time interval
between check of the metric, and the three watermarks exist. You can set and
-get the five values by writing to the files, respectively.
+get the five values by writing to and reading from the files, respectively.
Keywords and meanings of those that can be written to the ``metric`` file are
as below.
@@ -500,7 +500,7 @@ as below.
- none: Ignore the watermarks
- free_mem_rate: System's free memory rate (per thousand)
-The ``interval`` should written in microseconds unit.
+The ``interval_us`` should be written in microseconds unit.
.. _sysfs_filters:
@@ -528,9 +528,9 @@ in the numeric order.
Each filter directory contains nine files, namely ``type``, ``matching``,
``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max``
-and ``target_idx``. To ``type`` file, you can write the type of the filter.
-Refer to :ref:`the design doc <damon_design_damos_filters>` for available type
-names, their meaning and on what layer those are handled.
+and ``damon_target_idx``. To ``type`` file, you can write the type of the
+filter. Refer to :ref:`the design doc <damon_design_damos_filters>` for
+available type names, their meaning and on what layer those are handled.
For ``memcg`` type, you can specify the memory cgroup of the interest by
writing the path of the memory cgroup from the cgroups mount point to
@@ -540,7 +540,7 @@ files, respectively. For ``hugepage_size`` type, you can specify the minimum
and maximum size of the range (closed interval) to ``min`` and ``max`` files,
respectively. For ``target`` type, you can specify the index of the target
between the list of the DAMON context's monitoring targets list to
-``target_idx`` file.
+``damon_target_idx`` file.
You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter
is for memory that matches the ``type``. You can write ``Y`` or ``N`` to
@@ -731,7 +731,7 @@ show results using tracepoint supporting tools like ``perf``. For example::
Each line of the perf script output represents each monitoring region. The
first five fields are as usual other tracepoint outputs. The sixth field
-(``target_id=X``) shows the ide of the monitoring target of the region. The
+(``target_id=X``) shows the id of the monitoring target of the region. The
seventh field (``nr_regions=X``) shows the total number of monitoring regions
for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
--
2.47.3
^ permalink raw reply related
* [PATCH RESEND 2/3] Docs/mm/damon/design: fix three typos
From: SeongJae Park @ 2026-05-20 1:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Sakurai Shun, Liam R. Howlett, David Hildenbrand, Jonathan Corbet,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, SeongJae Park,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260520012104.93602-1-sj@kernel.org>
From: Sakurai Shun <ssh1326@icloud.com>
L140: "unsinged" -> "unsigned"
L371: "sampleing" -> "sampling"
L387: "multipled" -> "multiplied"
Signed-off-by: Sakurai Shun <ssh1326@icloud.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Changes from v2
- v2: https://lore.kernel.org/20260517073433.3015-1-ssh1326@icloud.com
- Collect Reviewed-by: and Acked-by: tags.
- Update commit subject prefix.
- Rebase to latest mm-new.
Changes from v1
- v1: https://lore.kernel.org/20260516093552.8404-1-cheesecake2960@icloud.com
- Use real author name.
Documentation/mm/damon/design.rst | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index a24f9f00d1837..2da7ca0d3d17a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -147,7 +147,7 @@ as Idle page tracking does.
Address Unit
------------
-DAMON core layer uses ``unsinged long`` type for monitoring target address
+DAMON core layer uses ``unsigned long`` type for monitoring target address
ranges. In some cases, the address space for a given operations set could be
too large to be handled with the type. ARM (32-bit) with large physical
address extension is an example. For such cases, a per-operations set
@@ -417,7 +417,7 @@ with theoretical maximum ``nr_accesses``, which can be calculated as
``aggregation interval / sampling interval``.
The mechanism calculates the ratio of access events for ``aggrs`` aggregations,
-and increases or decrease the ``sampleing interval`` and ``aggregation
+and increases or decrease the ``sampling interval`` and ``aggregation
interval`` in same ratio, if the observed access ratio is lower or higher than
the target, respectively. The ratio of the intervals change is decided in
proportion to the distance between current samples ratio and the target ratio.
@@ -433,7 +433,7 @@ The tuning is turned off by default, and need to be set explicitly by the user.
As a rule of thumbs and the Parreto principle, 4% access samples ratio target
is recommended. Note that Parreto principle (80/20 rule) has applied twice.
That is, assumes 4% (20% of 20%) DAMON-observed access events ratio (source)
-to capture 64% (80% multipled by 80%) real access events (outcomes).
+to capture 64% (80% multiplied by 80%) real access events (outcomes).
To know how user-space can use this feature via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`intervals_goal
--
2.47.3
^ permalink raw reply related
* [PATCH RESEND 0/3] mm/damon: reposting three reviewed patches
From: SeongJae Park @ 2026-05-20 1:20 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
This series reposts patches that were reviewed by the DAMON maintainer
but not yet added to mm-new, for more visibility. From their last
posted versions, only minor changes including commit subject update,
Reviewed-by: and Acked-by: tags collection, and rebasing to latest
mm-new were made by the DAMON maintainer.
Sakurai Shun (1):
Docs/mm/damon/design: fix three typos
Zenghui Yu (1):
Docs/{ABI,admin-guide}/damon: fix various typoes
niecheng (1):
mm/damon/core: clarify next_intervals_tune_sis update path
.../ABI/testing/sysfs-kernel-mm-damon | 2 +-
Documentation/admin-guide/mm/damon/usage.rst | 18 +++++++++---------
Documentation/mm/damon/design.rst | 6 +++---
mm/damon/core.c | 3 +++
4 files changed, 16 insertions(+), 13 deletions(-)
base-commit: f4f9ecd6383da01075a0be8f9c08b82152061fd0
--
2.47.3
^ permalink raw reply
* Re: [PATCH v2 5/6] gpio: remove machine hogs
From: Dmitry Torokhov @ 2026-05-20 0:46 UTC (permalink / raw)
To: Bartosz Golaszewski
Cc: Linus Walleij, Bartosz Golaszewski, Geert Uytterhoeven,
Frank Rowand, Mika Westerberg, Andy Shevchenko, Aaro Koskinen,
Janusz Krzysztofik, Tony Lindgren, Russell King, Jonathan Corbet,
Shuah Khan, linux-gpio, linux-kernel, linux-acpi,
linux-arm-kernel, linux-omap, linux-doc
In-Reply-To: <20260309-gpio-hog-fwnode-v2-5-4e61f3dbf06a@oss.qualcomm.com>
On Mon, Mar 09, 2026 at 01:42:41PM +0100, Bartosz Golaszewski wrote:
> With no more users, remove legacy machine hog API from the kernel.
>
> Reviewed-by: Linus Walleij <linusw@kernel.org>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Argh! What is the replacement for this? I have patches for rsk7203 to
use them to get rid of legacy gpio use, like this:
diff --git a/arch/sh/boards/mach-rsk/devices-rsk7203.c b/arch/sh/boards/mach-rsk/devices-rsk7203.c
index f8760a91e2f1..5bbd3b31cffb 100644
--- a/arch/sh/boards/mach-rsk/devices-rsk7203.c
+++ b/arch/sh/boards/mach-rsk/devices-rsk7203.c
@@ -12,7 +12,7 @@
#include <linux/smsc911x.h>
#include <linux/input.h>
#include <linux/io.h>
-#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
#include <linux/gpio/machine.h>
#include <linux/gpio/property.h>
#include <asm/machvec.h>
@@ -165,6 +165,19 @@ static const struct platform_device_info rsk7203_devices[] __initconst = {
},
};
+/* The base of the function GPIOs in the flat enum */
+#define SH7203_FN_BASE GPIO_FN_PINT7_PB
+
+static struct gpiod_hog rsk7203_gpio_hogs[] = {
+ GPIO_HOG("sh7203_pfc-fn", GPIO_FN_TXD0 - SH7203_FN_BASE,
+ "TXD0", GPIO_ACTIVE_HIGH, GPIOD_ASIS),
+ GPIO_HOG("sh7203_pfc-fn", GPIO_FN_RXD0 - SH7203_FN_BASE,
+ "RXD0", GPIO_ACTIVE_HIGH, GPIOD_ASIS),
+ GPIO_HOG("sh7203_pfc-fn", GPIO_FN_IRQ0_PB - SH7203_FN_BASE,
+ "IRQ0_PB", GPIO_ACTIVE_HIGH, GPIOD_ASIS),
+ { }
+};
+
static int __init rsk7203_devices_setup(void)
{
struct platform_device *pd;
@@ -172,12 +185,10 @@ static int __init rsk7203_devices_setup(void)
int i;
/* Select pins for SCIF0 */
- gpio_request(GPIO_FN_TXD0, NULL);
- gpio_request(GPIO_FN_RXD0, NULL);
+ gpiod_add_hogs(rsk7203_gpio_hogs);
/* Setup LAN9118: CS1 in 16-bit Big Endian Mode, IRQ0 at Port B */
__raw_writel(0x36db0400, 0xfffc0008); /* CS1BCR */
- gpio_request(GPIO_FN_IRQ0_PB, NULL);
error = software_node_register_node_group(rsk7203_swnodes);
if (error) {
If there is no replacement maybe we can resurrect this? Or shoudl we
have add swnode support for hogs?
Thanks.
--
Dmitry
^ permalink raw reply related
* Re: [PATCH net-next v2 2/2] net: ti: icssg: Add HSR and LRE PA statistics
From: Jakub Kicinski @ 2026-05-19 23:56 UTC (permalink / raw)
To: Luka Gejak
Cc: MD Danish Anwar, Felix Maurer, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
Roger Quadros, Andrew Lunn, Meghana Malladi, Jacob Keller,
David Carlier, Vadim Fedorenko, Kevin Hao, netdev, linux-doc,
linux-kernel, linux-arm-kernel, Vladimir Oltean
In-Reply-To: <E30AAC96-01D2-4A23-B562-126087DEB7FA@linux.dev>
On Tue, 19 May 2026 07:55:55 +0200 Luka Gejak wrote:
> On May 19, 2026 3:45:06 AM GMT+02:00, Jakub Kicinski <kuba@kernel.org> wrote:
> >On Thu, 14 May 2026 13:26:05 +0530 MD Danish Anwar wrote:
> >> Add new firmware PA statistics counters for HSR and LRE to the ethtool
> >> statistics exposed by the ICSSG driver.
> >>
> >> New statistics added:
> >> - FW_HSR_FWD_CHECK_FAIL_DROP: Packets dropped on the HSR forwarding path
> >> - FW_HSR_HE_CHECK_FAIL_DROP: Packets dropped on the HSR host egress path
> >> - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES: Frames with duplicate discard
> >> skipped
> >> - FW_LRE_CNT_UNIQUE/DUPLICATE/MULTIPLE_RX: LRE duplicate detection
> >> counters
> >> - FW_LRE_CNT_RX/TX: LRE per-port frame counters
> >> - FW_LRE_CNT_OWN_RX: Own HSR tagged frames received
> >> - FW_LRE_CNT_ERRWRONGLAN: Frames with wrong LAN identifier (PRP)
> >>
> >> Document the new HSR/LRE statistics in icssg_prueth.rst.
> >
> >To an untrained eye these stats look like stuff that could
> >be standardized across drivers.
> >
> >Luka, Felix, others on CC, do you think we should expose these
> >from HSR over netlink as "standard" offload stats different drivers
> >can plug into or not worth it?
>
> I think there is a case for standardizing part of this, but I would
> not standardize the whole set as-is.
>
> The LRE counters look generic enough to me, especially:
> - unique rx
> - duplicate rx
> - multiple rx
> - rx / tx
> - own rx
> - wrong LAN, PRP only
>
> Those are protocol/LRE concepts rather than TI firmware details, so
> exposing them from the HSR/PRP layer sounds useful. I would expect
> both the software implementation and offloaded implementations to be
> able to provide at least some of them, with unsupported counters
> omitted or reported as not available.
> I would not put the firmware check/drop counters in the same standard
> bucket, though:
> - FW_HSR_FWD_CHECK_FAIL_DROP
> - FW_HSR_HE_CHECK_FAIL_DROP
> - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES
Thanks for the breakdown!
> Those sound more like implementation/debug counters for the ICSSG
> firmware pipeline. They are still useful in ethtool driver stats, but
> I would be hesitant to bake their exact semantics into HSR UAPI.
> So my preference would be:
> 1. Keep driver-private ethtool stats for the full firmware counter set.
> 2. Add a small HSR/PRP standard stats set separately, limited to
> well-defined LRE counters.
> 3. Make the HSR layer expose them, with offload drivers plugging in via
> an optional callback or offload stats op.
> 4. Define the counters carefully, including whether they are per-HSR
> device or per-port A/B, and what PRP-only counters mean for HSR.
>
> I do not think this patch should blindly become the UAPI definition,
Not at all, the unique / multiple stats gave me pause. We should
only put in the standard API what can be easily and unambiguously
defined given the protocol spec.
> but I do think it points at a useful follow-up. If we want to avoid
> adding driver-private names first and then standardizing different
> names later, then it may be worth asking Danish to split the
> protocol-level LRE counters out and route those through a common HSR
> stats interface.
As a general policy we ask for standard stats to be added first and
ethtool to only contain what didn't fit in the standard ones.
There are some technical reasons but it's mostly a mindset thing.
^ permalink raw reply
* Re: [PATCH v11 4/5] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Jason Gunthorpe @ 2026-05-19 23:55 UTC (permalink / raw)
To: Tzung-Bi Shih, Danilo Krummrich
Cc: Arnd Bergmann, Greg Kroah-Hartman, Bartosz Golaszewski,
Linus Walleij, Benson Leung, linux-kernel, chrome-platform,
driver-core, linux-doc, linux-gpio, Rafael J. Wysocki,
Jonathan Corbet, Shuah Khan, Laurent Pinchart, Wolfram Sang,
Johan Hovold, Paul E . McKenney
In-Reply-To: <agiCQQO9KGoMS1Jj@tzungbi-laptop>
On Sat, May 16, 2026 at 10:42:09PM +0800, Tzung-Bi Shih wrote:
> On Thu, May 14, 2026 at 01:00:43PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 14, 2026 at 03:34:12AM +0000, Tzung-Bi Shih wrote:
> >
> > > To help me understand, could you elaborate on why the revocable mechanism
> > > isn't suitable here?
> >
> > Stay within one driver. Create the revokable is probe, consume it
> > within that drivers fops/etc, destroy it on remove. Do not randomly
> > pass it to other drivers.
>
> In that sense, after applying [1], does the patch make sense to you?
It is better, but you can see revokable is creating contortions that
don't make sense:
> @@ -223,11 +223,9 @@ static int cros_ec_chardev_open(struct inode *inode, struct file *filp)
> ret = blocking_notifier_chain_register(&pdata->subscribers,
> &priv->notifier);
> if (ret) {
> - scoped_guard(rwsem_read, &pdata->ec_dev_sem) {
> - if (pdata->ec_dev)
> - dev_err(pdata->ec_dev->dev,
> - "failed to register event notifier\n");
> - }
> + revocable_try_access_or_skip_scoped(&pdata->ec_rev, ec_dev)
> + dev_err(ec_dev->dev,
> + "failed to register event notifier\n");
It is impossible for ec_dev to be null here, the misc_unregister does
fence open.
> @@ -482,11 +483,12 @@ static int cros_ec_chardev_probe(struct platform_device *pdev)
> static void cros_ec_chardev_remove(struct platform_device *pdev)
> {
> struct chardev_pdata *pdata = platform_get_drvdata(pdev);
> + struct cros_ec_device *ec_dev;
>
> - blocking_notifier_chain_unregister(&pdata->ec_dev->event_notifier,
> - &pdata->relay);
> - scoped_guard(rwsem_write, &pdata->ec_dev_sem)
> - pdata->ec_dev = NULL;
> + revocable_try_access_or_skip_scoped(&pdata->ec_rev, ec_dev)
> + blocking_notifier_chain_unregister(&ec_dev->event_notifier,
> + &pdata->relay);
> + revocable_revoke(&pdata->ec_rev);
And this is complete garbage nonsense, we are in a driver bound
context about to revoke the revokable, it is not optional, it can't
fail, if it doesn't we can't skip the unregister or it will eventually
crash.
I said it before, but to re-iterate - what this scheme fails to
capture from rust is the most important detail - the driver bound
checking that confirms the content is valid without any need for
locking or possibility of failure.
Open, remove, are both bound contexts that can never fail to obtain
their protected content and don't need srcu locking.
I don't konw what Danilo thinks, but as the rust side has evolved I
think it was a mistake to combine the revocable and SRCU
together. Having two primitives would make more sense
The first is "this value is only valid under driver bound, present
your thing proving driver bound and you can get the value". This would
be fully 0 cost.
The second is "My callchain doesn't have a way to get driver bound,
so this widget will try to open a SRCU critical section that produces
it".
Each driver could have many of the first but needs only one of the
second. The second is the "code smell" that says something is not
great by not properly managing driver bound. Since a driver needs only
one of the widgets it would solve the repeated srcu problem on unbind.
From that lens you can see how troubled this C version is, it promotes
the "code smell" SRCU API into the only API and makes it first class
promoting its use and ignores/obfuscates/worsens the actual API we
want people to use: prove you have a driver bound. :(
Jason
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-05-19 23:34 UTC (permalink / raw)
To: David Woodhouse, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <32ca0a8da4bfb1e92013a7f75e0ff7541ebcd6a6.camel@infradead.org>
On 2026-05-19 3:43 PM, David Woodhouse wrote:
> On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
>> I think I now understand why I feel like I am always asking weird questions. I
>> have been thinking about how to account for downtime, so I see
>> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
>
> I do not believe in "downtime". There is no such thing.
> There is only "steal time".
Or "leap seconds" as used in the document?
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org
>
> If I recall correctly what we described in
> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
> I don't think we actually needed KVM_SET_CLOCK at all, did we?
Here I partially copied the content from the link.
The 2nd step of destination VMM is to invoke KVM_SET_CLOCK ioctl.
---
From the destination VMM process:
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
- kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+4. Before creating the vCPUs, invoke the KVM_SET_TSC_KHZ ioctl on the VM, to
+ set the scaled frequency of the guest's TSC (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+ kvmclock (guest_src) and CLOCK_REALTIME (time_src) in their respective
fields. Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
structure.
- KVM will advance the VM's kvmclock to account for elapsed time since
- recording the clock values. Note that this will cause problems in
+ KVM will restore the VM's kvmclock, accounting for elapsed time since
+ the clock values were recorded. Note that this will cause problems in
the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
between the source and destination, and a reasonably short time passes
- between the source pausing the VMs and the destination executing
- steps 4-7.
+ between the source pausing the VMs and the destination resuming them.
+ Due to the KVM_[SG]ET_CLOCK API using CLOCK_REALTIME instead of
+ CLOCK_TAI, leap seconds during the migration may also introduce errors.
--
>>>
>>> So your "MASTERCLOCK_UPDATE is pending and there is no pending
>>> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
>>> pending, then there *will* be a CLOCK_UPDATE pending.
>>
>> Suppose the VM is stopped and the master clock is active.
>
> I don't know what it means for a VM to be 'stopped'. Do you mean that
> all vCPUs happen to be experiencing steal time at the present moment?
Taking QEMU as an example, all vCPU threads remain asleep in host userspace
without having a chance to invoke KVM_RUN. As a result, none of the vCPUs can
enter KVM kernel mode to process any pending requests.
This is the state before QEMU resumes from live migration or live update.
(qemu) stop
(qemu) info status
VM status: paused
According to my understanding, older KVM versions even required the userspace
VMM to keep vCPUs in userspace to avoid racing with KVM_RUN.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/x86.c?h=v5.15#n6090
case KVM_SET_CLOCK: {
... ...
/*
* TODO: userspace has to take care of races with VCPU_RUN, so
* kvm_gen_update_masterclock() can be cut down to locked
* pvclock_update_vm_gtod_copy().
*/
>
>> Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
>> may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
>> for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
>> kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
>> KVM_REQ_CLOCK_UPDATE.
>
> You say 'Unless'... do you mean 'Until'?
Until.
>
>> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
>>
>> I do not think this scenario is important. I am just curious about the expected
>> way to implement similar code in the future :)
>
> I think that's working correctly. Until the master clock has *actually*
> been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
> disseminate the new information to its own pvclock?
Thank you very much for helping confirm this.
>
>> For the live migration scenario, the current QEMU implementation not only fails
>> to account for downtime, but also has a drift issue. That is what I would like
>> to address in QEMU.
>
> Again, restore the gTSC as accurately as possible. Probably by working
> out for *yourself* the relationships of the source and destination host
> TSCs to real time, and then reconstituting on the destination using TSC
> offset just as for live migration.
>
> And then use KVM_SET_CLOCK_GUEST too.
>
> That's what I attempted to document in
> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
> and should probably revive.
I would really appreciate it if this document could be revived. I don't see it
in your most recent v4 PATCH 7. It is very helpful as a guideline for how
userspace VMMs should take advantage of these APIs.
Thank you very much!
Dongli Zhang
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-19 23:33 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <agzq5kwzuJvd7Mh5@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 6032 bytes --]
On Tue, 2026-05-19 at 15:57 -0700, Oliver Upton wrote:
> On Tue, May 19, 2026 at 10:58:05PM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> > > And in the absence of clear evidence of a guest depending on the broken
> > > IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> > > series are any different from the multitude of bug fixes that we take
> > > every single release cycle. It is an unfortunate bug and I concur with
> > > Marc that it doesn't seem like the sort of thing a guest could rely
> > > upon.
> >
> > I find this concerning, because I've already explained this.
> >
> > There is a very real possibility of guests simply not *noticing* that
> > they had bugs in this area, as it didn't *matter* what they wrote to
> > these registers since it never worked.
> >
> > There is an even larger possibility of guests having worked around the
> > original issue by *detecting* whether the registers were actually
> > writable before choosing to use the alternative groups. And if such a
> > guest launches on a new kernel and then needs to be rolled back to an
> > older kernel, that will also break.
>
> The onus is on you to substantiate this claim. I would imagine after
> carrying the revert for so long that there must be at least one example
> of such a guest?
What? No. We have *avoided* having the bug, specifically so that we do
not find out the consequences of the bug.
> What ifs and maybes do not meet the bar, in my opinion, for preserving
> bug emulation in KVM. Of course there could be a little flexibility with
> that but we need to have some way of discriminating between bug fixes
> and genuine guest expectations around the behavior of virtual hardware.
I believe you have this completely backwards.
The expectation of KVM is that do not change guest visible behaviour if
there's any reasonable chance that it might cause problems.
A stable and mature platform doesn't get to play in its ivory tower and
randomly inflict breakage on guests because they "deserve it".
I've literally explained the potential failure modes, including the one
on rollback if a guest *does* change the group configuration and then
needs to be rolled back to the older kernel that doesn't support it.
And yes, "ifs and maybes" absolutely *are* the quality bar expected by
KVM because — again, as already explained more than once — as we
accumulate a bunch of such "unlikely" breakages in a fleet upgrade
from, say, 6.1 to 6.12, the likelihood of *one* of them actually
turning out to afflict *one* of the zoo of guest operating systems
approaches 1.
We don't get to just YOLO it.
> > > Wrong or not, this behavior is documented unambiguously. From the VGICv2
> > > UAPI documentation:
> > >
> > > """
> > > Userspace should set GICD_IIDR before setting any other registers (both
> > > KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> > > the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> > > to the interrupt group registers (GICD_IGROUPR) are ignored.
> > > """
> > >
> > > I'm not inclined to change that.
> >
> > That'll all very well... but as far as I can tell, QEMU *doesn't* set
> > GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
> > write the registers, but userspace can't. So it looks like it'll work
> > except migration will fail. Am I missing something?
>
> That's exactly it, and why I said tying up UAPI opt-in with
> guest-visible registers is a really bad idea.
>
> > But honestly, I don't care one iota about GICv2; I was only trying to
> > do the cleanup while I was there. Feel free to drop that part entirely.
> >
> > > As a way out of this whole mess, can we
> > > instead:
> > >
> > > - Allow userspace to set IIDR.Revision to 1
> > >
> > > - Drop any bug emulation from the handling of IGROUPR registers
> >
> > It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
> > one-liner that actually implements the corresponding behaviour change
> > in the IGROUPR registers.
>
> As I described earlier, this whole IIDR crap inarguably broke UAPI and
> obviously normal guest behavior (i.e. reading the register). At minimum
> we need to permit previously-valid values for IIDR, even if they carry
> no implied behaviors.
But the whole *point* of IIDR is to preserve the behaviour. To set the
IIDR and *not* have the corresponding behaviour is insanity.
> > And as explained at least twice now, it's the
> > behaviour change that's *important* here.
> >
> > The fact that it's a long-standing bug in KVM which downstream has been
> > working around for a long time doesn't matter. The unconditional
> > behavioural change *is* a bug and we should fix it.
>
> That is the nature of a bug fix. If you can provide some concrete
> evidence of a guest depending on the RAZ/WI behavior then I agree we
> need to preserve the old behavior.
>
> Otherwise I see this as a matter of principle in how we do bug fixes to
> KVM. Even if upstream took the strictest possible stance towards behavior
> changes we will invariably fail to account for some minutia.
No. Don't pretend that this is hard. KVM on x86 has been quietly
getting this right for years.
Yes, there is sometimes *some* subjectivity around it, and it's
sometimes reasonable to just unilaterally change behaviours. This is
not, and was not, once of those cases.
> > > - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> > > the VMM has written to IIDR and the revision >= 2
> >
> > That already *is* a special case, right? And you'd rather leave it as it is?
>
> Left as documented, yes. With the exception that revision == 1 writes
> not be considered opt-in to restorable IGROUPR.
Don't do that. Just leave it broken, with QEMU not even working. I'm
beyond caring about GICv2 now.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* [regression] kernel-doc: on struct_group_tagged
From: Randy Dunlap @ 2026-05-19 23:02 UTC (permalink / raw)
To: Linux Documentation, Shuicheng Lin, Jonathan Corbet,
Mauro Carvalho Chehab
Hi Shuicheng Lin.
I have hit a problem on linux-next-2026051[89].
I bisected it and the bad commit is your recent patch
commit 46d9c16115cf (HEAD)
Author: Shuicheng Lin <shuicheng.lin@intel.com>
Date: Thu May 7 02:32:32 2026 +0000
scripts/kernel-doc: Detect mismatched inline member documentation tags
which I tested, but apparently not well enough.
Both "make htmldocs" and running
scripts/kernel-doc -none -Wall include/net/page_pool/types.h
give these warnings: [*]
Warning: include/net/page_pool/types.h:105 Excess struct member 'fast' description in 'page_pool_params'
Warning: include/net/page_pool/types.h:105 Excess struct member 'slow' description in 'page_pool_params'
due to the use of struct_group_tagged in that struct.
There are no warnings from this header file after I revert commit 46d9c16115cf.
Please look into this.
*: Both of these warnings are duplicated in kernel-doc logging, but that's a
different issue.
thanks.
--
~Randy
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-19 22:57 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <1243d375846c4f4e20c229a6f09300126188fc8b.camel@infradead.org>
On Tue, May 19, 2026 at 10:58:05PM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> > And in the absence of clear evidence of a guest depending on the broken
> > IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> > series are any different from the multitude of bug fixes that we take
> > every single release cycle. It is an unfortunate bug and I concur with
> > Marc that it doesn't seem like the sort of thing a guest could rely
> > upon.
>
> I find this concerning, because I've already explained this.
>
> There is a very real possibility of guests simply not *noticing* that
> they had bugs in this area, as it didn't *matter* what they wrote to
> these registers since it never worked.
>
> There is an even larger possibility of guests having worked around the
> original issue by *detecting* whether the registers were actually
> writable before choosing to use the alternative groups. And if such a
> guest launches on a new kernel and then needs to be rolled back to an
> older kernel, that will also break.
The onus is on you to substantiate this claim. I would imagine after
carrying the revert for so long that there must be at least one example
of such a guest?
What ifs and maybes do not meet the bar, in my opinion, for preserving
bug emulation in KVM. Of course there could be a little flexibility with
that but we need to have some way of discriminating between bug fixes
and genuine guest expectations around the behavior of virtual hardware.
> > Wrong or not, this behavior is documented unambiguously. From the VGICv2
> > UAPI documentation:
> >
> > """
> > Userspace should set GICD_IIDR before setting any other registers (both
> > KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> > the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> > to the interrupt group registers (GICD_IGROUPR) are ignored.
> > """
> >
> > I'm not inclined to change that.
>
> That'll all very well... but as far as I can tell, QEMU *doesn't* set
> GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
> write the registers, but userspace can't. So it looks like it'll work
> except migration will fail. Am I missing something?
That's exactly it, and why I said tying up UAPI opt-in with
guest-visible registers is a really bad idea.
> But honestly, I don't care one iota about GICv2; I was only trying to
> do the cleanup while I was there. Feel free to drop that part entirely.
>
> > As a way out of this whole mess, can we
> > instead:
> >
> > - Allow userspace to set IIDR.Revision to 1
> >
> > - Drop any bug emulation from the handling of IGROUPR registers
>
> It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
> one-liner that actually implements the corresponding behaviour change
> in the IGROUPR registers.
As I described earlier, this whole IIDR crap inarguably broke UAPI and
obviously normal guest behavior (i.e. reading the register). At minimum
we need to permit previously-valid values for IIDR, even if they carry
no implied behaviors.
> And as explained at least twice now, it's the
> behaviour change that's *important* here.
>
> The fact that it's a long-standing bug in KVM which downstream has been
> working around for a long time doesn't matter. The unconditional
> behavioural change *is* a bug and we should fix it.
That is the nature of a bug fix. If you can provide some concrete
evidence of a guest depending on the RAZ/WI behavior then I agree we
need to preserve the old behavior.
Otherwise I see this as a matter of principle in how we do bug fixes to
KVM. Even if upstream took the strictest possible stance towards behavior
changes we will invariably fail to account for some minutia.
> > - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> > the VMM has written to IIDR and the revision >= 2
>
> That already *is* a special case, right? And you'd rather leave it as it is?
Left as documented, yes. With the exception that revision == 1 writes
not be considered opt-in to restorable IGROUPR.
Thanks,
Oliver
^ permalink raw reply
* Re: [PATCH RESEND bpf-next v10 2/8] bpf: clear list node owner and unlink before drop
From: Eduard Zingerman @ 2026-05-19 22:56 UTC (permalink / raw)
To: Kaitao Cheng
Cc: bpf, Alexei Starovoitov, linux-kernel, linux-doc, ast, memxor,
corbet, martin.lau, daniel, andrii, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, chengkaitao,
skhan, vmalik, linux-kselftest, martin.lau, clm, ihor.solodrai,
bot+bpf-ci
In-Reply-To: <d5961282-d41c-4e54-8ba2-cd08823a8c77@linux.dev>
On Mon, 2026-05-18 at 11:02 +0800, Kaitao Cheng wrote:
[...]
> > > > The patch does have a bug, however. To fix the issues we are seeing now,
> > > > I propose the additional changes below and would appreciate feedback.
> > > >
> > > > --- a/kernel/bpf/helpers.c
> > > > +++ b/kernel/bpf/helpers.c
> > > > @@ -2263,8 +2263,10 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > if (!head->next || list_empty(head))
> > > > goto unlock;
> > > > list_for_each_safe(pos, n, head) {
> > > > - WRITE_ONCE(container_of(pos,
> > > > - struct bpf_list_node_kern, list_head)->owner, NULL);
> > > > + struct bpf_list_node_kern *node;
> > > > +
> > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > + WRITE_ONCE(node->owner, BPF_PTR_POISON);
> > > > list_move_tail(pos, &drain);
> > > > }
> > > > unlock:
> > > > @@ -2272,8 +2274,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > __bpf_spin_unlock_irqrestore(spin_lock);
> > > >
> > > > while (!list_empty(&drain)) {
> > > > + struct bpf_list_node_kern *node;
> > > > +
> > > > pos = drain.next;
> > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > list_del_init(pos);
> > > > + WRITE_ONCE(node->owner, NULL);
Is CPU allowed to reorder the stores in list_del_init() and WRITE_ONCE()?
If it is, I think there is a race here.
Thread #1:
enter bpf_list_head_free()
acquire H1 lock
list_move_tail(pos, &drain); // reordered
<-- ip here -->
WRITE_ONCE(node->owner, BPF_PTR_POISON); // reordered
Thread #2:
acquire H1 lock
n = bpf_refcount_acquire()
release H1 lock
acquire H2 lock
enter __bpf_list_add()
<-- ip here -->
cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
[...]
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: David Woodhouse @ 2026-05-19 22:43 UTC (permalink / raw)
To: Dongli Zhang, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <aa68ed10-15da-4368-a986-6864843a3c44@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 7240 bytes --]
On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
> I think I now understand why I feel like I am always asking weird questions. I
> have been thinking about how to account for downtime, so I see
> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
I do not believe in "downtime". There is no such thing.
There is only "steal time".
A CPU may be off in the weeds — a vCPU suffering steal time, or even a
pCPU in SMM which is effectively the same thing — but time doesn't
stop, and neither does the TSC.
> Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:
>
> 1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
> share the same kvm_read_l1_tsc(v, ka->master_cycle_now).
Strictly, by the time we get to the end of my series, masterclock is
active *because* all the vCPUs are running at the same TSC rate (even
if the guest set them to different offsets). But OK.
> 2. Migrate the gTSC to the target VM however people want (either ablolute value
> or offset value). (Optional) Account for downtime in gTSC however people want,
> even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
>
> 3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
>
> That is why you think KVM_SET_CLOCK is no longer required if we have
> KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
> KVM_CLOCK_REALTIME.
If I recall correctly what we described in
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
I don't think we actually needed KVM_SET_CLOCK at all, did we?
We *abuse* KVM_GET_CLOCK to give us a tuple of {realtime, host TSC}
because there's actually no other way for *userspace* to get that. We
don't actually *care* about the KVM clock part.
We use the {realtime, host TSC} pair to reconstitute the guest TSC
values to correctly reflect the passing of time while the guest was in
the ether.
> It it isn't required to account any downtime for gTSC or if there is another way
> to do so, only KVM_SET_CLOCK_GUEST is enough.
Right. If you only want the guest to come back with the *same* values
in its TSC as before the migration, as if the TSC was *paused* during
the migration, then you can just restore those values and use
KVM_SET_CLOCK_GUEST. Assuming you are on modern hardware and have set
all vCPUs to the same rate (and are using this series so the *guest*
can't break masterclock for you, and you can trust the
KVM_SET_CLOCK_GUEST will work).
> >
> > > Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
> > > pending CLOCK_UPDATE.
> > >
> > > In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
> > > pvclock_vcpu_time_info?
> > >
> >
> > I'm not sure I understand that scenario.
> >
> > MASTERCLOCK_UPDATE means we have to actually recalculate the master
> > clock (which really *should* be rare, now!). And then any time we do
> > that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> > the new information. Which is why kvm_end_pvclock_update() does exactly
> > that.
> >
> > So your "MASTERCLOCK_UPDATE is pending and there is no pending
> > CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> > pending, then there *will* be a CLOCK_UPDATE pending.
>
> Suppose the VM is stopped and the master clock is active.
I don't know what it means for a VM to be 'stopped'. Do you mean that
all vCPUs happen to be experiencing steal time at the present moment?
> Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
> may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
> for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
> kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
> KVM_REQ_CLOCK_UPDATE.
You say 'Unless'... do you mean 'Until'?
> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
>
> I do not think this scenario is important. I am just curious about the expected
> way to implement similar code in the future :)
I think that's working correctly. Until the master clock has *actually*
been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
disseminate the new information to its own pvclock?
> >
> >
> > > > >
> > > > > Would it be helpful to validate that the delta is within a reasonable range,
> > > > > e.g. that the drift can never be more than five minutes (forward or backward)?
> > > >
> > > > If a guest has been running for months on a previous host and is
> > > > migrated to a new host, don't we expect that the KVM clock of the new
> > > > VM on the new host is tweaked from its default near-zero after
> > > > creation, to some large amount?
> > > >
> > >
> > > Regarding live migration, my own investigation does not show a proportional
> > > relationship between VM uptime and the amount of drift.
> >
> > You're comparing the VM on the source host, with the VM on the
> > destination post-migration.
>
> Apologies for making it confusing. I was just trying to explain why I think the
> kvm-clock drift will not be large.
Sure, but I don't care. If we have a sane API, the drift should be zero
:)
> We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
> the time interval between two vCPU hotplug events, the larger the drift.
>
> For live migration (with QEMU), I provided the equation to show that the drift
> will not be large, because it is determined by something else rather than by how
> long the VM has been running on the source server.
>
>
> For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
> that has been running for three months, the drift will be relatively larger.
>
> For QEMU live migration, migrating a guest VM that has been running on the
> source host for *three months* versus one that has been running for *one day*
> will not cause much difference in kvm-clock drift.
Right.
> For the ideal live update case (on the same host), there may be no need to
> adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
> be used to adjust kvm-clock based on gTSC.
Right. You restore the gTSC using its *offset* from the host TSC which
hasn't stopped counting on the same host. Then use KVM_SET_CLOCK_GUEST
to restore the kvmclock in terms of the gTSC. And you have an
absolutely cycle-perfect migration.
> For the live migration scenario, the current QEMU implementation not only fails
> to account for downtime, but also has a drift issue. That is what I would like
> to address in QEMU.
Again, restore the gTSC as accurately as possible. Probably by working
out for *yourself* the relationships of the source and destination host
TSCs to real time, and then reconstituting on the destination using TSC
offset just as for live migration.
And then use KVM_SET_CLOCK_GUEST too.
That's what I attempted to document in
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
and should probably revive.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19 22:00 UTC (permalink / raw)
To: Sasha Levin
Cc: Daniel Borkmann, linux-kernel, linux-doc, linux-kselftest, bpf,
live-patching, Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
Anthony Iliopoulos, Michal Hocko, Jiri Olsa, John Fastabend,
Christian Brauner
In-Reply-To: <agzAwjKhOhuANz_P@laps>
On Tue, May 19, 2026 at 12:57 PM Sasha Levin <sashal@kernel.org> wrote:
[...]
> >Fully agree with Song here that there is no clear boundary, and that the
> >killswitch could lead to arbitrary, hard to debug breakage if applied to
> >the wrong function.. introducing worse bugs than the one being mitigated
> >or even /short-circuit LSM enforcement/ (engage security_file_open 0,
> >engage cap_capable 0, engage apparmor_* etc).
>
> This is similar to livepatch, right? Do we need guardrails there too?
livepatch has the same guardrails as other kernel modules:
CONFIG_MODULE_SIG, CONFIG_MODULE_SIG_FORCE, etc.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-19 21:58 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <agzR2kaJsNa8X9lF@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 6861 bytes --]
On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> On Tue, May 19, 2026 at 03:13:30PM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 15:53 +0200, Paolo Bonzini wrote:
> > > On Tue, May 19, 2026 at 3:00 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > > Or some guest configurations which have only ever been tested under KVM
> > > > could have a bug where they *rely* on the registers not being writable,
> > > > and write values which are inconsistent with the rest of their
> > > > configuration. Which breaks the moment those registers become writable.
> > >
> > > Yeah, just having guests that worked by utter chance - but you still
> > > don't want to break them - is the case that is most likely. Crappy
> > > code that runs only under emulation/virtualization appears with
> > > probability 1 over time.
> > >
> > > Is this likely in this specific case---probably not, honestly.
> > > Christoffer's patch dates back to 2018 (commit d53c2c29ae0d); *back
> > > then* KVM/Arm was a lot less mature, and people developing for Arm on
> > > vanilla upstream kernels have moved on from Linux 4.19.
> >
> > It's not really 2018 though. EC2 is still running kernels with that
> > older commit reverted because of the breaking change that it
> > introduced.
> >
> > So the behaviour corresponding to GICD_IIDR.implementation_rev=1 is
> > still current for *millions* of guests.
> >
> > I'm now finding that revert in our tree during a *later* kernel upgrade
> > and trying to eliminate it.
>
> Still, as far as upstream is concerned this is damn near a decade old.
> Decisions that you or your peers made in the downstream doesn't change
> that.
>
> > And sure, I have given the engineers responsible for that a very hard
> > stare and suggested that they should have fixed it 'properly' in the
> > first place with a patch like the one we're discussing right now.
> >
> > And they're looking at this thread and saying "haha no, if fixing
> > things properly for Arm is this hard then we'll stick with the crappy
> > approach".
>
> The appropriate time to ask for accomodation was a *very* long time ago.
>
> And in the absence of clear evidence of a guest depending on the broken
> IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> series are any different from the multitude of bug fixes that we take
> every single release cycle. It is an unfortunate bug and I concur with
> Marc that it doesn't seem like the sort of thing a guest could rely
> upon.
I find this concerning, because I've already explained this.
There is a very real possibility of guests simply not *noticing* that
they had bugs in this area, as it didn't *matter* what they wrote to
these registers since it never worked.
There is an even larger possibility of guests having worked around the
original issue by *detecting* whether the registers were actually
writable before choosing to use the alternative groups. And if such a
guest launches on a new kernel and then needs to be rolled back to an
older kernel, that will also break.
> Because it is very much a bug fix, it should've happened without a
> change to the revision number.
No. Changing the revision number in conjunction with the guest-visible
behaviour change is *absolutely* the right thing to do.
> Now, the handling of GICD_IIDR itself is a separate issue. By my count,
> the series broke UAPI on three separate occasions. Before b489edc36169
> IIDR was RAZ/WI from userspace. And of course dd6251e463d3 and d53c2c29ae0d
> changed the revision with no way of restoring the old value.
>
> And really, IIDR should've *never* been used as a buy in for new UAPI
> because it unnecessarily becomes guest visible. 49a1a2c70a7f ("KVM: arm64:
> vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision") is
> a much better example for IIDR going forward as it gates *guest-side*
> behavior.
Yes, 49a1a2c70a7f is the exemplar. The guest-visible behaviour changes,
so we get a new IIDR revision and the ability to preserve the previous
behaviour by setting IIDR to the old value. That is exactly how it
should always be done.
> > I do not want them to be right. I don't think any of us want that.
> >
> > > I would still lean towards accepting the code considering the limited
> > > complexity of the addition (in fact I like it more now that it uses
> > > IIDR instead of v2_groups_user_writable, but that's taste).
> >
> > I'm absolutely prepared to have a separate technical discussion about
> > the v2_groups_user_writable thing for GICv2, which is the second part
> > of that series.
> >
> > It seems like the right thing to do, and as far as I can tell, this
> > code *never* worked with QEMU. But I'm not sure who even cares about
> > GICv2 any more. I couldn't find hardware and I had to test the whole
> > thing inside qemu-tcg.
> >
> > But the 'IIDR defaults to 3 but the *behaviour* doesn't match until you
> > explicitly *set* it to 3' thing seemed so *egregiously* wrong to me,
> > that I fixed it anyway.
>
> Wrong or not, this behavior is documented unambiguously. From the VGICv2
> UAPI documentation:
>
> """
> Userspace should set GICD_IIDR before setting any other registers (both
> KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> to the interrupt group registers (GICD_IGROUPR) are ignored.
> """
>
> I'm not inclined to change that.
That'll all very well... but as far as I can tell, QEMU *doesn't* set
GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
write the registers, but userspace can't. So it looks like it'll work
except migration will fail. Am I missing something?
But honestly, I don't care one iota about GICv2; I was only trying to
do the cleanup while I was there. Feel free to drop that part entirely.
> As a way out of this whole mess, can we
> instead:
>
> - Allow userspace to set IIDR.Revision to 1
>
> - Drop any bug emulation from the handling of IGROUPR registers
It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
one-liner that actually implements the corresponding behaviour change
in the IGROUPR registers. And as explained at least twice now, it's the
behaviour change that's *important* here.
The fact that it's a long-standing bug in KVM which downstream has been
working around for a long time doesn't matter. The unconditional
behavioural change *is* a bug and we should fix it.
> - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> the VMM has written to IIDR and the revision >= 2
That already *is* a special case, right? And you'd rather leave it as it is?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-05-19 21:23 UTC (permalink / raw)
To: David Woodhouse, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <b9980333f3a310bf05e170e79c40cb2f46485caf.camel@infradead.org>
On 2026-05-19 12:50 AM, David Woodhouse wrote:
> On Mon, 2026-05-18 at 17:57 -0700, Dongli Zhang wrote:
>> On 2026-05-18 1:48 AM, David Woodhouse wrote:
>>> ...
>>
>> I have fixed the Thunderbird configuration. Does it look better to you?
>
> The date is certainly better, thank you. But although I *was* up late
> that night frowning at clocks, I didn't think I was up *quite* as late
> (almost 2am) as it suggests.
>
> But I suspect that getting *that* right is beyond the limit of
> Thunderbird's configurability.
>
> Thanks :)
Let me continue exploring how to add a timezone, such as "17:57 -0700".
>
>> I really appreciate guidelines like the ones below.
>>
>> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org
>>
>> Assuming I am a user of the new API, I feel confused about whether the goal is
>> to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is
>> meant to supplement the former.
>
> The issue is that KVM_SET_CLOCK_GUEST can only be used in 'masterclock'
> mode, when the TSC is reliable and the guest TSCs are all in sync.
>
> Which ought to be *all* of the time, on modern hardware and sane
> configurations. And in this series, I don't even let the *guest* screw
> that over by setting different TSC offsets on different vCPUs any more
> (we stay in masterclock mode in that case now). But the VMM can cause
> its guest to come out of masterclock mode, by setting different TSC
> *speeds* on different vCPUs.
>
> So there remain some pathological cases where the kvmclock actually
> still has a justification to exist, and those are the cases where it
> needs to be set in its own right as a function of host time
> (KVM_SET_CLOCK), not purely as a function of the guest TSC
> (KVM_SET_CLOCK_GUEST).
I think I now understand why I feel like I am always asking weird questions. I
have been thinking about how to account for downtime, so I see
KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:
1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
share the same kvm_read_l1_tsc(v, ka->master_cycle_now).
2. Migrate the gTSC to the target VM however people want (either ablolute value
or offset value). (Optional) Account for downtime in gTSC however people want,
even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
That is why you think KVM_SET_CLOCK is no longer required if we have
KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
KVM_CLOCK_REALTIME.
It it isn't required to account any downtime for gTSC or if there is another way
to do so, only KVM_SET_CLOCK_GUEST is enough.
>
>
>>
>> If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, I
>> would appreciate it if the API could carry more data in addition to struct
>> pvclock_vcpu_time_info.
>>
>> +#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
>> +#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)
>>
>>
[snip]
>>
>> Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
>> pending CLOCK_UPDATE.
>>
>> In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
>> pvclock_vcpu_time_info?
>>
>
> I'm not sure I understand that scenario.
>
> MASTERCLOCK_UPDATE means we have to actually recalculate the master
> clock (which really *should* be rare, now!). And then any time we do
> that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> the new information. Which is why kvm_end_pvclock_update() does exactly
> that.
>
> So your "MASTERCLOCK_UPDATE is pending and there is no pending
> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> pending, then there *will* be a CLOCK_UPDATE pending.
Suppose the VM is stopped and the master clock is active.
Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
KVM_REQ_CLOCK_UPDATE.
Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
I do not think this scenario is important. I am just curious about the expected
way to implement similar code in the future :)
>
>
>>>>
>>>> Would it be helpful to validate that the delta is within a reasonable range,
>>>> e.g. that the drift can never be more than five minutes (forward or backward)?
>>>
>>> If a guest has been running for months on a previous host and is
>>> migrated to a new host, don't we expect that the KVM clock of the new
>>> VM on the new host is tweaked from its default near-zero after
>>> creation, to some large amount?
>>>
>>
>> Regarding live migration, my own investigation does not show a proportional
>> relationship between VM uptime and the amount of drift.
>
> You're comparing the VM on the source host, with the VM on the
> destination post-migration.
Apologies for making it confusing. I was just trying to explain why I think the
kvm-clock drift will not be large.
We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
the time interval between two vCPU hotplug events, the larger the drift.
For live migration (with QEMU), I provided the equation to show that the drift
will not be large, because it is determined by something else rather than by how
long the VM has been running on the source server.
For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
that has been running for three months, the drift will be relatively larger.
For QEMU live migration, migrating a guest VM that has been running on the
source host for *three months* versus one that has been running for *one day*
will not cause much difference in kvm-clock drift.
>
> Perhaps I misunderstood, but I thought your suggested validation of a
> 'reasonable range' would also apply when adjusting the kvmclock of the
> nascent VM on the destination host, from "newly created" to "has been
> running for months" while migrating the state of the actual guest onto
> a clean new slate.
>
>> Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the
>> amount of drift does not depend on how long the VM has been running before live
>> migration.
>>
>> Instead, it depends on the delta between when we call MSR_IA32_TSC and
>> KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK.
>>
>> The guest TSC stops at P1 and resumes at P3.
>> The kvmclock stops at P2 and resumes at P4.
>>
>> We expect P1 == P2 and P3 == P4.
>>
>> On source host.
>>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
>
> Here's where it all starts going wrong. Line 1.
>
> Any API which lets you get a single time value in isolation, and thus
> which is already out of date by the time the system call even returns,
> is fundamentally unsuitable for migration.
>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
>> ... ...
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_GET_CLOCK ===> P2
>>
>> On target host.
>>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
>
> At this point, the nasty hack in the kernel steps in, realises that the
> value you're setting on vCPU 2 is within a second or so of the value
> you had previously set on vCPU 1, and snaps it back to be precisely the
> same. To work around the fundamental brokenness of this method.
>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
>> ... ...
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_SET_CLOCK ====> P4
>>
>>
>> Here is my equiation to predict the drift.
>
> I'm sure you're right, but I didn't get that far when looking at this.
> I'd already thrown up in my mouth a little bit by line one.
>
> Here's my equation to predict the drift of a live update done correctly
> on the same host using the method I've now put in the documentation:
>
> 0.
For the ideal live update case (on the same host), there may be no need to
adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
be used to adjust kvm-clock based on gTSC.
For the live migration scenario, the current QEMU implementation not only fails
to account for downtime, but also has a drift issue. That is what I would like
to address in QEMU.
Thank you very much!
Dongli Zhang
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-19 21:10 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <9d0429ddbe4d8c6993e74237c4395697f80092d6.camel@infradead.org>
On Tue, May 19, 2026 at 03:13:30PM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 15:53 +0200, Paolo Bonzini wrote:
> > On Tue, May 19, 2026 at 3:00 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > Or some guest configurations which have only ever been tested under KVM
> > > could have a bug where they *rely* on the registers not being writable,
> > > and write values which are inconsistent with the rest of their
> > > configuration. Which breaks the moment those registers become writable.
> >
> > Yeah, just having guests that worked by utter chance - but you still
> > don't want to break them - is the case that is most likely. Crappy
> > code that runs only under emulation/virtualization appears with
> > probability 1 over time.
> >
> > Is this likely in this specific case---probably not, honestly.
> > Christoffer's patch dates back to 2018 (commit d53c2c29ae0d); *back
> > then* KVM/Arm was a lot less mature, and people developing for Arm on
> > vanilla upstream kernels have moved on from Linux 4.19.
>
> It's not really 2018 though. EC2 is still running kernels with that
> older commit reverted because of the breaking change that it
> introduced.
>
> So the behaviour corresponding to GICD_IIDR.implementation_rev=1 is
> still current for *millions* of guests.
>
> I'm now finding that revert in our tree during a *later* kernel upgrade
> and trying to eliminate it.
Still, as far as upstream is concerned this is damn near a decade old.
Decisions that you or your peers made in the downstream doesn't change
that.
> And sure, I have given the engineers responsible for that a very hard
> stare and suggested that they should have fixed it 'properly' in the
> first place with a patch like the one we're discussing right now.
>
> And they're looking at this thread and saying "haha no, if fixing
> things properly for Arm is this hard then we'll stick with the crappy
> approach".
The appropriate time to ask for accomodation was a *very* long time ago.
And in the absence of clear evidence of a guest depending on the broken
IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
series are any different from the multitude of bug fixes that we take
every single release cycle. It is an unfortunate bug and I concur with
Marc that it doesn't seem like the sort of thing a guest could rely
upon.
Because it is very much a bug fix, it should've happened without a
change to the revision number.
Now, the handling of GICD_IIDR itself is a separate issue. By my count,
the series broke UAPI on three separate occasions. Before b489edc36169
IIDR was RAZ/WI from userspace. And of course dd6251e463d3 and d53c2c29ae0d
changed the revision with no way of restoring the old value.
And really, IIDR should've *never* been used as a buy in for new UAPI
because it unnecessarily becomes guest visible. 49a1a2c70a7f ("KVM: arm64:
vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision") is
a much better example for IIDR going forward as it gates *guest-side*
behavior.
> I do not want them to be right. I don't think any of us want that.
>
> > I would still lean towards accepting the code considering the limited
> > complexity of the addition (in fact I like it more now that it uses
> > IIDR instead of v2_groups_user_writable, but that's taste).
>
> I'm absolutely prepared to have a separate technical discussion about
> the v2_groups_user_writable thing for GICv2, which is the second part
> of that series.
>
> It seems like the right thing to do, and as far as I can tell, this
> code *never* worked with QEMU. But I'm not sure who even cares about
> GICv2 any more. I couldn't find hardware and I had to test the whole
> thing inside qemu-tcg.
>
> But the 'IIDR defaults to 3 but the *behaviour* doesn't match until you
> explicitly *set* it to 3' thing seemed so *egregiously* wrong to me,
> that I fixed it anyway.
Wrong or not, this behavior is documented unambiguously. From the VGICv2
UAPI documentation:
"""
Userspace should set GICD_IIDR before setting any other registers (both
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
the expected behavior. Unless GICD_IIDR has been set from userspace, writes
to the interrupt group registers (GICD_IGROUPR) are ignored.
"""
I'm not inclined to change that. As a way out of this whole mess, can we
instead:
- Allow userspace to set IIDR.Revision to 1
- Drop any bug emulation from the handling of IGROUPR registers
- Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
the VMM has written to IIDR and the revision >= 2
Thanks,
Oliver
^ permalink raw reply
* Re: [PATCH v3 04/12] x86,fs/resctrl: Program PLZA through kmode arch hooks
From: Luck, Tony @ 2026-05-19 20:59 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, bp,
dave.hansen, skhan, x86, mingo, hpa, akpm, rdunlap,
pawan.kumar.gupta, feng.tang, dapeng1.mi, kees, elver, lirongqing,
paulmck, bhelgaas, seanjc, alexandre.chartre, yazen.ghannam,
peterz, chang.seok.bae, kim.phillips, xin, naveen,
thomas.lendacky, linux-doc, linux-kernel, eranian, peternewman,
sos-linux-ext-patches
In-Reply-To: <0cfd813e10072eefc8f4d84328e83bd9a6220ad4.1777591497.git.babu.moger@amd.com>
On Thu, Apr 30, 2026 at 06:24:49PM -0500, Babu Moger wrote:
> +void resctrl_arch_configure_kmode(cpumask_var_t cpu_mask, u32 closid, u32 rmid, bool enable)
> +{
> + union msr_pqr_plza_assoc plza = { 0 };
> +
> + plza.split.rmid = rmid;
> + plza.split.rmid_en = 1;
Shouldn't there be a parameter for the value of rmid_en?
User asked for global_assign_ctrl_assign_mon_per_cpu set it to '1'
User asked for global_assign_ctrl_inherit_mon_per_cpu set it to '0'
> + plza.split.closid = closid;
> + plza.split.closid_en = 1;
> + plza.split.plza_en = enable;
> +
> + on_each_cpu_mask(cpu_mask, resctrl_kmode_set_one_amd, &plza, 1);
> +}
-Tony
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19 20:50 UTC (permalink / raw)
To: Sasha Levin
Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <agzBd9mMt3Zf7j1j@laps>
On Tue, May 19, 2026 at 4:00 PM Sasha Levin <sashal@kernel.org> wrote:
> On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
> >On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
> >> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >> >From my perspective there are two different issues here: should
> >> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >> >to "kill" security related symbols. After all, are we okay with
> >> >killswitch killing capable() and friends?
> >>
> >> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> >> at which point that is root's problem.
> >
> >As I mentioned previously, there are cases where we can restrict
> >root's privileges today, but a functional killswitch would allow that
> >restriction to be bypassed. My last email to Song has an example with
> >SELinux.
>
> This would be handled by just disabling killswitch in those scenarios like how
> we do with lockdown, no?
One could presumably deny access to killswitch, but that pushes the
burden of choice onto the users/admins. Yes, that is the easy way to
solve thorny use case conflicts like this, but it would be nice if we
could do better for those who have to deal with this in the wild.
> >> >In my opinion, making killswitch an LSM is more of a procedural item
> >> >that deals with how we view a capability like killswitch. I
> >> >personally view killswitch as somewhat similar to Lockdown, which is
> >> >why I made the suggestion.
> >>
> >> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> >> "random" code paths from executing, and I don't think we can create LSM hooks
> >> at that granularity, no?
> >
> >I don't see any LSM hooks in this revision of killswitch, and as long
> >as it is based on a kprobes I can't imagine it would ever use any. As
> >I mentioned above, my killswitch-as-a-LSM comment is primarily about
> >killswitch filling a role very similar to Lockdown.
>
> My question was more about how to structure killswitch as an LSM. I want to be
> able to poke at pretty much any function in the kernel, rather than restrict
> access to a known list of functions.
Well, like I said in my last reply to you, I can't imagine a kprobes
based killswitch would need to worry about the LSM hooks. Structuring
killswitch as an LSM would be mostly a few lines of code to register
it as an LSM and that's about it. Benefits would be minor, and likely
a matter of opinion, it's mostly about how we view something like
killswitch in the kernel. If we view it as a security mechanism
similar to lockdown, then it makes sense as a LSM, if we view this as
a completely different thing then it can be whatever it wants to be.
--
paul-moore.com
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19 20:33 UTC (permalink / raw)
To: Song Liu
Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW7Rhdh62AoceQpsfm0+kVsvz8zq97fupm4mtBEyVTkkcg@mail.gmail.com>
On Tue, May 19, 2026 at 1:29 AM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 5:31 PM Sasha Levin <sashal@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> > >From my perspective there are two different issues here: should
> > >killswitch be a LSM, and should killswitch leverage kprobes to be able
> > >to "kill" security related symbols. After all, are we okay with
> > >killswitch killing capable() and friends?
> >
> > killswitch doesn't do it on it's own. It may be instructed by root to do that,
> > at which point that is root's problem.
> >
> > >In my opinion, making killswitch an LSM is more of a procedural item
> > >that deals with how we view a capability like killswitch. I
> > >personally view killswitch as somewhat similar to Lockdown, which is
> > >why I made the suggestion.
> >
> > Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> > "random" code paths from executing, and I don't think we can create LSM hooks
> > at that granularity, no?
>
> There are much fewer LSM hooks than ftrace-able (killswitch-able)
> functions. In this sense, killswitch is more granular.
I don't know if I would say it is necessarily more granular as its
ability to filter access is limited to a breakpoint set on a symbol,
but killswitch definitely has a larger quantity of control points.
> However, LSM
> hooks allow LSM policies to make different decisions for different
> arguments. In this sense, LSM hooks are more granular than
> killswitch, as killswitch can only set a fixed return value for each
> engaged function.
Yes, I think we agree here.
> With current LSM solutions, we can mitigate issues like Copy Fail
> without breaking other features of the system. In [1], Cloudflare
> shared how they mitigate Copy Fail with BPF LSM.
... and Android has been shown to not be vulnerable in the first place
due to their use of SELinux and a well crafted SELinux policy.
--
paul-moore.com
^ permalink raw reply
* [PATCH net-next V2 2/2] net/mlx5: implement max_sfs parameter
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Nikolay Aleksandrov
In-Reply-To: <20260519200436.353249-1-tariqt@nvidia.com>
From: Nikolay Aleksandrov <nikolay@nvidia.com>
Implement max_sfs generic parameter to allow users to control the total
light-weight NIC subfunctions that can be created using devlink instead
of external vendor tools. A value of 0 will effectively disable creation
of new subfunction devices. A warning is sent to user-space via extack
(returning extack without error code is interpreted as a warning by
user-space tools).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/mlx5.rst | 7 +-
.../mellanox/mlx5/core/lib/nv_param.c | 83 ++++++++++++++++++-
2 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4bba4d780a4a..f5e2dccafa5a 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -45,8 +45,13 @@ Parameters
- The range is between 1 and a device-specific max.
- Applies to each physical function (PF) independently, if the device
supports it. Otherwise, it applies symmetrically to all PFs.
+ * - ``max_sfs``
+ - permanent
+ - The range is between 0 and a device-specific max.
+ - Applies to each physical function (PF) independently.
-Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require FW reset to take effect
+Note: permanent parameters such as ``enable_sriov``, ``total_vfs`` and ``max_sfs``
+ require FW reset to take effect
.. code-block:: bash
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
index 19bb620b7436..eff3a67e4ca0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
@@ -68,7 +68,9 @@ struct mlx5_ifc_mnvda_reg_bits {
struct mlx5_ifc_nv_global_pci_conf_bits {
u8 sriov_valid[0x1];
- u8 reserved_at_1[0x10];
+ u8 reserved_at_1[0xa];
+ u8 per_pf_num_sf[0x1];
+ u8 reserved_at_c[0x5];
u8 per_pf_total_vf[0x1];
u8 reserved_at_12[0xe];
@@ -93,9 +95,11 @@ struct mlx5_ifc_nv_global_pci_cap_bits {
};
struct mlx5_ifc_nv_pf_pci_conf_bits {
- u8 reserved_at_0[0x9];
+ u8 log_sf_bar_size[0x8];
+ u8 pf_total_sf_en[0x1];
u8 pf_total_vf_en[0x1];
- u8 reserved_at_a[0x16];
+ u8 reserved_at_a[0x6];
+ u8 total_sf[0x10];
u8 reserved_at_20[0x20];
@@ -755,6 +759,76 @@ static int mlx5_devlink_total_vfs_validate(struct devlink *devlink, u32 id,
return 0;
}
+static int mlx5_devlink_max_sfs_get(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ ctx->val.vu32 = MLX5_GET(nv_pf_pci_conf, data, total_sf);
+
+ return 0;
+}
+
+static int mlx5_devlink_max_sfs_set(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_global_pci_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to read global PCI configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_global_pci_conf, data, per_pf_num_sf, !!ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change per_pf_num_sf global PCI configuration");
+ return err;
+ }
+
+ memset(mnvda, 0, sizeof(mnvda));
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_pf_pci_conf, data, log_sf_bar_size, ctx->val.vu32 ? 12 : 0);
+ MLX5_SET(nv_pf_pci_conf, data, pf_total_sf_en, !!ctx->val.vu32);
+ MLX5_SET(nv_pf_pci_conf, data, total_sf, ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change PF PCI configuration");
+ return err;
+ }
+ NL_SET_ERR_MSG(extack, "Modifying max_sfs requires a reboot");
+
+ return 0;
+}
+
static const struct devlink_param mlx5_nv_param_devlink_params[] = {
DEVLINK_PARAM_GENERIC(ENABLE_SRIOV, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
mlx5_devlink_enable_sriov_get,
@@ -763,6 +837,9 @@ static const struct devlink_param mlx5_nv_param_devlink_params[] = {
mlx5_devlink_total_vfs_get,
mlx5_devlink_total_vfs_set,
mlx5_devlink_total_vfs_validate),
+ DEVLINK_PARAM_GENERIC(MAX_SFS, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
+ mlx5_devlink_max_sfs_get,
+ mlx5_devlink_max_sfs_set, NULL),
DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CQE_COMPRESSION_TYPE,
"cqe_compress_type", DEVLINK_PARAM_TYPE_STRING,
BIT(DEVLINK_PARAM_CMODE_PERMANENT),
--
2.44.0
^ permalink raw reply related
* [PATCH net-next V2 1/2] devlink: add generic device max_sfs parameter
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Nikolay Aleksandrov
In-Reply-To: <20260519200436.353249-1-tariqt@nvidia.com>
From: Nikolay Aleksandrov <nikolay@nvidia.com>
Add a new generic devlink device parameter (max_sfs) to control if and
how many light-weight NIC subfunctions can be created. Subfunctions are
a light-weight network functions backed by an underlying PCI function.
Their lifecycle can already be managed by devlink, but currently users
cannot enable them in the device. They can be enabled/disabled only via
external vendor tools. This parameter allows subfunctions to be enabled
(>0) or disabled (0) via devlink. A subsequent patch will add support
for max_sfs to the mlx5 driver.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/devlink-params.rst | 6 ++++++
include/net/devlink.h | 4 ++++
net/devlink/param.c | 5 +++++
3 files changed, 15 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index ea17756dcda6..29b8a9246fb6 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -165,3 +165,9 @@ own name.
- u32
- Controls the maximum number of MAC address filters that can be assigned
to a Virtual Function (VF).
+ * - ``max_sfs``
+ - u32
+ - The maximum number of subfunctions which can be created on the device.
+ Modifying this parameter may require a device restart and PCI bus
+ rescanning because the BAR layout may change. A value of 0 disables
+ subfunction creation.
diff --git a/include/net/devlink.h b/include/net/devlink.h
index bcd31de1f890..4ec455cfe7a4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -546,6 +546,7 @@ enum devlink_param_generic_id {
DEVLINK_PARAM_GENERIC_ID_TOTAL_VFS,
DEVLINK_PARAM_GENERIC_ID_NUM_DOORBELLS,
DEVLINK_PARAM_GENERIC_ID_MAX_MAC_PER_VF,
+ DEVLINK_PARAM_GENERIC_ID_MAX_SFS,
/* add new param generic ids above here*/
__DEVLINK_PARAM_GENERIC_ID_MAX,
@@ -619,6 +620,9 @@ enum devlink_param_generic_id {
#define DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_NAME "max_mac_per_vf"
#define DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_TYPE DEVLINK_PARAM_TYPE_U32
+#define DEVLINK_PARAM_GENERIC_MAX_SFS_NAME "max_sfs"
+#define DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE DEVLINK_PARAM_TYPE_U32
+
#define DEVLINK_PARAM_GENERIC(_id, _cmodes, _get, _set, _validate) \
{ \
.id = DEVLINK_PARAM_GENERIC_ID_##_id, \
diff --git a/net/devlink/param.c b/net/devlink/param.c
index cf95268da5b0..523243e49d88 100644
--- a/net/devlink/param.c
+++ b/net/devlink/param.c
@@ -117,6 +117,11 @@ static const struct devlink_param devlink_param_generic[] = {
.name = DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_NAME,
.type = DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_TYPE,
},
+ {
+ .id = DEVLINK_PARAM_GENERIC_ID_MAX_SFS,
+ .name = DEVLINK_PARAM_GENERIC_MAX_SFS_NAME,
+ .type = DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE,
+ },
};
static int devlink_param_generic_verify(const struct devlink_param *param)
--
2.44.0
^ permalink raw reply related
* [PATCH net-next V2 0/2] devlink: add generic max_sfs parameter and mlx5 support
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko
Hi,
This series by Nikolay introduces a new generic devlink device
parameter, max_sfs, to control the number of light-weight NIC
subfunctions (SFs) that can be created on a device.
The first patch adds the generic devlink parameter and infrastructure
support.
The second patch implements support for the parameter in the mlx5
driver.
With this addition, users can enable or disable SF creation directly via
devlink, without relying on external vendor-specific tools.
Regards,
Tariq
V2:
- Add missing ` (Aleksandr Loktionov).
- Add review tag to patch 1.
V1:
https://lore.kernel.org/all/20260517112700.343575-1-tariqt@nvidia.com/
Nikolay Aleksandrov (2):
devlink: add generic device max_sfs parameter
net/mlx5: implement max_sfs parameter
.../networking/devlink/devlink-params.rst | 6 ++
Documentation/networking/devlink/mlx5.rst | 7 +-
.../mellanox/mlx5/core/lib/nv_param.c | 83 ++++++++++++++++++-
include/net/devlink.h | 4 +
net/devlink/param.c | 5 ++
5 files changed, 101 insertions(+), 4 deletions(-)
base-commit: 9bf93cb2e180a58d5984ba13daee95903ff4fc14
--
2.44.0
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 20:00 UTC (permalink / raw)
To: Paul Moore
Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTEs7rCaoPG7cWAzyVkN3ztdadHAq0g8mEy_MgCiCe=0g@mail.gmail.com>
On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
>On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
>> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
>> >From my perspective there are two different issues here: should
>> >killswitch be a LSM, and should killswitch leverage kprobes to be able
>> >to "kill" security related symbols. After all, are we okay with
>> >killswitch killing capable() and friends?
>>
>> killswitch doesn't do it on it's own. It may be instructed by root to do that,
>> at which point that is root's problem.
>
>As I mentioned previously, there are cases where we can restrict
>root's privileges today, but a functional killswitch would allow that
>restriction to be bypassed. My last email to Song has an example with
>SELinux.
This would be handled by just disabling killswitch in those scenarios like how
we do with lockdown, no?
>> >In my opinion, making killswitch an LSM is more of a procedural item
>> >that deals with how we view a capability like killswitch. I
>> >personally view killswitch as somewhat similar to Lockdown, which is
>> >why I made the suggestion.
>>
>> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
>> "random" code paths from executing, and I don't think we can create LSM hooks
>> at that granularity, no?
>
>I don't see any LSM hooks in this revision of killswitch, and as long
>as it is based on a kprobes I can't imagine it would ever use any. As
>I mentioned above, my killswitch-as-a-LSM comment is primarily about
>killswitch filling a role very similar to Lockdown.
My question was more about how to structure killswitch as an LSM. I want to be
able to poke at pretty much any function in the kernel, rather than restrict
access to a known list of functions.
>> >The use of kprobes, while an interesting idea, presents problems as
>> >allowing any kernel symbol to be killed introduces the potential for
>> >security regressions. As a reminder, some LSMs, as well as other
>> >kernel subsystems, have mechanisms in place to restrict root and/or
>> >enforce one-way configuration locks; while many people equate "root"
>> >with full control, in many cases today that is not strictly correct.
>>
>> killswitch "complies" with lockdown. Is there a different scenario which we
>> should be blocking?
>
>See the SELinux example I mentioned in my email to Song.
>
>> >Yes, kprobes have been around for some time, this is not a new
>> >problem, but killswitch makes it far more convenient and accessible to
>> >do dangerous things with kprobes. If killswitch makes it past the RFC
>> >stage without any significant changes to its kill mechanism, we may
>> >need to start considering more liberal usage of NOKPROBE_SYMBOL()
>> >which I think would be an unfortunate casualty.
>>
>> Why? If I don't really mind the security impact, I want to be able to have a
>> killswitch-like interface on my systems. If an attacker is in my systems,
>> killswitch is the least of my concerns I think.
>>
>> If you are security concious, just don't enable CONFIG_KILLSWITCH?
>
>Isn't the whole point of killswitch to have it enabled everywhere
>because you never know when you might want/need it?
Right. We have different usecases. If you want selinux/lockdown/etc and a
really crippled root, that should be an option. If you choose to allow
something like killswitch, it should be an option too.
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 19:57 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Song Liu, linux-kernel, linux-doc, linux-kselftest, bpf,
live-patching, Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
Anthony Iliopoulos, Michal Hocko, Jiri Olsa, John Fastabend,
Christian Brauner
In-Reply-To: <b342c38b-7323-4b72-a239-8a574d6bc36b@iogearbox.net>
On Tue, May 19, 2026 at 02:13:26PM +0200, Daniel Borkmann wrote:
>On 5/19/26 1:59 AM, Song Liu wrote:
>>On Mon, May 18, 2026 at 6:33 AM Sasha Levin <sashal@kernel.org> wrote:
>>>On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
>>>>On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>>>>>* fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
>>>>> most production kernels. Even where enabled, it only works on
>>>>> functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
>>>>> no help for a freshly-disclosed CVE. The debugfs UI is blocked by
>>>>> lockdown=integrity and the override is probabilistic.
>>>>>
>>>>>* BPF override (bpf_override_return) honors the same
>>>>> ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>>>> production kernels. Even where on, the operator interface is
>>>>> "load a verified BPF program," not a one-line write.
>>>>
>>>>If it is OK for killswitch to attach to any kernel functions, do we still
>>>>need ALLOW_ERROR_INJECTION() for fail_function and BPF
>>>>override? Shall we instead also allow fail_function and BPF override
>>>>to attach to any kernel functions?
>>>
>>>I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
>>>integrity/safety mechanism for both bpf and fault injection.
>>>
>>>It protects against a "developer or CI script doing legitimate fault injection
>>>accidentally panics the box" scenario, not an "attacker gets in" one.
>>
>>There really isn't a clear boundary between "security mechanism" and
>>"non-security mechanism". As we are making killswitch available
>>everywhere under root, users will soon learn to use it to do fault injection,
>>and potentially much more scary things. (Think about agents with sudo
>>access).
>
>Fully agree with Song here that there is no clear boundary, and that the
>killswitch could lead to arbitrary, hard to debug breakage if applied to
>the wrong function.. introducing worse bugs than the one being mitigated
>or even /short-circuit LSM enforcement/ (engage security_file_open 0,
>engage cap_capable 0, engage apparmor_* etc).
This is similar to livepatch, right? Do we need guardrails there too?
Or do we just trust root to do the right thing for it's systems without needing
to be it's babysitter?
>The ALLOW_ERROR_INJECTION() provides a curated white-list where you may
>return with an error without causing more severe damage (assuming the
>error handling code is right). The right thing would be to more widely
>apply ALLOW_ERROR_INJECTION() or to figure out a better way to safely
>enable the latter without explicit function annotation.
Sure, this would also work. How do you see this happening? Can we let a certain
user/pid/etc disable the allowlist if they choose to?
>Wrt BPF:
>
>>>>>* BPF override (bpf_override_return) honors the same
>>>>> ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>>>> production kernels. Even where on, the operator interface is
>>>>> "load a verified BPF program," not a one-line write.
>
>The claim that BPF itself is off in many production kernels is not really
>true, where did you get that from? All the major distros and cloud providers
>have BPF enabled these days, and even systemd ships BPF programs for
>custom service firewalling etc.
The world is a bit bigger than home distros and cloud providers, but sure - bpf
is enabled widely enough at this point.
How do you see this working with the allowlist?
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Nico Pache @ 2026-05-19 19:20 UTC (permalink / raw)
To: Wei Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260518125007.a4z3pw4r73uuwja4@master>
On Mon, May 18, 2026 at 6:50 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
> >The following series provides khugepaged with the capability to collapse
> >anonymous memory regions to mTHPs.
> >
> >To achieve this we generalize the khugepaged functions to no longer depend
> >on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >pages that are occupied (!none/zero). After the PMD scan is done, we use
> >the bitmap to find the optimal mTHP sizes for the PMD range. The
> >restriction on max_ptes_none is removed during the scan, to make sure we
> >account for the whole PMD range in the bitmap. When no mTHP size is
> >enabled, the legacy behavior of khugepaged is maintained.
> >
> >We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
> >(ie 511). If any other value is specified, the kernel will emit a warning
> >and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
> >but contains swapped out, or shared pages, we don't perform the collapse.
> >It is now also possible to collapse to mTHPs without requiring the PMD THP
> >size to be enabled. These limitations are to prevent collapse "creep"
> >behavior. This prevents constantly promoting mTHPs to the next available
> >size, which would occur because a collapse introduces more non-zero pages
> >that would satisfy the promotion condition on subsequent scans.
> >
> >Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio
> > for arbitrary orders.
> >Patch 3: Rework max_ptes_* handling into helper functions
> >Patch 4: Generalize __collapse_huge_page_* for mTHP support
> >Patch 5: Require collapse_huge_page to enter/exit with the lock dropped
> >Patch 6: Generalize collapse_huge_page for mTHP collapse
> >Patch 7: Skip collapsing mTHP to smaller orders
> >Patch 8-9: Add per-order mTHP statistics and tracepoints
> >Patch 10: Introduce collapse_allowable_orders helper function
> >Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
> >Patch 14: Documentation
> >
> >Testing:
> >- Built for x86_64, aarch64, ppc64le, and s390x
> >- ran all arches on test suites provided by the kernel-tests project
> >- internal testing suites: functional testing and performance testing
> >- selftests mm
> >- I created a test script that I used to push khugepaged to its limits
> > while monitoring a number of stats and tracepoints. The code is
> > available here[1] (Run in legacy mode for these changes and set mthp
> > sizes to inherit)
> > The summary from my testings was that there was no significant
> > regression noticed through this test. In some cases my changes had
> > better collapse latencies, and was able to scan more pages in the same
> > amount of time/work, but for the most part the results were consistent.
> >- redis testing. I did some testing with these changes along with my defer
> > changes (see followup [2] post for more details). We've decided to get
> > the mTHP changes merged first before attempting the defer series.
> >- some basic testing on 64k page size.
> >- lots of general use.
> >
>
> Two links are missing. I got them from previous version.
>
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
Oh whoops, ill make sure they are there in the followup
>
> And the test in [1] is a performance test. I am thinking whether we want a
> functional test in selftests.
It also works as a functional test in some regards. The reason i never
pursued self-tests is that I naively thought this was getting merged
6(?) months ago and at the time the selftests infrastructure didn't
support it well. Baolin included patches to clean that up in his shmem
mTHP support patches and added tests for both features. Let's repost
and re-merge this first; then, I will follow up in one or two weeks
regarding self-tests. I'm currently on PTO and only have time to
complete, test, and return the v18 changes to Andrew before they
create a huge merge headache and we miss yet another window.
>
> I did a quick try with following change and some hack.
Thanks Ill use that as a base!
>
> @@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
> ksft_test_result_report(exit_status, "%s\n", __func__);
> }
>
> +static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
> +{
> + struct thp_settings settings = *thp_current_settings();
> + void *p;
> + int i;
> +
> + /* Disable mthp on fault */
> + for (i = 0; i < NR_ORDERS; i++) {
> + settings.hugepages[i].enabled = THP_NEVER;
> + }
> + thp_push_settings(&settings);
> +
> + p = ops->setup_area(1);
> +
> + ops->fault(p, 0, hpage_pmd_size);
> +
> + /* Expect all order-0 folio after fault */
> + memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> + expected_orders[0] = hpage_pmd_nr;
> + if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> + kpageflags_fd, expected_orders,
> + (pmd_order + 1)))
> + ksft_exit_fail_msg("Unexpected huge page at fault\n");
> +
> + /* Enable mthp before collapse */
> + thp_pop_settings();
> + settings.hugepages[2].enabled = THP_ALWAYS;
> + thp_push_settings(&settings);
> +
> + c->collapse("Collapse fully populated PTE table with order 2", p, 1,
> + ops, true);
> +
> + /* Expect all order-2 folio after collapse */
> + memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> + expected_orders[2] = 1 << (pmd_order - 2);
> + if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> + kpageflags_fd, expected_orders,
> + (pmd_order + 1)))
> + ksft_exit_fail_msg("Unexpected page order\n");
> +
> + ops->cleanup_area(p, hpage_pmd_size);
> + thp_pop_settings();
> + ksft_test_result_report(exit_status, "%s\n", __func__);
> +}
> +
> static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
> {
> void *p;
>
> This leverage check_after_split_folio_orders() in split_huge_page_test.c to
> check folio order in PMD range.
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox