* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-19 20:00 UTC (permalink / raw)
To: Paul Moore
Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAHC9VhTEs7rCaoPG7cWAzyVkN3ztdadHAq0g8mEy_MgCiCe=0g@mail.gmail.com>
On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
>On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
>> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
>> >From my perspective there are two different issues here: should
>> >killswitch be a LSM, and should killswitch leverage kprobes to be able
>> >to "kill" security related symbols. After all, are we okay with
>> >killswitch killing capable() and friends?
>>
>> killswitch doesn't do it on it's own. It may be instructed by root to do that,
>> at which point that is root's problem.
>
>As I mentioned previously, there are cases where we can restrict
>root's privileges today, but a functional killswitch would allow that
>restriction to be bypassed. My last email to Song has an example with
>SELinux.
This would be handled by just disabling killswitch in those scenarios like how
we do with lockdown, no?
>> >In my opinion, making killswitch an LSM is more of a procedural item
>> >that deals with how we view a capability like killswitch. I
>> >personally view killswitch as somewhat similar to Lockdown, which is
>> >why I made the suggestion.
>>
>> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
>> "random" code paths from executing, and I don't think we can create LSM hooks
>> at that granularity, no?
>
>I don't see any LSM hooks in this revision of killswitch, and as long
>as it is based on a kprobes I can't imagine it would ever use any. As
>I mentioned above, my killswitch-as-a-LSM comment is primarily about
>killswitch filling a role very similar to Lockdown.
My question was more about how to structure killswitch as an LSM. I want to be
able to poke at pretty much any function in the kernel, rather than restrict
access to a known list of functions.
>> >The use of kprobes, while an interesting idea, presents problems as
>> >allowing any kernel symbol to be killed introduces the potential for
>> >security regressions. As a reminder, some LSMs, as well as other
>> >kernel subsystems, have mechanisms in place to restrict root and/or
>> >enforce one-way configuration locks; while many people equate "root"
>> >with full control, in many cases today that is not strictly correct.
>>
>> killswitch "complies" with lockdown. Is there a different scenario which we
>> should be blocking?
>
>See the SELinux example I mentioned in my email to Song.
>
>> >Yes, kprobes have been around for some time, this is not a new
>> >problem, but killswitch makes it far more convenient and accessible to
>> >do dangerous things with kprobes. If killswitch makes it past the RFC
>> >stage without any significant changes to its kill mechanism, we may
>> >need to start considering more liberal usage of NOKPROBE_SYMBOL()
>> >which I think would be an unfortunate casualty.
>>
>> Why? If I don't really mind the security impact, I want to be able to have a
>> killswitch-like interface on my systems. If an attacker is in my systems,
>> killswitch is the least of my concerns I think.
>>
>> If you are security concious, just don't enable CONFIG_KILLSWITCH?
>
>Isn't the whole point of killswitch to have it enabled everywhere
>because you never know when you might want/need it?
Right. We have different usecases. If you want selinux/lockdown/etc and a
really crippled root, that should be an option. If you choose to allow
something like killswitch, it should be an option too.
--
Thanks,
Sasha
^ permalink raw reply
* [PATCH net-next V2 0/2] devlink: add generic max_sfs parameter and mlx5 support
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko
Hi,
This series by Nikolay introduces a new generic devlink device
parameter, max_sfs, to control the number of light-weight NIC
subfunctions (SFs) that can be created on a device.
The first patch adds the generic devlink parameter and infrastructure
support.
The second patch implements support for the parameter in the mlx5
driver.
With this addition, users can enable or disable SF creation directly via
devlink, without relying on external vendor-specific tools.
Regards,
Tariq
V2:
- Add missing ` (Aleksandr Loktionov).
- Add review tag to patch 1.
V1:
https://lore.kernel.org/all/20260517112700.343575-1-tariqt@nvidia.com/
Nikolay Aleksandrov (2):
devlink: add generic device max_sfs parameter
net/mlx5: implement max_sfs parameter
.../networking/devlink/devlink-params.rst | 6 ++
Documentation/networking/devlink/mlx5.rst | 7 +-
.../mellanox/mlx5/core/lib/nv_param.c | 83 ++++++++++++++++++-
include/net/devlink.h | 4 +
net/devlink/param.c | 5 ++
5 files changed, 101 insertions(+), 4 deletions(-)
base-commit: 9bf93cb2e180a58d5984ba13daee95903ff4fc14
--
2.44.0
^ permalink raw reply
* [PATCH net-next V2 1/2] devlink: add generic device max_sfs parameter
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Nikolay Aleksandrov
In-Reply-To: <20260519200436.353249-1-tariqt@nvidia.com>
From: Nikolay Aleksandrov <nikolay@nvidia.com>
Add a new generic devlink device parameter (max_sfs) to control if and
how many light-weight NIC subfunctions can be created. Subfunctions are
a light-weight network functions backed by an underlying PCI function.
Their lifecycle can already be managed by devlink, but currently users
cannot enable them in the device. They can be enabled/disabled only via
external vendor tools. This parameter allows subfunctions to be enabled
(>0) or disabled (0) via devlink. A subsequent patch will add support
for max_sfs to the mlx5 driver.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/devlink-params.rst | 6 ++++++
include/net/devlink.h | 4 ++++
net/devlink/param.c | 5 +++++
3 files changed, 15 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index ea17756dcda6..29b8a9246fb6 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -165,3 +165,9 @@ own name.
- u32
- Controls the maximum number of MAC address filters that can be assigned
to a Virtual Function (VF).
+ * - ``max_sfs``
+ - u32
+ - The maximum number of subfunctions which can be created on the device.
+ Modifying this parameter may require a device restart and PCI bus
+ rescanning because the BAR layout may change. A value of 0 disables
+ subfunction creation.
diff --git a/include/net/devlink.h b/include/net/devlink.h
index bcd31de1f890..4ec455cfe7a4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -546,6 +546,7 @@ enum devlink_param_generic_id {
DEVLINK_PARAM_GENERIC_ID_TOTAL_VFS,
DEVLINK_PARAM_GENERIC_ID_NUM_DOORBELLS,
DEVLINK_PARAM_GENERIC_ID_MAX_MAC_PER_VF,
+ DEVLINK_PARAM_GENERIC_ID_MAX_SFS,
/* add new param generic ids above here*/
__DEVLINK_PARAM_GENERIC_ID_MAX,
@@ -619,6 +620,9 @@ enum devlink_param_generic_id {
#define DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_NAME "max_mac_per_vf"
#define DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_TYPE DEVLINK_PARAM_TYPE_U32
+#define DEVLINK_PARAM_GENERIC_MAX_SFS_NAME "max_sfs"
+#define DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE DEVLINK_PARAM_TYPE_U32
+
#define DEVLINK_PARAM_GENERIC(_id, _cmodes, _get, _set, _validate) \
{ \
.id = DEVLINK_PARAM_GENERIC_ID_##_id, \
diff --git a/net/devlink/param.c b/net/devlink/param.c
index cf95268da5b0..523243e49d88 100644
--- a/net/devlink/param.c
+++ b/net/devlink/param.c
@@ -117,6 +117,11 @@ static const struct devlink_param devlink_param_generic[] = {
.name = DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_NAME,
.type = DEVLINK_PARAM_GENERIC_MAX_MAC_PER_VF_TYPE,
},
+ {
+ .id = DEVLINK_PARAM_GENERIC_ID_MAX_SFS,
+ .name = DEVLINK_PARAM_GENERIC_MAX_SFS_NAME,
+ .type = DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE,
+ },
};
static int devlink_param_generic_verify(const struct devlink_param *param)
--
2.44.0
^ permalink raw reply related
* [PATCH net-next V2 2/2] net/mlx5: implement max_sfs parameter
From: Tariq Toukan @ 2026-05-19 20:04 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Jiri Pirko, Simon Horman, Jonathan Corbet, Shuah Khan,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Vlad Dumitrescu, Aleksandr Loktionov, Daniel Zahka, David Ahern,
Nikolay Aleksandrov, netdev, linux-doc, linux-kernel, linux-rdma,
Gal Pressman, Dragos Tatulea, Jiri Pirko, Nikolay Aleksandrov
In-Reply-To: <20260519200436.353249-1-tariqt@nvidia.com>
From: Nikolay Aleksandrov <nikolay@nvidia.com>
Implement max_sfs generic parameter to allow users to control the total
light-weight NIC subfunctions that can be created using devlink instead
of external vendor tools. A value of 0 will effectively disable creation
of new subfunction devices. A warning is sent to user-space via extack
(returning extack without error code is interpreted as a warning by
user-space tools).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/mlx5.rst | 7 +-
.../mellanox/mlx5/core/lib/nv_param.c | 83 ++++++++++++++++++-
2 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4bba4d780a4a..f5e2dccafa5a 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -45,8 +45,13 @@ Parameters
- The range is between 1 and a device-specific max.
- Applies to each physical function (PF) independently, if the device
supports it. Otherwise, it applies symmetrically to all PFs.
+ * - ``max_sfs``
+ - permanent
+ - The range is between 0 and a device-specific max.
+ - Applies to each physical function (PF) independently.
-Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require FW reset to take effect
+Note: permanent parameters such as ``enable_sriov``, ``total_vfs`` and ``max_sfs``
+ require FW reset to take effect
.. code-block:: bash
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
index 19bb620b7436..eff3a67e4ca0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
@@ -68,7 +68,9 @@ struct mlx5_ifc_mnvda_reg_bits {
struct mlx5_ifc_nv_global_pci_conf_bits {
u8 sriov_valid[0x1];
- u8 reserved_at_1[0x10];
+ u8 reserved_at_1[0xa];
+ u8 per_pf_num_sf[0x1];
+ u8 reserved_at_c[0x5];
u8 per_pf_total_vf[0x1];
u8 reserved_at_12[0xe];
@@ -93,9 +95,11 @@ struct mlx5_ifc_nv_global_pci_cap_bits {
};
struct mlx5_ifc_nv_pf_pci_conf_bits {
- u8 reserved_at_0[0x9];
+ u8 log_sf_bar_size[0x8];
+ u8 pf_total_sf_en[0x1];
u8 pf_total_vf_en[0x1];
- u8 reserved_at_a[0x16];
+ u8 reserved_at_a[0x6];
+ u8 total_sf[0x10];
u8 reserved_at_20[0x20];
@@ -755,6 +759,76 @@ static int mlx5_devlink_total_vfs_validate(struct devlink *devlink, u32 id,
return 0;
}
+static int mlx5_devlink_max_sfs_get(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ ctx->val.vu32 = MLX5_GET(nv_pf_pci_conf, data, total_sf);
+
+ return 0;
+}
+
+static int mlx5_devlink_max_sfs_set(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_global_pci_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to read global PCI configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_global_pci_conf, data, per_pf_num_sf, !!ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change per_pf_num_sf global PCI configuration");
+ return err;
+ }
+
+ memset(mnvda, 0, sizeof(mnvda));
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_pf_pci_conf, data, log_sf_bar_size, ctx->val.vu32 ? 12 : 0);
+ MLX5_SET(nv_pf_pci_conf, data, pf_total_sf_en, !!ctx->val.vu32);
+ MLX5_SET(nv_pf_pci_conf, data, total_sf, ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change PF PCI configuration");
+ return err;
+ }
+ NL_SET_ERR_MSG(extack, "Modifying max_sfs requires a reboot");
+
+ return 0;
+}
+
static const struct devlink_param mlx5_nv_param_devlink_params[] = {
DEVLINK_PARAM_GENERIC(ENABLE_SRIOV, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
mlx5_devlink_enable_sriov_get,
@@ -763,6 +837,9 @@ static const struct devlink_param mlx5_nv_param_devlink_params[] = {
mlx5_devlink_total_vfs_get,
mlx5_devlink_total_vfs_set,
mlx5_devlink_total_vfs_validate),
+ DEVLINK_PARAM_GENERIC(MAX_SFS, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
+ mlx5_devlink_max_sfs_get,
+ mlx5_devlink_max_sfs_set, NULL),
DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CQE_COMPRESSION_TYPE,
"cqe_compress_type", DEVLINK_PARAM_TYPE_STRING,
BIT(DEVLINK_PARAM_CMODE_PERMANENT),
--
2.44.0
^ permalink raw reply related
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19 20:33 UTC (permalink / raw)
To: Song Liu
Cc: Sasha Levin, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <CAPhsuW7Rhdh62AoceQpsfm0+kVsvz8zq97fupm4mtBEyVTkkcg@mail.gmail.com>
On Tue, May 19, 2026 at 1:29 AM Song Liu <song@kernel.org> wrote:
> On Mon, May 18, 2026 at 5:31 PM Sasha Levin <sashal@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> > >From my perspective there are two different issues here: should
> > >killswitch be a LSM, and should killswitch leverage kprobes to be able
> > >to "kill" security related symbols. After all, are we okay with
> > >killswitch killing capable() and friends?
> >
> > killswitch doesn't do it on it's own. It may be instructed by root to do that,
> > at which point that is root's problem.
> >
> > >In my opinion, making killswitch an LSM is more of a procedural item
> > >that deals with how we view a capability like killswitch. I
> > >personally view killswitch as somewhat similar to Lockdown, which is
> > >why I made the suggestion.
> >
> > Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> > "random" code paths from executing, and I don't think we can create LSM hooks
> > at that granularity, no?
>
> There are much fewer LSM hooks than ftrace-able (killswitch-able)
> functions. In this sense, killswitch is more granular.
I don't know if I would say it is necessarily more granular as its
ability to filter access is limited to a breakpoint set on a symbol,
but killswitch definitely has a larger quantity of control points.
> However, LSM
> hooks allow LSM policies to make different decisions for different
> arguments. In this sense, LSM hooks are more granular than
> killswitch, as killswitch can only set a fixed return value for each
> engaged function.
Yes, I think we agree here.
> With current LSM solutions, we can mitigate issues like Copy Fail
> without breaking other features of the system. In [1], Cloudflare
> shared how they mitigate Copy Fail with BPF LSM.
... and Android has been shown to not be vulnerable in the first place
due to their use of SELinux and a well crafted SELinux policy.
--
paul-moore.com
^ permalink raw reply
* Re: [PATCH] killswitch: add per-function short-circuit mitigation primitive
From: Paul Moore @ 2026-05-19 20:50 UTC (permalink / raw)
To: Sasha Levin
Cc: Song Liu, corbet, akpm, skhan, linux-doc, linux-kernel,
linux-kselftest, gregkh, linux-security-module
In-Reply-To: <agzBd9mMt3Zf7j1j@laps>
On Tue, May 19, 2026 at 4:00 PM Sasha Levin <sashal@kernel.org> wrote:
> On Mon, May 18, 2026 at 11:08:38PM -0400, Paul Moore wrote:
> >On Mon, May 18, 2026 at 8:31 PM Sasha Levin <sashal@kernel.org> wrote:
> >> On Mon, May 18, 2026 at 05:29:32PM -0400, Paul Moore wrote:
> >> >From my perspective there are two different issues here: should
> >> >killswitch be a LSM, and should killswitch leverage kprobes to be able
> >> >to "kill" security related symbols. After all, are we okay with
> >> >killswitch killing capable() and friends?
> >>
> >> killswitch doesn't do it on it's own. It may be instructed by root to do that,
> >> at which point that is root's problem.
> >
> >As I mentioned previously, there are cases where we can restrict
> >root's privileges today, but a functional killswitch would allow that
> >restriction to be bypassed. My last email to Song has an example with
> >SELinux.
>
> This would be handled by just disabling killswitch in those scenarios like how
> we do with lockdown, no?
One could presumably deny access to killswitch, but that pushes the
burden of choice onto the users/admins. Yes, that is the easy way to
solve thorny use case conflicts like this, but it would be nice if we
could do better for those who have to deal with this in the wild.
> >> >In my opinion, making killswitch an LSM is more of a procedural item
> >> >that deals with how we view a capability like killswitch. I
> >> >personally view killswitch as somewhat similar to Lockdown, which is
> >> >why I made the suggestion.
> >>
> >> Maybe I'm not all that familiar with LSMs, but we would need to be able to stop
> >> "random" code paths from executing, and I don't think we can create LSM hooks
> >> at that granularity, no?
> >
> >I don't see any LSM hooks in this revision of killswitch, and as long
> >as it is based on a kprobes I can't imagine it would ever use any. As
> >I mentioned above, my killswitch-as-a-LSM comment is primarily about
> >killswitch filling a role very similar to Lockdown.
>
> My question was more about how to structure killswitch as an LSM. I want to be
> able to poke at pretty much any function in the kernel, rather than restrict
> access to a known list of functions.
Well, like I said in my last reply to you, I can't imagine a kprobes
based killswitch would need to worry about the LSM hooks. Structuring
killswitch as an LSM would be mostly a few lines of code to register
it as an LSM and that's about it. Benefits would be minor, and likely
a matter of opinion, it's mostly about how we view something like
killswitch in the kernel. If we view it as a security mechanism
similar to lockdown, then it makes sense as a LSM, if we view this as
a completely different thing then it can be whatever it wants to be.
--
paul-moore.com
^ permalink raw reply
* Re: [PATCH v3 04/12] x86,fs/resctrl: Program PLZA through kmode arch hooks
From: Luck, Tony @ 2026-05-19 20:59 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, bp,
dave.hansen, skhan, x86, mingo, hpa, akpm, rdunlap,
pawan.kumar.gupta, feng.tang, dapeng1.mi, kees, elver, lirongqing,
paulmck, bhelgaas, seanjc, alexandre.chartre, yazen.ghannam,
peterz, chang.seok.bae, kim.phillips, xin, naveen,
thomas.lendacky, linux-doc, linux-kernel, eranian, peternewman,
sos-linux-ext-patches
In-Reply-To: <0cfd813e10072eefc8f4d84328e83bd9a6220ad4.1777591497.git.babu.moger@amd.com>
On Thu, Apr 30, 2026 at 06:24:49PM -0500, Babu Moger wrote:
> +void resctrl_arch_configure_kmode(cpumask_var_t cpu_mask, u32 closid, u32 rmid, bool enable)
> +{
> + union msr_pqr_plza_assoc plza = { 0 };
> +
> + plza.split.rmid = rmid;
> + plza.split.rmid_en = 1;
Shouldn't there be a parameter for the value of rmid_en?
User asked for global_assign_ctrl_assign_mon_per_cpu set it to '1'
User asked for global_assign_ctrl_inherit_mon_per_cpu set it to '0'
> + plza.split.closid = closid;
> + plza.split.closid_en = 1;
> + plza.split.plza_en = enable;
> +
> + on_each_cpu_mask(cpu_mask, resctrl_kmode_set_one_amd, &plza, 1);
> +}
-Tony
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-19 21:10 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <9d0429ddbe4d8c6993e74237c4395697f80092d6.camel@infradead.org>
On Tue, May 19, 2026 at 03:13:30PM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 15:53 +0200, Paolo Bonzini wrote:
> > On Tue, May 19, 2026 at 3:00 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > Or some guest configurations which have only ever been tested under KVM
> > > could have a bug where they *rely* on the registers not being writable,
> > > and write values which are inconsistent with the rest of their
> > > configuration. Which breaks the moment those registers become writable.
> >
> > Yeah, just having guests that worked by utter chance - but you still
> > don't want to break them - is the case that is most likely. Crappy
> > code that runs only under emulation/virtualization appears with
> > probability 1 over time.
> >
> > Is this likely in this specific case---probably not, honestly.
> > Christoffer's patch dates back to 2018 (commit d53c2c29ae0d); *back
> > then* KVM/Arm was a lot less mature, and people developing for Arm on
> > vanilla upstream kernels have moved on from Linux 4.19.
>
> It's not really 2018 though. EC2 is still running kernels with that
> older commit reverted because of the breaking change that it
> introduced.
>
> So the behaviour corresponding to GICD_IIDR.implementation_rev=1 is
> still current for *millions* of guests.
>
> I'm now finding that revert in our tree during a *later* kernel upgrade
> and trying to eliminate it.
Still, as far as upstream is concerned this is damn near a decade old.
Decisions that you or your peers made in the downstream doesn't change
that.
> And sure, I have given the engineers responsible for that a very hard
> stare and suggested that they should have fixed it 'properly' in the
> first place with a patch like the one we're discussing right now.
>
> And they're looking at this thread and saying "haha no, if fixing
> things properly for Arm is this hard then we'll stick with the crappy
> approach".
The appropriate time to ask for accomodation was a *very* long time ago.
And in the absence of clear evidence of a guest depending on the broken
IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
series are any different from the multitude of bug fixes that we take
every single release cycle. It is an unfortunate bug and I concur with
Marc that it doesn't seem like the sort of thing a guest could rely
upon.
Because it is very much a bug fix, it should've happened without a
change to the revision number.
Now, the handling of GICD_IIDR itself is a separate issue. By my count,
the series broke UAPI on three separate occasions. Before b489edc36169
IIDR was RAZ/WI from userspace. And of course dd6251e463d3 and d53c2c29ae0d
changed the revision with no way of restoring the old value.
And really, IIDR should've *never* been used as a buy in for new UAPI
because it unnecessarily becomes guest visible. 49a1a2c70a7f ("KVM: arm64:
vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision") is
a much better example for IIDR going forward as it gates *guest-side*
behavior.
> I do not want them to be right. I don't think any of us want that.
>
> > I would still lean towards accepting the code considering the limited
> > complexity of the addition (in fact I like it more now that it uses
> > IIDR instead of v2_groups_user_writable, but that's taste).
>
> I'm absolutely prepared to have a separate technical discussion about
> the v2_groups_user_writable thing for GICv2, which is the second part
> of that series.
>
> It seems like the right thing to do, and as far as I can tell, this
> code *never* worked with QEMU. But I'm not sure who even cares about
> GICv2 any more. I couldn't find hardware and I had to test the whole
> thing inside qemu-tcg.
>
> But the 'IIDR defaults to 3 but the *behaviour* doesn't match until you
> explicitly *set* it to 3' thing seemed so *egregiously* wrong to me,
> that I fixed it anyway.
Wrong or not, this behavior is documented unambiguously. From the VGICv2
UAPI documentation:
"""
Userspace should set GICD_IIDR before setting any other registers (both
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
the expected behavior. Unless GICD_IIDR has been set from userspace, writes
to the interrupt group registers (GICD_IGROUPR) are ignored.
"""
I'm not inclined to change that. As a way out of this whole mess, can we
instead:
- Allow userspace to set IIDR.Revision to 1
- Drop any bug emulation from the handling of IGROUPR registers
- Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
the VMM has written to IIDR and the revision >= 2
Thanks,
Oliver
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-05-19 21:23 UTC (permalink / raw)
To: David Woodhouse, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <b9980333f3a310bf05e170e79c40cb2f46485caf.camel@infradead.org>
On 2026-05-19 12:50 AM, David Woodhouse wrote:
> On Mon, 2026-05-18 at 17:57 -0700, Dongli Zhang wrote:
>> On 2026-05-18 1:48 AM, David Woodhouse wrote:
>>> ...
>>
>> I have fixed the Thunderbird configuration. Does it look better to you?
>
> The date is certainly better, thank you. But although I *was* up late
> that night frowning at clocks, I didn't think I was up *quite* as late
> (almost 2am) as it suggests.
>
> But I suspect that getting *that* right is beyond the limit of
> Thunderbird's configurability.
>
> Thanks :)
Let me continue exploring how to add a timezone, such as "17:57 -0700".
>
>> I really appreciate guidelines like the ones below.
>>
>> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org
>>
>> Assuming I am a user of the new API, I feel confused about whether the goal is
>> to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is
>> meant to supplement the former.
>
> The issue is that KVM_SET_CLOCK_GUEST can only be used in 'masterclock'
> mode, when the TSC is reliable and the guest TSCs are all in sync.
>
> Which ought to be *all* of the time, on modern hardware and sane
> configurations. And in this series, I don't even let the *guest* screw
> that over by setting different TSC offsets on different vCPUs any more
> (we stay in masterclock mode in that case now). But the VMM can cause
> its guest to come out of masterclock mode, by setting different TSC
> *speeds* on different vCPUs.
>
> So there remain some pathological cases where the kvmclock actually
> still has a justification to exist, and those are the cases where it
> needs to be set in its own right as a function of host time
> (KVM_SET_CLOCK), not purely as a function of the guest TSC
> (KVM_SET_CLOCK_GUEST).
I think I now understand why I feel like I am always asking weird questions. I
have been thinking about how to account for downtime, so I see
KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:
1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
share the same kvm_read_l1_tsc(v, ka->master_cycle_now).
2. Migrate the gTSC to the target VM however people want (either ablolute value
or offset value). (Optional) Account for downtime in gTSC however people want,
even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
That is why you think KVM_SET_CLOCK is no longer required if we have
KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
KVM_CLOCK_REALTIME.
It it isn't required to account any downtime for gTSC or if there is another way
to do so, only KVM_SET_CLOCK_GUEST is enough.
>
>
>>
>> If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, I
>> would appreciate it if the API could carry more data in addition to struct
>> pvclock_vcpu_time_info.
>>
>> +#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
>> +#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)
>>
>>
[snip]
>>
>> Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
>> pending CLOCK_UPDATE.
>>
>> In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
>> pvclock_vcpu_time_info?
>>
>
> I'm not sure I understand that scenario.
>
> MASTERCLOCK_UPDATE means we have to actually recalculate the master
> clock (which really *should* be rare, now!). And then any time we do
> that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> the new information. Which is why kvm_end_pvclock_update() does exactly
> that.
>
> So your "MASTERCLOCK_UPDATE is pending and there is no pending
> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> pending, then there *will* be a CLOCK_UPDATE pending.
Suppose the VM is stopped and the master clock is active.
Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
KVM_REQ_CLOCK_UPDATE.
Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
I do not think this scenario is important. I am just curious about the expected
way to implement similar code in the future :)
>
>
>>>>
>>>> Would it be helpful to validate that the delta is within a reasonable range,
>>>> e.g. that the drift can never be more than five minutes (forward or backward)?
>>>
>>> If a guest has been running for months on a previous host and is
>>> migrated to a new host, don't we expect that the KVM clock of the new
>>> VM on the new host is tweaked from its default near-zero after
>>> creation, to some large amount?
>>>
>>
>> Regarding live migration, my own investigation does not show a proportional
>> relationship between VM uptime and the amount of drift.
>
> You're comparing the VM on the source host, with the VM on the
> destination post-migration.
Apologies for making it confusing. I was just trying to explain why I think the
kvm-clock drift will not be large.
We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
the time interval between two vCPU hotplug events, the larger the drift.
For live migration (with QEMU), I provided the equation to show that the drift
will not be large, because it is determined by something else rather than by how
long the VM has been running on the source server.
For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
that has been running for three months, the drift will be relatively larger.
For QEMU live migration, migrating a guest VM that has been running on the
source host for *three months* versus one that has been running for *one day*
will not cause much difference in kvm-clock drift.
>
> Perhaps I misunderstood, but I thought your suggested validation of a
> 'reasonable range' would also apply when adjusting the kvmclock of the
> nascent VM on the destination host, from "newly created" to "has been
> running for months" while migrating the state of the actual guest onto
> a clean new slate.
>
>> Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the
>> amount of drift does not depend on how long the VM has been running before live
>> migration.
>>
>> Instead, it depends on the delta between when we call MSR_IA32_TSC and
>> KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK.
>>
>> The guest TSC stops at P1 and resumes at P3.
>> The kvmclock stops at P2 and resumes at P4.
>>
>> We expect P1 == P2 and P3 == P4.
>>
>> On source host.
>>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
>
> Here's where it all starts going wrong. Line 1.
>
> Any API which lets you get a single time value in isolation, and thus
> which is already out of date by the time the system call even returns,
> is fundamentally unsuitable for migration.
>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
>> ... ...
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_GET_CLOCK ===> P2
>>
>> On target host.
>>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
>
> At this point, the nasty hack in the kernel steps in, realises that the
> value you're setting on vCPU 2 is within a second or so of the value
> you had previously set on vCPU 1, and snaps it back to be precisely the
> same. To work around the fundamental brokenness of this method.
>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
>> ... ...
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_SET_CLOCK ====> P4
>>
>>
>> Here is my equiation to predict the drift.
>
> I'm sure you're right, but I didn't get that far when looking at this.
> I'd already thrown up in my mouth a little bit by line one.
>
> Here's my equation to predict the drift of a live update done correctly
> on the same host using the method I've now put in the documentation:
>
> 0.
For the ideal live update case (on the same host), there may be no need to
adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
be used to adjust kvm-clock based on gTSC.
For the live migration scenario, the current QEMU implementation not only fails
to account for downtime, but also has a drift issue. That is what I would like
to address in QEMU.
Thank you very much!
Dongli Zhang
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-19 21:58 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <agzR2kaJsNa8X9lF@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 6861 bytes --]
On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> On Tue, May 19, 2026 at 03:13:30PM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 15:53 +0200, Paolo Bonzini wrote:
> > > On Tue, May 19, 2026 at 3:00 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > > Or some guest configurations which have only ever been tested under KVM
> > > > could have a bug where they *rely* on the registers not being writable,
> > > > and write values which are inconsistent with the rest of their
> > > > configuration. Which breaks the moment those registers become writable.
> > >
> > > Yeah, just having guests that worked by utter chance - but you still
> > > don't want to break them - is the case that is most likely. Crappy
> > > code that runs only under emulation/virtualization appears with
> > > probability 1 over time.
> > >
> > > Is this likely in this specific case---probably not, honestly.
> > > Christoffer's patch dates back to 2018 (commit d53c2c29ae0d); *back
> > > then* KVM/Arm was a lot less mature, and people developing for Arm on
> > > vanilla upstream kernels have moved on from Linux 4.19.
> >
> > It's not really 2018 though. EC2 is still running kernels with that
> > older commit reverted because of the breaking change that it
> > introduced.
> >
> > So the behaviour corresponding to GICD_IIDR.implementation_rev=1 is
> > still current for *millions* of guests.
> >
> > I'm now finding that revert in our tree during a *later* kernel upgrade
> > and trying to eliminate it.
>
> Still, as far as upstream is concerned this is damn near a decade old.
> Decisions that you or your peers made in the downstream doesn't change
> that.
>
> > And sure, I have given the engineers responsible for that a very hard
> > stare and suggested that they should have fixed it 'properly' in the
> > first place with a patch like the one we're discussing right now.
> >
> > And they're looking at this thread and saying "haha no, if fixing
> > things properly for Arm is this hard then we'll stick with the crappy
> > approach".
>
> The appropriate time to ask for accomodation was a *very* long time ago.
>
> And in the absence of clear evidence of a guest depending on the broken
> IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> series are any different from the multitude of bug fixes that we take
> every single release cycle. It is an unfortunate bug and I concur with
> Marc that it doesn't seem like the sort of thing a guest could rely
> upon.
I find this concerning, because I've already explained this.
There is a very real possibility of guests simply not *noticing* that
they had bugs in this area, as it didn't *matter* what they wrote to
these registers since it never worked.
There is an even larger possibility of guests having worked around the
original issue by *detecting* whether the registers were actually
writable before choosing to use the alternative groups. And if such a
guest launches on a new kernel and then needs to be rolled back to an
older kernel, that will also break.
> Because it is very much a bug fix, it should've happened without a
> change to the revision number.
No. Changing the revision number in conjunction with the guest-visible
behaviour change is *absolutely* the right thing to do.
> Now, the handling of GICD_IIDR itself is a separate issue. By my count,
> the series broke UAPI on three separate occasions. Before b489edc36169
> IIDR was RAZ/WI from userspace. And of course dd6251e463d3 and d53c2c29ae0d
> changed the revision with no way of restoring the old value.
>
> And really, IIDR should've *never* been used as a buy in for new UAPI
> because it unnecessarily becomes guest visible. 49a1a2c70a7f ("KVM: arm64:
> vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision") is
> a much better example for IIDR going forward as it gates *guest-side*
> behavior.
Yes, 49a1a2c70a7f is the exemplar. The guest-visible behaviour changes,
so we get a new IIDR revision and the ability to preserve the previous
behaviour by setting IIDR to the old value. That is exactly how it
should always be done.
> > I do not want them to be right. I don't think any of us want that.
> >
> > > I would still lean towards accepting the code considering the limited
> > > complexity of the addition (in fact I like it more now that it uses
> > > IIDR instead of v2_groups_user_writable, but that's taste).
> >
> > I'm absolutely prepared to have a separate technical discussion about
> > the v2_groups_user_writable thing for GICv2, which is the second part
> > of that series.
> >
> > It seems like the right thing to do, and as far as I can tell, this
> > code *never* worked with QEMU. But I'm not sure who even cares about
> > GICv2 any more. I couldn't find hardware and I had to test the whole
> > thing inside qemu-tcg.
> >
> > But the 'IIDR defaults to 3 but the *behaviour* doesn't match until you
> > explicitly *set* it to 3' thing seemed so *egregiously* wrong to me,
> > that I fixed it anyway.
>
> Wrong or not, this behavior is documented unambiguously. From the VGICv2
> UAPI documentation:
>
> """
> Userspace should set GICD_IIDR before setting any other registers (both
> KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> to the interrupt group registers (GICD_IGROUPR) are ignored.
> """
>
> I'm not inclined to change that.
That'll all very well... but as far as I can tell, QEMU *doesn't* set
GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
write the registers, but userspace can't. So it looks like it'll work
except migration will fail. Am I missing something?
But honestly, I don't care one iota about GICv2; I was only trying to
do the cleanup while I was there. Feel free to drop that part entirely.
> As a way out of this whole mess, can we
> instead:
>
> - Allow userspace to set IIDR.Revision to 1
>
> - Drop any bug emulation from the handling of IGROUPR registers
It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
one-liner that actually implements the corresponding behaviour change
in the IGROUPR registers. And as explained at least twice now, it's the
behaviour change that's *important* here.
The fact that it's a long-standing bug in KVM which downstream has been
working around for a long time doesn't matter. The unconditional
behavioural change *is* a bug and we should fix it.
> - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> the VMM has written to IIDR and the revision >= 2
That already *is* a special case, right? And you'd rather leave it as it is?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Song Liu @ 2026-05-19 22:00 UTC (permalink / raw)
To: Sasha Levin
Cc: Daniel Borkmann, linux-kernel, linux-doc, linux-kselftest, bpf,
live-patching, Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
Anthony Iliopoulos, Michal Hocko, Jiri Olsa, John Fastabend,
Christian Brauner
In-Reply-To: <agzAwjKhOhuANz_P@laps>
On Tue, May 19, 2026 at 12:57 PM Sasha Levin <sashal@kernel.org> wrote:
[...]
> >Fully agree with Song here that there is no clear boundary, and that the
> >killswitch could lead to arbitrary, hard to debug breakage if applied to
> >the wrong function.. introducing worse bugs than the one being mitigated
> >or even /short-circuit LSM enforcement/ (engage security_file_open 0,
> >engage cap_capable 0, engage apparmor_* etc).
>
> This is similar to livepatch, right? Do we need guardrails there too?
livepatch has the same guardrails as other kernel modules:
CONFIG_MODULE_SIG, CONFIG_MODULE_SIG_FORCE, etc.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: David Woodhouse @ 2026-05-19 22:43 UTC (permalink / raw)
To: Dongli Zhang, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <aa68ed10-15da-4368-a986-6864843a3c44@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 7240 bytes --]
On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
> I think I now understand why I feel like I am always asking weird questions. I
> have been thinking about how to account for downtime, so I see
> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
I do not believe in "downtime". There is no such thing.
There is only "steal time".
A CPU may be off in the weeds — a vCPU suffering steal time, or even a
pCPU in SMM which is effectively the same thing — but time doesn't
stop, and neither does the TSC.
> Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:
>
> 1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
> share the same kvm_read_l1_tsc(v, ka->master_cycle_now).
Strictly, by the time we get to the end of my series, masterclock is
active *because* all the vCPUs are running at the same TSC rate (even
if the guest set them to different offsets). But OK.
> 2. Migrate the gTSC to the target VM however people want (either ablolute value
> or offset value). (Optional) Account for downtime in gTSC however people want,
> even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
>
> 3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
>
> That is why you think KVM_SET_CLOCK is no longer required if we have
> KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
> KVM_CLOCK_REALTIME.
If I recall correctly what we described in
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
I don't think we actually needed KVM_SET_CLOCK at all, did we?
We *abuse* KVM_GET_CLOCK to give us a tuple of {realtime, host TSC}
because there's actually no other way for *userspace* to get that. We
don't actually *care* about the KVM clock part.
We use the {realtime, host TSC} pair to reconstitute the guest TSC
values to correctly reflect the passing of time while the guest was in
the ether.
> It it isn't required to account any downtime for gTSC or if there is another way
> to do so, only KVM_SET_CLOCK_GUEST is enough.
Right. If you only want the guest to come back with the *same* values
in its TSC as before the migration, as if the TSC was *paused* during
the migration, then you can just restore those values and use
KVM_SET_CLOCK_GUEST. Assuming you are on modern hardware and have set
all vCPUs to the same rate (and are using this series so the *guest*
can't break masterclock for you, and you can trust the
KVM_SET_CLOCK_GUEST will work).
> >
> > > Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
> > > pending CLOCK_UPDATE.
> > >
> > > In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving
> > > pvclock_vcpu_time_info?
> > >
> >
> > I'm not sure I understand that scenario.
> >
> > MASTERCLOCK_UPDATE means we have to actually recalculate the master
> > clock (which really *should* be rare, now!). And then any time we do
> > that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> > the new information. Which is why kvm_end_pvclock_update() does exactly
> > that.
> >
> > So your "MASTERCLOCK_UPDATE is pending and there is no pending
> > CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> > pending, then there *will* be a CLOCK_UPDATE pending.
>
> Suppose the VM is stopped and the master clock is active.
I don't know what it means for a VM to be 'stopped'. Do you mean that
all vCPUs happen to be experiencing steal time at the present moment?
> Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
> may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
> for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
> kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
> KVM_REQ_CLOCK_UPDATE.
You say 'Unless'... do you mean 'Until'?
> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
>
> I do not think this scenario is important. I am just curious about the expected
> way to implement similar code in the future :)
I think that's working correctly. Until the master clock has *actually*
been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
disseminate the new information to its own pvclock?
> >
> >
> > > > >
> > > > > Would it be helpful to validate that the delta is within a reasonable range,
> > > > > e.g. that the drift can never be more than five minutes (forward or backward)?
> > > >
> > > > If a guest has been running for months on a previous host and is
> > > > migrated to a new host, don't we expect that the KVM clock of the new
> > > > VM on the new host is tweaked from its default near-zero after
> > > > creation, to some large amount?
> > > >
> > >
> > > Regarding live migration, my own investigation does not show a proportional
> > > relationship between VM uptime and the amount of drift.
> >
> > You're comparing the VM on the source host, with the VM on the
> > destination post-migration.
>
> Apologies for making it confusing. I was just trying to explain why I think the
> kvm-clock drift will not be large.
Sure, but I don't care. If we have a sane API, the drift should be zero
:)
> We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
> the time interval between two vCPU hotplug events, the larger the drift.
>
> For live migration (with QEMU), I provided the equation to show that the drift
> will not be large, because it is determined by something else rather than by how
> long the VM has been running on the source server.
>
>
> For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
> that has been running for three months, the drift will be relatively larger.
>
> For QEMU live migration, migrating a guest VM that has been running on the
> source host for *three months* versus one that has been running for *one day*
> will not cause much difference in kvm-clock drift.
Right.
> For the ideal live update case (on the same host), there may be no need to
> adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
> be used to adjust kvm-clock based on gTSC.
Right. You restore the gTSC using its *offset* from the host TSC which
hasn't stopped counting on the same host. Then use KVM_SET_CLOCK_GUEST
to restore the kvmclock in terms of the gTSC. And you have an
absolutely cycle-perfect migration.
> For the live migration scenario, the current QEMU implementation not only fails
> to account for downtime, but also has a drift issue. That is what I would like
> to address in QEMU.
Again, restore the gTSC as accurately as possible. Probably by working
out for *yourself* the relationships of the source and destination host
TSCs to real time, and then reconstituting on the destination using TSC
offset just as for live migration.
And then use KVM_SET_CLOCK_GUEST too.
That's what I attempted to document in
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
and should probably revive.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH RESEND bpf-next v10 2/8] bpf: clear list node owner and unlink before drop
From: Eduard Zingerman @ 2026-05-19 22:56 UTC (permalink / raw)
To: Kaitao Cheng
Cc: bpf, Alexei Starovoitov, linux-kernel, linux-doc, ast, memxor,
corbet, martin.lau, daniel, andrii, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, chengkaitao,
skhan, vmalik, linux-kselftest, martin.lau, clm, ihor.solodrai,
bot+bpf-ci
In-Reply-To: <d5961282-d41c-4e54-8ba2-cd08823a8c77@linux.dev>
On Mon, 2026-05-18 at 11:02 +0800, Kaitao Cheng wrote:
[...]
> > > > The patch does have a bug, however. To fix the issues we are seeing now,
> > > > I propose the additional changes below and would appreciate feedback.
> > > >
> > > > --- a/kernel/bpf/helpers.c
> > > > +++ b/kernel/bpf/helpers.c
> > > > @@ -2263,8 +2263,10 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > if (!head->next || list_empty(head))
> > > > goto unlock;
> > > > list_for_each_safe(pos, n, head) {
> > > > - WRITE_ONCE(container_of(pos,
> > > > - struct bpf_list_node_kern, list_head)->owner, NULL);
> > > > + struct bpf_list_node_kern *node;
> > > > +
> > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > + WRITE_ONCE(node->owner, BPF_PTR_POISON);
> > > > list_move_tail(pos, &drain);
> > > > }
> > > > unlock:
> > > > @@ -2272,8 +2274,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> > > > __bpf_spin_unlock_irqrestore(spin_lock);
> > > >
> > > > while (!list_empty(&drain)) {
> > > > + struct bpf_list_node_kern *node;
> > > > +
> > > > pos = drain.next;
> > > > + node = container_of(pos, struct bpf_list_node_kern, list_head);
> > > > list_del_init(pos);
> > > > + WRITE_ONCE(node->owner, NULL);
Is CPU allowed to reorder the stores in list_del_init() and WRITE_ONCE()?
If it is, I think there is a race here.
Thread #1:
enter bpf_list_head_free()
acquire H1 lock
list_move_tail(pos, &drain); // reordered
<-- ip here -->
WRITE_ONCE(node->owner, BPF_PTR_POISON); // reordered
Thread #2:
acquire H1 lock
n = bpf_refcount_acquire()
release H1 lock
acquire H2 lock
enter __bpf_list_add()
<-- ip here -->
cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
[...]
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Oliver Upton @ 2026-05-19 22:57 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <1243d375846c4f4e20c229a6f09300126188fc8b.camel@infradead.org>
On Tue, May 19, 2026 at 10:58:05PM +0100, David Woodhouse wrote:
> On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> > And in the absence of clear evidence of a guest depending on the broken
> > IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> > series are any different from the multitude of bug fixes that we take
> > every single release cycle. It is an unfortunate bug and I concur with
> > Marc that it doesn't seem like the sort of thing a guest could rely
> > upon.
>
> I find this concerning, because I've already explained this.
>
> There is a very real possibility of guests simply not *noticing* that
> they had bugs in this area, as it didn't *matter* what they wrote to
> these registers since it never worked.
>
> There is an even larger possibility of guests having worked around the
> original issue by *detecting* whether the registers were actually
> writable before choosing to use the alternative groups. And if such a
> guest launches on a new kernel and then needs to be rolled back to an
> older kernel, that will also break.
The onus is on you to substantiate this claim. I would imagine after
carrying the revert for so long that there must be at least one example
of such a guest?
What ifs and maybes do not meet the bar, in my opinion, for preserving
bug emulation in KVM. Of course there could be a little flexibility with
that but we need to have some way of discriminating between bug fixes
and genuine guest expectations around the behavior of virtual hardware.
> > Wrong or not, this behavior is documented unambiguously. From the VGICv2
> > UAPI documentation:
> >
> > """
> > Userspace should set GICD_IIDR before setting any other registers (both
> > KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> > the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> > to the interrupt group registers (GICD_IGROUPR) are ignored.
> > """
> >
> > I'm not inclined to change that.
>
> That'll all very well... but as far as I can tell, QEMU *doesn't* set
> GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
> write the registers, but userspace can't. So it looks like it'll work
> except migration will fail. Am I missing something?
That's exactly it, and why I said tying up UAPI opt-in with
guest-visible registers is a really bad idea.
> But honestly, I don't care one iota about GICv2; I was only trying to
> do the cleanup while I was there. Feel free to drop that part entirely.
>
> > As a way out of this whole mess, can we
> > instead:
> >
> > - Allow userspace to set IIDR.Revision to 1
> >
> > - Drop any bug emulation from the handling of IGROUPR registers
>
> It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
> one-liner that actually implements the corresponding behaviour change
> in the IGROUPR registers.
As I described earlier, this whole IIDR crap inarguably broke UAPI and
obviously normal guest behavior (i.e. reading the register). At minimum
we need to permit previously-valid values for IIDR, even if they carry
no implied behaviors.
> And as explained at least twice now, it's the
> behaviour change that's *important* here.
>
> The fact that it's a long-standing bug in KVM which downstream has been
> working around for a long time doesn't matter. The unconditional
> behavioural change *is* a bug and we should fix it.
That is the nature of a bug fix. If you can provide some concrete
evidence of a guest depending on the RAZ/WI behavior then I agree we
need to preserve the old behavior.
Otherwise I see this as a matter of principle in how we do bug fixes to
KVM. Even if upstream took the strictest possible stance towards behavior
changes we will invariably fail to account for some minutia.
> > - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> > the VMM has written to IIDR and the revision >= 2
>
> That already *is* a special case, right? And you'd rather leave it as it is?
Left as documented, yes. With the exception that revision == 1 writes
not be considered opt-in to restorable IGROUPR.
Thanks,
Oliver
^ permalink raw reply
* [regression] kernel-doc: on struct_group_tagged
From: Randy Dunlap @ 2026-05-19 23:02 UTC (permalink / raw)
To: Linux Documentation, Shuicheng Lin, Jonathan Corbet,
Mauro Carvalho Chehab
Hi Shuicheng Lin.
I have hit a problem on linux-next-2026051[89].
I bisected it and the bad commit is your recent patch
commit 46d9c16115cf (HEAD)
Author: Shuicheng Lin <shuicheng.lin@intel.com>
Date: Thu May 7 02:32:32 2026 +0000
scripts/kernel-doc: Detect mismatched inline member documentation tags
which I tested, but apparently not well enough.
Both "make htmldocs" and running
scripts/kernel-doc -none -Wall include/net/page_pool/types.h
give these warnings: [*]
Warning: include/net/page_pool/types.h:105 Excess struct member 'fast' description in 'page_pool_params'
Warning: include/net/page_pool/types.h:105 Excess struct member 'slow' description in 'page_pool_params'
due to the use of struct_group_tagged in that struct.
There are no warnings from this header file after I revert commit 46d9c16115cf.
Please look into this.
*: Both of these warnings are duplicated in kernel-doc logging, but that's a
different issue.
thanks.
--
~Randy
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-19 23:33 UTC (permalink / raw)
To: Oliver Upton
Cc: Paolo Bonzini, Marc Zyngier, Will Deacon, Jonathan Corbet,
Shuah Khan, kvm, Linux Doc Mailing List,
Kernel Mailing List, Linux, Sean Christopherson, Jim Mattson,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <agzq5kwzuJvd7Mh5@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 6032 bytes --]
On Tue, 2026-05-19 at 15:57 -0700, Oliver Upton wrote:
> On Tue, May 19, 2026 at 10:58:05PM +0100, David Woodhouse wrote:
> > On Tue, 2026-05-19 at 14:10 -0700, Oliver Upton wrote:
> > > And in the absence of clear evidence of a guest depending on the broken
> > > IGROUPR behavior, I don't see how the guest-side changes of Christoffer's
> > > series are any different from the multitude of bug fixes that we take
> > > every single release cycle. It is an unfortunate bug and I concur with
> > > Marc that it doesn't seem like the sort of thing a guest could rely
> > > upon.
> >
> > I find this concerning, because I've already explained this.
> >
> > There is a very real possibility of guests simply not *noticing* that
> > they had bugs in this area, as it didn't *matter* what they wrote to
> > these registers since it never worked.
> >
> > There is an even larger possibility of guests having worked around the
> > original issue by *detecting* whether the registers were actually
> > writable before choosing to use the alternative groups. And if such a
> > guest launches on a new kernel and then needs to be rolled back to an
> > older kernel, that will also break.
>
> The onus is on you to substantiate this claim. I would imagine after
> carrying the revert for so long that there must be at least one example
> of such a guest?
What? No. We have *avoided* having the bug, specifically so that we do
not find out the consequences of the bug.
> What ifs and maybes do not meet the bar, in my opinion, for preserving
> bug emulation in KVM. Of course there could be a little flexibility with
> that but we need to have some way of discriminating between bug fixes
> and genuine guest expectations around the behavior of virtual hardware.
I believe you have this completely backwards.
The expectation of KVM is that do not change guest visible behaviour if
there's any reasonable chance that it might cause problems.
A stable and mature platform doesn't get to play in its ivory tower and
randomly inflict breakage on guests because they "deserve it".
I've literally explained the potential failure modes, including the one
on rollback if a guest *does* change the group configuration and then
needs to be rolled back to the older kernel that doesn't support it.
And yes, "ifs and maybes" absolutely *are* the quality bar expected by
KVM because — again, as already explained more than once — as we
accumulate a bunch of such "unlikely" breakages in a fleet upgrade
from, say, 6.1 to 6.12, the likelihood of *one* of them actually
turning out to afflict *one* of the zoo of guest operating systems
approaches 1.
We don't get to just YOLO it.
> > > Wrong or not, this behavior is documented unambiguously. From the VGICv2
> > > UAPI documentation:
> > >
> > > """
> > > Userspace should set GICD_IIDR before setting any other registers (both
> > > KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
> > > the expected behavior. Unless GICD_IIDR has been set from userspace, writes
> > > to the interrupt group registers (GICD_IGROUPR) are ignored.
> > > """
> > >
> > > I'm not inclined to change that.
> >
> > That'll all very well... but as far as I can tell, QEMU *doesn't* set
> > GICD_IIDR, so it still gets the bizarre behaviour where the *guest* can
> > write the registers, but userspace can't. So it looks like it'll work
> > except migration will fail. Am I missing something?
>
> That's exactly it, and why I said tying up UAPI opt-in with
> guest-visible registers is a really bad idea.
>
> > But honestly, I don't care one iota about GICv2; I was only trying to
> > do the cleanup while I was there. Feel free to drop that part entirely.
> >
> > > As a way out of this whole mess, can we
> > > instead:
> > >
> > > - Allow userspace to set IIDR.Revision to 1
> > >
> > > - Drop any bug emulation from the handling of IGROUPR registers
> >
> > It doesn't make sense to allow setting IIDR.Revision to 1 *without* the
> > one-liner that actually implements the corresponding behaviour change
> > in the IGROUPR registers.
>
> As I described earlier, this whole IIDR crap inarguably broke UAPI and
> obviously normal guest behavior (i.e. reading the register). At minimum
> we need to permit previously-valid values for IIDR, even if they carry
> no implied behaviors.
But the whole *point* of IIDR is to preserve the behaviour. To set the
IIDR and *not* have the corresponding behaviour is insanity.
> > And as explained at least twice now, it's the
> > behaviour change that's *important* here.
> >
> > The fact that it's a long-standing bug in KVM which downstream has been
> > working around for a long time doesn't matter. The unconditional
> > behavioural change *is* a bug and we should fix it.
>
> That is the nature of a bug fix. If you can provide some concrete
> evidence of a guest depending on the RAZ/WI behavior then I agree we
> need to preserve the old behavior.
>
> Otherwise I see this as a matter of principle in how we do bug fixes to
> KVM. Even if upstream took the strictest possible stance towards behavior
> changes we will invariably fail to account for some minutia.
No. Don't pretend that this is hard. KVM on x86 has been quietly
getting this right for years.
Yes, there is sometimes *some* subjectivity around it, and it's
sometimes reasonable to just unilaterally change behaviours. This is
not, and was not, once of those cases.
> > > - Special-case the stupid GICv2 UAPI where IGROUPR are only writable if
> > > the VMM has written to IIDR and the revision >= 2
> >
> > That already *is* a special case, right? And you'd rather leave it as it is?
>
> Left as documented, yes. With the exception that revision == 1 writes
> not be considered opt-in to restorable IGROUPR.
Don't do that. Just leave it broken, with QEMU not even working. I'm
beyond caring about GICv2 now.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-05-19 23:34 UTC (permalink / raw)
To: David Woodhouse, kvm
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Sean Christopherson, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Dave Hansen, Vitaly Kuznetsov, x86, Marc Zyngier, Juergen Gross,
Boris Ostrovsky, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
Jack Allister, Joey Gouly, joe.jin, linux-doc, linux-kernel,
xen-devel, linux-kselftest
In-Reply-To: <32ca0a8da4bfb1e92013a7f75e0ff7541ebcd6a6.camel@infradead.org>
On 2026-05-19 3:43 PM, David Woodhouse wrote:
> On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
>> I think I now understand why I feel like I am always asking weird questions. I
>> have been thinking about how to account for downtime, so I see
>> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.
>
> I do not believe in "downtime". There is no such thing.
> There is only "steal time".
Or "leap seconds" as used in the document?
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org
>
> If I recall correctly what we described in
> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
> I don't think we actually needed KVM_SET_CLOCK at all, did we?
Here I partially copied the content from the link.
The 2nd step of destination VMM is to invoke KVM_SET_CLOCK ioctl.
---
From the destination VMM process:
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
- kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+4. Before creating the vCPUs, invoke the KVM_SET_TSC_KHZ ioctl on the VM, to
+ set the scaled frequency of the guest's TSC (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+ kvmclock (guest_src) and CLOCK_REALTIME (time_src) in their respective
fields. Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
structure.
- KVM will advance the VM's kvmclock to account for elapsed time since
- recording the clock values. Note that this will cause problems in
+ KVM will restore the VM's kvmclock, accounting for elapsed time since
+ the clock values were recorded. Note that this will cause problems in
the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
between the source and destination, and a reasonably short time passes
- between the source pausing the VMs and the destination executing
- steps 4-7.
+ between the source pausing the VMs and the destination resuming them.
+ Due to the KVM_[SG]ET_CLOCK API using CLOCK_REALTIME instead of
+ CLOCK_TAI, leap seconds during the migration may also introduce errors.
--
>>>
>>> So your "MASTERCLOCK_UPDATE is pending and there is no pending
>>> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
>>> pending, then there *will* be a CLOCK_UPDATE pending.
>>
>> Suppose the VM is stopped and the master clock is active.
>
> I don't know what it means for a VM to be 'stopped'. Do you mean that
> all vCPUs happen to be experiencing steal time at the present moment?
Taking QEMU as an example, all vCPU threads remain asleep in host userspace
without having a chance to invoke KVM_RUN. As a result, none of the vCPUs can
enter KVM kernel mode to process any pending requests.
This is the state before QEMU resumes from live migration or live update.
(qemu) stop
(qemu) info status
VM status: paused
According to my understanding, older KVM versions even required the userspace
VMM to keep vCPUs in userspace to avoid racing with KVM_RUN.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/x86.c?h=v5.15#n6090
case KVM_SET_CLOCK: {
... ...
/*
* TODO: userspace has to take care of races with VCPU_RUN, so
* kvm_gen_update_masterclock() can be cut down to locked
* pvclock_update_vm_gtod_copy().
*/
>
>> Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
>> may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
>> for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
>> kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
>> KVM_REQ_CLOCK_UPDATE.
>
> You say 'Unless'... do you mean 'Until'?
Until.
>
>> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is pending.
>>
>> I do not think this scenario is important. I am just curious about the expected
>> way to implement similar code in the future :)
>
> I think that's working correctly. Until the master clock has *actually*
> been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
> disseminate the new information to its own pvclock?
Thank you very much for helping confirm this.
>
>> For the live migration scenario, the current QEMU implementation not only fails
>> to account for downtime, but also has a drift issue. That is what I would like
>> to address in QEMU.
>
> Again, restore the gTSC as accurately as possible. Probably by working
> out for *yourself* the relationships of the source and destination host
> TSCs to real time, and then reconstituting on the destination using TSC
> offset just as for live migration.
>
> And then use KVM_SET_CLOCK_GUEST too.
>
> That's what I attempted to document in
> https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
> and should probably revive.
I would really appreciate it if this document could be revived. I don't see it
in your most recent v4 PATCH 7. It is very helpful as a guideline for how
userspace VMMs should take advantage of these APIs.
Thank you very much!
Dongli Zhang
^ permalink raw reply
* Re: [PATCH v11 4/5] platform/chrome: Protect cros_ec_device lifecycle with revocable
From: Jason Gunthorpe @ 2026-05-19 23:55 UTC (permalink / raw)
To: Tzung-Bi Shih, Danilo Krummrich
Cc: Arnd Bergmann, Greg Kroah-Hartman, Bartosz Golaszewski,
Linus Walleij, Benson Leung, linux-kernel, chrome-platform,
driver-core, linux-doc, linux-gpio, Rafael J. Wysocki,
Jonathan Corbet, Shuah Khan, Laurent Pinchart, Wolfram Sang,
Johan Hovold, Paul E . McKenney
In-Reply-To: <agiCQQO9KGoMS1Jj@tzungbi-laptop>
On Sat, May 16, 2026 at 10:42:09PM +0800, Tzung-Bi Shih wrote:
> On Thu, May 14, 2026 at 01:00:43PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 14, 2026 at 03:34:12AM +0000, Tzung-Bi Shih wrote:
> >
> > > To help me understand, could you elaborate on why the revocable mechanism
> > > isn't suitable here?
> >
> > Stay within one driver. Create the revokable is probe, consume it
> > within that drivers fops/etc, destroy it on remove. Do not randomly
> > pass it to other drivers.
>
> In that sense, after applying [1], does the patch make sense to you?
It is better, but you can see revokable is creating contortions that
don't make sense:
> @@ -223,11 +223,9 @@ static int cros_ec_chardev_open(struct inode *inode, struct file *filp)
> ret = blocking_notifier_chain_register(&pdata->subscribers,
> &priv->notifier);
> if (ret) {
> - scoped_guard(rwsem_read, &pdata->ec_dev_sem) {
> - if (pdata->ec_dev)
> - dev_err(pdata->ec_dev->dev,
> - "failed to register event notifier\n");
> - }
> + revocable_try_access_or_skip_scoped(&pdata->ec_rev, ec_dev)
> + dev_err(ec_dev->dev,
> + "failed to register event notifier\n");
It is impossible for ec_dev to be null here, the misc_unregister does
fence open.
> @@ -482,11 +483,12 @@ static int cros_ec_chardev_probe(struct platform_device *pdev)
> static void cros_ec_chardev_remove(struct platform_device *pdev)
> {
> struct chardev_pdata *pdata = platform_get_drvdata(pdev);
> + struct cros_ec_device *ec_dev;
>
> - blocking_notifier_chain_unregister(&pdata->ec_dev->event_notifier,
> - &pdata->relay);
> - scoped_guard(rwsem_write, &pdata->ec_dev_sem)
> - pdata->ec_dev = NULL;
> + revocable_try_access_or_skip_scoped(&pdata->ec_rev, ec_dev)
> + blocking_notifier_chain_unregister(&ec_dev->event_notifier,
> + &pdata->relay);
> + revocable_revoke(&pdata->ec_rev);
And this is complete garbage nonsense, we are in a driver bound
context about to revoke the revokable, it is not optional, it can't
fail, if it doesn't we can't skip the unregister or it will eventually
crash.
I said it before, but to re-iterate - what this scheme fails to
capture from rust is the most important detail - the driver bound
checking that confirms the content is valid without any need for
locking or possibility of failure.
Open, remove, are both bound contexts that can never fail to obtain
their protected content and don't need srcu locking.
I don't konw what Danilo thinks, but as the rust side has evolved I
think it was a mistake to combine the revocable and SRCU
together. Having two primitives would make more sense
The first is "this value is only valid under driver bound, present
your thing proving driver bound and you can get the value". This would
be fully 0 cost.
The second is "My callchain doesn't have a way to get driver bound,
so this widget will try to open a SRCU critical section that produces
it".
Each driver could have many of the first but needs only one of the
second. The second is the "code smell" that says something is not
great by not properly managing driver bound. Since a driver needs only
one of the widgets it would solve the repeated srcu problem on unbind.
From that lens you can see how troubled this C version is, it promotes
the "code smell" SRCU API into the only API and makes it first class
promoting its use and ignores/obfuscates/worsens the actual API we
want people to use: prove you have a driver bound. :(
Jason
^ permalink raw reply
* Re: [PATCH net-next v2 2/2] net: ti: icssg: Add HSR and LRE PA statistics
From: Jakub Kicinski @ 2026-05-19 23:56 UTC (permalink / raw)
To: Luka Gejak
Cc: MD Danish Anwar, Felix Maurer, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
Roger Quadros, Andrew Lunn, Meghana Malladi, Jacob Keller,
David Carlier, Vadim Fedorenko, Kevin Hao, netdev, linux-doc,
linux-kernel, linux-arm-kernel, Vladimir Oltean
In-Reply-To: <E30AAC96-01D2-4A23-B562-126087DEB7FA@linux.dev>
On Tue, 19 May 2026 07:55:55 +0200 Luka Gejak wrote:
> On May 19, 2026 3:45:06 AM GMT+02:00, Jakub Kicinski <kuba@kernel.org> wrote:
> >On Thu, 14 May 2026 13:26:05 +0530 MD Danish Anwar wrote:
> >> Add new firmware PA statistics counters for HSR and LRE to the ethtool
> >> statistics exposed by the ICSSG driver.
> >>
> >> New statistics added:
> >> - FW_HSR_FWD_CHECK_FAIL_DROP: Packets dropped on the HSR forwarding path
> >> - FW_HSR_HE_CHECK_FAIL_DROP: Packets dropped on the HSR host egress path
> >> - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES: Frames with duplicate discard
> >> skipped
> >> - FW_LRE_CNT_UNIQUE/DUPLICATE/MULTIPLE_RX: LRE duplicate detection
> >> counters
> >> - FW_LRE_CNT_RX/TX: LRE per-port frame counters
> >> - FW_LRE_CNT_OWN_RX: Own HSR tagged frames received
> >> - FW_LRE_CNT_ERRWRONGLAN: Frames with wrong LAN identifier (PRP)
> >>
> >> Document the new HSR/LRE statistics in icssg_prueth.rst.
> >
> >To an untrained eye these stats look like stuff that could
> >be standardized across drivers.
> >
> >Luka, Felix, others on CC, do you think we should expose these
> >from HSR over netlink as "standard" offload stats different drivers
> >can plug into or not worth it?
>
> I think there is a case for standardizing part of this, but I would
> not standardize the whole set as-is.
>
> The LRE counters look generic enough to me, especially:
> - unique rx
> - duplicate rx
> - multiple rx
> - rx / tx
> - own rx
> - wrong LAN, PRP only
>
> Those are protocol/LRE concepts rather than TI firmware details, so
> exposing them from the HSR/PRP layer sounds useful. I would expect
> both the software implementation and offloaded implementations to be
> able to provide at least some of them, with unsupported counters
> omitted or reported as not available.
> I would not put the firmware check/drop counters in the same standard
> bucket, though:
> - FW_HSR_FWD_CHECK_FAIL_DROP
> - FW_HSR_HE_CHECK_FAIL_DROP
> - FW_HSR_SKIP_HOST_DUP_DISCARD_FRAMES
Thanks for the breakdown!
> Those sound more like implementation/debug counters for the ICSSG
> firmware pipeline. They are still useful in ethtool driver stats, but
> I would be hesitant to bake their exact semantics into HSR UAPI.
> So my preference would be:
> 1. Keep driver-private ethtool stats for the full firmware counter set.
> 2. Add a small HSR/PRP standard stats set separately, limited to
> well-defined LRE counters.
> 3. Make the HSR layer expose them, with offload drivers plugging in via
> an optional callback or offload stats op.
> 4. Define the counters carefully, including whether they are per-HSR
> device or per-port A/B, and what PRP-only counters mean for HSR.
>
> I do not think this patch should blindly become the UAPI definition,
Not at all, the unique / multiple stats gave me pause. We should
only put in the standard API what can be easily and unambiguously
defined given the protocol spec.
> but I do think it points at a useful follow-up. If we want to avoid
> adding driver-private names first and then standardizing different
> names later, then it may be worth asking Danish to split the
> protocol-level LRE counters out and route those through a common HSR
> stats interface.
As a general policy we ask for standard stats to be added first and
ethtool to only contain what didn't fit in the standard ones.
There are some technical reasons but it's mostly a mindset thing.
^ permalink raw reply
* Re: [PATCH v2 5/6] gpio: remove machine hogs
From: Dmitry Torokhov @ 2026-05-20 0:46 UTC (permalink / raw)
To: Bartosz Golaszewski
Cc: Linus Walleij, Bartosz Golaszewski, Geert Uytterhoeven,
Frank Rowand, Mika Westerberg, Andy Shevchenko, Aaro Koskinen,
Janusz Krzysztofik, Tony Lindgren, Russell King, Jonathan Corbet,
Shuah Khan, linux-gpio, linux-kernel, linux-acpi,
linux-arm-kernel, linux-omap, linux-doc
In-Reply-To: <20260309-gpio-hog-fwnode-v2-5-4e61f3dbf06a@oss.qualcomm.com>
On Mon, Mar 09, 2026 at 01:42:41PM +0100, Bartosz Golaszewski wrote:
> With no more users, remove legacy machine hog API from the kernel.
>
> Reviewed-by: Linus Walleij <linusw@kernel.org>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Argh! What is the replacement for this? I have patches for rsk7203 to
use them to get rid of legacy gpio use, like this:
diff --git a/arch/sh/boards/mach-rsk/devices-rsk7203.c b/arch/sh/boards/mach-rsk/devices-rsk7203.c
index f8760a91e2f1..5bbd3b31cffb 100644
--- a/arch/sh/boards/mach-rsk/devices-rsk7203.c
+++ b/arch/sh/boards/mach-rsk/devices-rsk7203.c
@@ -12,7 +12,7 @@
#include <linux/smsc911x.h>
#include <linux/input.h>
#include <linux/io.h>
-#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
#include <linux/gpio/machine.h>
#include <linux/gpio/property.h>
#include <asm/machvec.h>
@@ -165,6 +165,19 @@ static const struct platform_device_info rsk7203_devices[] __initconst = {
},
};
+/* The base of the function GPIOs in the flat enum */
+#define SH7203_FN_BASE GPIO_FN_PINT7_PB
+
+static struct gpiod_hog rsk7203_gpio_hogs[] = {
+ GPIO_HOG("sh7203_pfc-fn", GPIO_FN_TXD0 - SH7203_FN_BASE,
+ "TXD0", GPIO_ACTIVE_HIGH, GPIOD_ASIS),
+ GPIO_HOG("sh7203_pfc-fn", GPIO_FN_RXD0 - SH7203_FN_BASE,
+ "RXD0", GPIO_ACTIVE_HIGH, GPIOD_ASIS),
+ GPIO_HOG("sh7203_pfc-fn", GPIO_FN_IRQ0_PB - SH7203_FN_BASE,
+ "IRQ0_PB", GPIO_ACTIVE_HIGH, GPIOD_ASIS),
+ { }
+};
+
static int __init rsk7203_devices_setup(void)
{
struct platform_device *pd;
@@ -172,12 +185,10 @@ static int __init rsk7203_devices_setup(void)
int i;
/* Select pins for SCIF0 */
- gpio_request(GPIO_FN_TXD0, NULL);
- gpio_request(GPIO_FN_RXD0, NULL);
+ gpiod_add_hogs(rsk7203_gpio_hogs);
/* Setup LAN9118: CS1 in 16-bit Big Endian Mode, IRQ0 at Port B */
__raw_writel(0x36db0400, 0xfffc0008); /* CS1BCR */
- gpio_request(GPIO_FN_IRQ0_PB, NULL);
error = software_node_register_node_group(rsk7203_swnodes);
if (error) {
If there is no replacement maybe we can resurrect this? Or shoudl we
have add swnode support for hogs?
Thanks.
--
Dmitry
^ permalink raw reply related
* [PATCH RESEND 0/3] mm/damon: reposting three reviewed patches
From: SeongJae Park @ 2026-05-20 1:20 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
This series reposts patches that were reviewed by the DAMON maintainer
but not yet added to mm-new, for more visibility. From their last
posted versions, only minor changes including commit subject update,
Reviewed-by: and Acked-by: tags collection, and rebasing to latest
mm-new were made by the DAMON maintainer.
Sakurai Shun (1):
Docs/mm/damon/design: fix three typos
Zenghui Yu (1):
Docs/{ABI,admin-guide}/damon: fix various typoes
niecheng (1):
mm/damon/core: clarify next_intervals_tune_sis update path
.../ABI/testing/sysfs-kernel-mm-damon | 2 +-
Documentation/admin-guide/mm/damon/usage.rst | 18 +++++++++---------
Documentation/mm/damon/design.rst | 6 +++---
mm/damon/core.c | 3 +++
4 files changed, 16 insertions(+), 13 deletions(-)
base-commit: f4f9ecd6383da01075a0be8f9c08b82152061fd0
--
2.47.3
^ permalink raw reply
* [PATCH RESEND 2/3] Docs/mm/damon/design: fix three typos
From: SeongJae Park @ 2026-05-20 1:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Sakurai Shun, Liam R. Howlett, David Hildenbrand, Jonathan Corbet,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, SeongJae Park,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260520012104.93602-1-sj@kernel.org>
From: Sakurai Shun <ssh1326@icloud.com>
L140: "unsinged" -> "unsigned"
L371: "sampleing" -> "sampling"
L387: "multipled" -> "multiplied"
Signed-off-by: Sakurai Shun <ssh1326@icloud.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Changes from v2
- v2: https://lore.kernel.org/20260517073433.3015-1-ssh1326@icloud.com
- Collect Reviewed-by: and Acked-by: tags.
- Update commit subject prefix.
- Rebase to latest mm-new.
Changes from v1
- v1: https://lore.kernel.org/20260516093552.8404-1-cheesecake2960@icloud.com
- Use real author name.
Documentation/mm/damon/design.rst | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index a24f9f00d1837..2da7ca0d3d17a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -147,7 +147,7 @@ as Idle page tracking does.
Address Unit
------------
-DAMON core layer uses ``unsinged long`` type for monitoring target address
+DAMON core layer uses ``unsigned long`` type for monitoring target address
ranges. In some cases, the address space for a given operations set could be
too large to be handled with the type. ARM (32-bit) with large physical
address extension is an example. For such cases, a per-operations set
@@ -417,7 +417,7 @@ with theoretical maximum ``nr_accesses``, which can be calculated as
``aggregation interval / sampling interval``.
The mechanism calculates the ratio of access events for ``aggrs`` aggregations,
-and increases or decrease the ``sampleing interval`` and ``aggregation
+and increases or decrease the ``sampling interval`` and ``aggregation
interval`` in same ratio, if the observed access ratio is lower or higher than
the target, respectively. The ratio of the intervals change is decided in
proportion to the distance between current samples ratio and the target ratio.
@@ -433,7 +433,7 @@ The tuning is turned off by default, and need to be set explicitly by the user.
As a rule of thumbs and the Parreto principle, 4% access samples ratio target
is recommended. Note that Parreto principle (80/20 rule) has applied twice.
That is, assumes 4% (20% of 20%) DAMON-observed access events ratio (source)
-to capture 64% (80% multipled by 80%) real access events (outcomes).
+to capture 64% (80% multiplied by 80%) real access events (outcomes).
To know how user-space can use this feature via :ref:`DAMON sysfs interface
<sysfs_interface>`, refer to :ref:`intervals_goal
--
2.47.3
^ permalink raw reply related
* [PATCH RESEND 3/3] Docs/{ABI,admin-guide}/damon: fix various typoes
From: SeongJae Park @ 2026-05-20 1:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Zenghui Yu, Liam R. Howlett, David Hildenbrand, Jonathan Corbet,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, SeongJae Park,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260520012104.93602-1-sj@kernel.org>
From: Zenghui Yu <zenghui.yu@linux.dev>
``damon_target_idx`` was wrongly written as ``target_idx`` in the docs. Fix
it all over the place, as well as the wrong directory count, grammar, etc.
Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Changes from v1
- v1: https://lore.kernel.org/20260517182624.7167-1-zenghui.yu@linux.dev
- Collect Reviewed-by: tag.
- Rebase to latest mm-new.
.../ABI/testing/sysfs-kernel-mm-damon | 2 +-
Documentation/admin-guide/mm/damon/usage.rst | 18 +++++++++---------
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index ee29d4e204ffa..b73e6bc28ea5f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -452,7 +452,7 @@ Description: If 'hugepage_size' is written to the 'type' file, writing to
or reading from this file sets or gets the maximum size of the
hugepage for the filter.
-What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/core_filters/<F>/target_idx
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/core_filters/<F>/damon_target_idx
Date: Feb 2025
Contact: SeongJae Park <sj@kernel.org>
Description: If 'target' is written to the 'type' file, writing to or
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 0d6a27dc97b0a..d46875e603d86 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -97,7 +97,7 @@ comma (",").
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
- │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
+ │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,damon_target_idx,min,max
│ │ │ │ │ │ │ :ref:`dests <damon_sysfs_dests>`/nr_dests
│ │ │ │ │ │ │ │ 0/id,weight
│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds,nr_snapshots,max_nr_snapshots
@@ -374,7 +374,7 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
schemes/<N>/
------------
-In each scheme directory, eight directories (``access_pattern``, ``quotas``,
+In each scheme directory, nine directories (``access_pattern``, ``quotas``,
``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``,
``stats``, and ``tried_regions``) and three files (``action``, ``target_nid``
and ``apply_interval``) exist.
@@ -492,7 +492,7 @@ given DAMON-based operation scheme.
Under the watermarks directory, five files (``metric``, ``interval_us``,
``high``, ``mid``, and ``low``) for setting the metric, the time interval
between check of the metric, and the three watermarks exist. You can set and
-get the five values by writing to the files, respectively.
+get the five values by writing to and reading from the files, respectively.
Keywords and meanings of those that can be written to the ``metric`` file are
as below.
@@ -500,7 +500,7 @@ as below.
- none: Ignore the watermarks
- free_mem_rate: System's free memory rate (per thousand)
-The ``interval`` should written in microseconds unit.
+The ``interval_us`` should be written in microseconds unit.
.. _sysfs_filters:
@@ -528,9 +528,9 @@ in the numeric order.
Each filter directory contains nine files, namely ``type``, ``matching``,
``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max``
-and ``target_idx``. To ``type`` file, you can write the type of the filter.
-Refer to :ref:`the design doc <damon_design_damos_filters>` for available type
-names, their meaning and on what layer those are handled.
+and ``damon_target_idx``. To ``type`` file, you can write the type of the
+filter. Refer to :ref:`the design doc <damon_design_damos_filters>` for
+available type names, their meaning and on what layer those are handled.
For ``memcg`` type, you can specify the memory cgroup of the interest by
writing the path of the memory cgroup from the cgroups mount point to
@@ -540,7 +540,7 @@ files, respectively. For ``hugepage_size`` type, you can specify the minimum
and maximum size of the range (closed interval) to ``min`` and ``max`` files,
respectively. For ``target`` type, you can specify the index of the target
between the list of the DAMON context's monitoring targets list to
-``target_idx`` file.
+``damon_target_idx`` file.
You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter
is for memory that matches the ``type``. You can write ``Y`` or ``N`` to
@@ -731,7 +731,7 @@ show results using tracepoint supporting tools like ``perf``. For example::
Each line of the perf script output represents each monitoring region. The
first five fields are as usual other tracepoint outputs. The sixth field
-(``target_id=X``) shows the ide of the monitoring target of the region. The
+(``target_id=X``) shows the id of the monitoring target of the region. The
seventh field (``nr_regions=X``) shows the total number of monitoring regions
for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
--
2.47.3
^ permalink raw reply related
* [PATCH v7 0/4] Add MSI Claw HID Configuration Driver
From: Derek J. Clark @ 2026-05-20 1:31 UTC (permalink / raw)
To: Jiri Kosina, Benjamin Tissoires
Cc: Pierre-Loup A . Griffais, Denis Benato, Zhouwang Huang,
Derek J . Clark, linux-input, linux-doc, linux-kernel
This series adds an HID Configuration driver for the MSI Claw line of
Handheld Gaming PC's. The MSI Claw HID interface provides multiple
features, such as the ability to switch between xinput, dinput, and a
desktop mode, RGB control, rumble intensity, and mapping of the rear "M"
keys. There are additional gamepad modes that are not included in this
driver as they appear to be used in assembly line testing or are
incomplete in the firmware. During my testing I found them to be unstable.
The initial version of this driver was written by Denis Benato, which
contained the initial reverse-engineering and implementation for the
gamepad mode switching. This work was later expanded by Zhouwang Huang
to include more gamepad modes and additional features. Finally, I
refactored the entire driver, fixed multiple bugs, and refined the overall
format to conform to kernel driver best practices and style guide.
Claude was used initially by Zhouwang Huang to quickly parse HID captures
during the reverse-engineering of some of the features. Since Claude had
already been used, as a test of its capabilities I had it implement the
rumble intensity attribute after I had already rewritten most of the
driver, which I then manually edited to fix some mistakes. I also used
Claude to review the driver and these patches for any mistakes and bugs.
Assisted-by: Claude:claude-sonnet-4-6
Co-developed-by: Denis Benato <denis.benato@linux.dev>
Signed-off-by: Denis Benato <denis.benato@linux.dev>
Co-developed-by: Zhouwang Huang <honjow311@gmail.com>
Signed-off-by: Zhouwang Huang <honjow311@gmail.com>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
---
v7:
- Use smp_[store_release|load_acquire] pattern for checking
gamepad_registered and rgb_registered to avoid possible races during
teardown.
- Reorder reinit_completion in claw_hw_output_report to avoid race
with possible incoming ACKs.
- Reorder cancel_delayed_work_sync to ensure setup can't be re-armed
after cancel.
- Reset command state machine if hw_output_report has an error.
- Add comments to (hopefully) silence sashinko-bot warnings about the
use of endpoint matching and the impossible scenario of switching to
the alternate endpoint from userspace while the driver is bound.
- Don't use spinlock_irqsave when already in irq context.
- Add profile_lock for read/write profile_pending.
- Use struct for mkey reports and rumble reports, following the
pattern established by rgb reports previously.
- Add gating to cfg_setup_fn, allowing either gamepad settings or rgb
settings to populate if the other fails for any reason.
- Match on write address for rumble and mkey reports to prevent late
ACK from causing synchronization errors.
v6: https://lore.kernel.org/linux-input/20260518222935.1802071-1-derekjohn.clark@gmail.com/
- Add send/ack pattern to ensure synchronous acks.
- Use spinlock_irqsave instead of mutex for read/write MODE event
data.
- add select NEW_LEDS to kconfig.
- Make all timeouts 25ms to ensure at least 2 jiffies in a 100Hz
config.
- Gate all attribute show/store functions with gamepad_registered or
rgb_registered, enabling use of devm_device_add_group and ending
the need to hold a mutex during remove.
- Don't set gamepad_mode on resume, MCU preserves state.
- Ensure all count variables are checked for > 0 characters before
setting buf - 1 to \n.
- Re-arm cfg_setup in resume if it was canceled in an early suspend.
- Remove duplicated argv_free macro.
- Add spinlock_irqsave vice mutex for read/write access on attribute
variables.
v5: https://lore.kernel.org/linux-input/20260517013925.3120314-1-derekjohn.clark@gmail.com/
- Swap disabled & combination mkeys_function enum values.
- Fix bug introduced in v5 where claw_buttons_store would return
-EINVAL on all valid key entries.
- Ensure mode_mutex is properly init.
- Ensure claw_remove is calling hid_hw_close and not hid_hw_stop for
all paths.
- Ensure adding "DISABLED" key to valid entries is done in the correct
patch.
- Re-enable sending an empty string to clear button mappings in
addition to setting DISABLED.
- Move adding the RGB device into cfg_setup to prevent led core
attributes from being written to prior to setup completing.
- Ensure frame_lock is properly init.
- Change variable names in RGB functions from frame and zone to f and
z respectively to fit all scoped_guard actions in 100 columns.
v4: https://lore.kernel.org/linux-input/20260516042841.500299-1-derekjohn.clark@gmail.com/
- Add msi_suspend/claw_suspend.
- Reorder claw_remove to cancel all work before removing sysfs.
- Add mutex lock for removing sysfs attributes.
- Add mutex lock for MODE command data read/write.
- Change dev_warn to dev_dbg in claw_profile_event.
- use __free with DEFINE_FREE macro for argv instead of manually
running argv_free, cleaining up scoped_guard goto.
- Fix frame_calc validity check to use >=.
- Use spinlock instead of mutex in raw_event and related attribute
_store function.
- Ensure delayed work is canceled in suspend & canceled before sysfs
attribute removal.
v3: https://lore.kernel.org/linux-input/20260515033622.2095277-1-derekjohn.clark@gmail.com/
- Add mutex for read/write if rgb frame data.
- Ensure claw_hw_output_report is properly guarded.
- Remove setting rgb_frame_count when reading rgb profiles as it always
returns garbage data.
- Ensure rgb_speed is getting drvdata from a valid lookup (not hdev).
- Use scoped_guard where necessary.
- Reoder claw_probe to ensure all mutex, completion, and variable
assignments are in place prior to setting drvdata.
- Ensure gamepad_mode is set to a valid enum value in claw_probe.
v2: https://lore.kernel.org/linux-input/20260513231445.3213501-1-derekjohn.clark@gmail.com/
- Use mutexes to guard SYNC_TO_ROM calls and pending_profile calls.
- Rename driver to hid-msi and add generic entrypoints for
probe/resume/remove that call claw specific functions in order to
future proof the driver for other MSI HID interfaces.
- Fix various bugs and formatting issues.
v1: https://lore.kernel.org/linux-input/20260510043510.442807-1-derekjohn.clark@gmail.com/
Derek J. Clark (4):
HID: hid-msi: Add MSI Claw configuration driver
HID: hid-msi: Add M-key mapping attributes
HID: hid-msi: Add RGB control interface
HID: hid-msi: Add Rumble Intensity Attributes
MAINTAINERS | 6 +
drivers/hid/Kconfig | 13 +
drivers/hid/Makefile | 1 +
drivers/hid/hid-ids.h | 5 +
drivers/hid/hid-msi.c | 1911 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 1936 insertions(+)
create mode 100644 drivers/hid/hid-msi.c
--
2.53.0
^ permalink raw reply
* [PATCH v7 1/4] HID: hid-msi: Add MSI Claw configuration driver
From: Derek J. Clark @ 2026-05-20 1:31 UTC (permalink / raw)
To: Jiri Kosina, Benjamin Tissoires
Cc: Pierre-Loup A . Griffais, Denis Benato, Zhouwang Huang,
Derek J . Clark, linux-input, linux-doc, linux-kernel
In-Reply-To: <20260520013158.3633277-1-derekjohn.clark@gmail.com>
Adds configuration HID driver for the MSI Claw series of handheld PC's.
In this initial patch add the initial driver outline and attributes for
changing the gamepad mode, M-key behavior, and add a WO reset function.
Sending the SWITCH_MODE and RESET commands causes a USB disconnect in
the device. The completion will therefore never get hit and would trigger
an -EIO. To avoid showing the user an error for every write to these
attrs a bypass for the completion handling is introduced when timeout ==
0.
The initial version of this patch was written by Denis Benato, which
contained the initial reverse-engineering and implementation for the
gamepad mode switching. This work was later expanded by Zhouwang Huang
to include more gamepad modes. Finally, I refactored the drivers data
in/out flow and overall format to conform to kernel driver best
practices and style guides. Claude was used as an initial reviewer of
this patch.
Assisted-by: Claude:claude-sonnet-4-6
Co-developed-by: Denis Benato <denis.benato@linux.dev>
Signed-off-by: Denis Benato <denis.benato@linux.dev>
Co-developed-by: Zhouwang Huang <honjow311@gmail.com>
Signed-off-by: Zhouwang Huang <honjow311@gmail.com>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
---
v7:
- Use smp_[store_release|load_acquire] pattern for checking
gamepad_registered to avoid possible races during teardown.
- Reorder reinit_completion in claw_hw_output_report to avoid race
with possible incoming ACKs.
- Reorder cancel_delayed_work_sync to ensure setup can't be re-armed
after cancel.
- Reset command state machine if hw_output_report has an error.
- Add comments to (hopefully) silence sashinko-bot warnings about the
use of endpoint matching and the impossible scenario of switching to
the alternate endpoint from userspace while the driver is bound.
- Don't use spinlock_irqsave when already in irq context.
v6:
- Add send/ack pattern to ensure synchronous acks.
- Use spinlock_irqsave instead of mutex for read/write MODE event
data.
- add select NEW_LEDS to kconfig.
- Make all timeouts 25ms to ensure at least 2 jiffies in a 100Hz
config.
- Gate all attribute show/store functions with gamepad_registered,
enabling use of devm_device_add_group.
- Re-arm cfg_setup in resume if it was canceled in an early suspend.
- Don't set gamepad_mode on resume, MCU preserves state.
- Ensure all count variables are checked for > 0 characters before
setting buf - 1 to \n.
v5:
- Swap disabled & combination mkeys_function enum values.
- Ensure mode_mutex is properly init.
- Ensure claw_remove is calling hid_hw_close and not hid_hw_stop for
all paths.
v4:
- Add msi_suspend/claw_suspend.
- Reorder claw_remove to cancel all work before removing sysfs.
- Add mutex lock for removing sysfs attributes.
- Add mutex lock for MODE command data read/write.
v3:
- Ensure claw_hw_output_report is properly guarded.
- Reoder claw_probe to ensure all mutex, completion, and variable
assignments are in place prior to setting drvdata.
- Ensure gamepad_mode is set to a valid enum value in claw_probe.
v2:
- Rename driver to hid-msi from hid-msi-claw.
- Rename reusable/generic functions to msi_* from claw_*, retaining
claw specific functions.
- Add generic entrypoints for probe, remove, and raw event that route
to claw specific functions.
---
MAINTAINERS | 6 +
drivers/hid/Kconfig | 13 +
drivers/hid/Makefile | 1 +
drivers/hid/hid-ids.h | 5 +
drivers/hid/hid-msi.c | 692 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 717 insertions(+)
create mode 100644 drivers/hid/hid-msi.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 6f6517bf4f970..8e2de98b768f7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17965,6 +17965,12 @@ S: Odd Fixes
F: Documentation/devicetree/bindings/net/ieee802154/mrf24j40.txt
F: drivers/net/ieee802154/mrf24j40.c
+MSI HID DRIVER
+M: Derek J. Clark <derekjohn.clark@gmail.com>
+L: linux-input@vger.kernel.org
+S: Maintained
+F: drivers/hid/hid-msi.c
+
MSI EC DRIVER
M: Nikita Kravets <teackot@gmail.com>
L: platform-driver-x86@vger.kernel.org
diff --git a/drivers/hid/Kconfig b/drivers/hid/Kconfig
index 10c12d8e65579..7766676051a52 100644
--- a/drivers/hid/Kconfig
+++ b/drivers/hid/Kconfig
@@ -492,6 +492,19 @@ config HID_GT683R
Currently the following devices are know to be supported:
- MSI GT683R
+config HID_MSI
+ tristate "MSI Claw Gamepad Support"
+ depends on USB_HID
+ select NEW_LEDS
+ select LEDS_CLASS
+ select LEDS_CLASS_MULTICOLOR
+ help
+ Support for the MSI Claw RGB and controller configuration
+
+ Say Y here to include configuration interface support for the MSI Claw Line
+ of Handheld Console Controllers. Say M here to compile this driver as a
+ module. The module will be called hid-msi.
+
config HID_KEYTOUCH
tristate "Keytouch HID devices"
help
diff --git a/drivers/hid/Makefile b/drivers/hid/Makefile
index 07dfdb6a49c59..80925a17b059c 100644
--- a/drivers/hid/Makefile
+++ b/drivers/hid/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_HID_MAYFLASH) += hid-mf.o
obj-$(CONFIG_HID_MEGAWORLD_FF) += hid-megaworld.o
obj-$(CONFIG_HID_MICROSOFT) += hid-microsoft.o
obj-$(CONFIG_HID_MONTEREY) += hid-monterey.o
+obj-$(CONFIG_HID_MSI) += hid-msi.o
obj-$(CONFIG_HID_MULTITOUCH) += hid-multitouch.o
obj-$(CONFIG_HID_NINTENDO) += hid-nintendo.o
obj-$(CONFIG_HID_NTI) += hid-nti.o
diff --git a/drivers/hid/hid-ids.h b/drivers/hid/hid-ids.h
index 933b7943bdb50..94a9b89dc240a 100644
--- a/drivers/hid/hid-ids.h
+++ b/drivers/hid/hid-ids.h
@@ -1047,7 +1047,12 @@
#define USB_DEVICE_ID_MOZA_R16_R21_2 0x0010
#define USB_VENDOR_ID_MSI 0x1770
+#define USB_VENDOR_ID_MSI_2 0x0db0
#define USB_DEVICE_ID_MSI_GT683R_LED_PANEL 0xff00
+#define USB_DEVICE_ID_MSI_CLAW_XINPUT 0x1901
+#define USB_DEVICE_ID_MSI_CLAW_DINPUT 0x1902
+#define USB_DEVICE_ID_MSI_CLAW_DESKTOP 0x1903
+#define USB_DEVICE_ID_MSI_CLAW_BIOS 0x1904
#define USB_VENDOR_ID_NATIONAL_SEMICONDUCTOR 0x0400
#define USB_DEVICE_ID_N_S_HARMONY 0xc359
diff --git a/drivers/hid/hid-msi.c b/drivers/hid/hid-msi.c
new file mode 100644
index 0000000000000..d95483907a5e5
--- /dev/null
+++ b/drivers/hid/hid-msi.c
@@ -0,0 +1,692 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * HID driver for MSI Claw Handheld PC gamepads.
+ *
+ * Provides configuration support for the MSI Claw series of handheld PC
+ * gamepads. Multiple iterations of the device firmware has led to some
+ * quirks for how certain attributes are handled. The original firmware
+ * did not support remapping of the M1 (right) and M2 (left) rear paddles.
+ * Additionally, the MCU RAM address for writing configuration data has
+ * changed twice. Checks are done during probe to enumerate these variances.
+ *
+ * Copyright (c) 2026 Zhouwang Huang <honjow311@gmail.com>
+ * Copyright (c) 2026 Denis Benato <denis.benato@linux.dev>
+ * Copyright (c) 2026 Valve Corporation
+ */
+
+#include <linux/array_size.h>
+#include <linux/cleanup.h>
+#include <linux/completion.h>
+#include <linux/container_of.h>
+#include <linux/device.h>
+#include <linux/hid.h>
+#include <linux/kobject.h>
+#include <linux/leds.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pm.h>
+#include <linux/spinlock.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+#include <linux/unaligned.h>
+#include <linux/usb.h>
+#include <linux/workqueue.h>
+
+#include "hid-ids.h"
+
+#define CLAW_OUTPUT_REPORT_ID 0x0f
+#define CLAW_INPUT_REPORT_ID 0x10
+
+#define CLAW_PACKET_SIZE 64
+
+#define CLAW_DINPUT_CFG_INTF_IN 0x82
+#define CLAW_XINPUT_CFG_INTF_IN 0x83
+
+enum claw_command_index {
+ CLAW_COMMAND_TYPE_NONE = 0x00,
+ CLAW_COMMAND_TYPE_READ_PROFILE = 0x04,
+ CLAW_COMMAND_TYPE_READ_PROFILE_ACK = 0x05,
+ CLAW_COMMAND_TYPE_ACK = 0x06,
+ CLAW_COMMAND_TYPE_WRITE_PROFILE_DATA = 0x21,
+ CLAW_COMMAND_TYPE_SYNC_TO_ROM = 0x22,
+ CLAW_COMMAND_TYPE_SWITCH_MODE = 0x24,
+ CLAW_COMMAND_TYPE_READ_GAMEPAD_MODE = 0x26,
+ CLAW_COMMAND_TYPE_GAMEPAD_MODE_ACK = 0x27,
+ CLAW_COMMAND_TYPE_RESET_DEVICE = 0x28,
+};
+
+enum claw_gamepad_mode_index {
+ CLAW_GAMEPAD_MODE_XINPUT = 0x01,
+ CLAW_GAMEPAD_MODE_DINPUT = 0x02,
+ CLAW_GAMEPAD_MODE_DESKTOP = 0x04,
+};
+
+static const char * const claw_gamepad_mode_text[] = {
+ [CLAW_GAMEPAD_MODE_XINPUT] = "xinput",
+ [CLAW_GAMEPAD_MODE_DINPUT] = "dinput",
+ [CLAW_GAMEPAD_MODE_DESKTOP] = "desktop",
+};
+
+enum claw_mkeys_function_index {
+ CLAW_MKEY_FUNCTION_MACRO,
+ CLAW_MKEY_FUNCTION_DISABLED,
+ CLAW_MKEY_FUNCTION_COMBO,
+};
+
+static const char * const claw_mkeys_function_text[] = {
+ [CLAW_MKEY_FUNCTION_MACRO] = "macro",
+ [CLAW_MKEY_FUNCTION_DISABLED] = "disabled",
+ [CLAW_MKEY_FUNCTION_COMBO] = "combination",
+};
+
+struct claw_command_report {
+ u8 report_id;
+ u8 padding[2];
+ u8 header_tail;
+ u8 cmd;
+ u8 data[59];
+} __packed;
+
+struct claw_drvdata {
+ /* MCU General Variables */
+ struct completion send_cmd_complete;
+ struct delayed_work cfg_resume;
+ struct delayed_work cfg_setup;
+ struct hid_device *hdev;
+ struct mutex cfg_mutex; /* mutex for synchronous data */
+ bool waiting_for_ack;
+ spinlock_t cmd_lock; /* Lock for cmd data read/write */
+ u8 waiting_cmd;
+ int cmd_status;
+ u8 ep;
+
+ /* Gamepad Variables */
+ enum claw_mkeys_function_index mkeys_function;
+ enum claw_gamepad_mode_index gamepad_mode;
+ bool gamepad_registered;
+ spinlock_t mode_lock; /* Lock for mode data read/write */
+};
+
+static int get_endpoint_address(struct hid_device *hdev)
+{
+ struct usb_host_endpoint *ep;
+ struct usb_interface *intf;
+
+ intf = to_usb_interface(hdev->dev.parent);
+ ep = intf->cur_altsetting->endpoint;
+ if (ep)
+ return ep->desc.bEndpointAddress;
+
+ return -ENODEV;
+}
+
+static int claw_gamepad_mode_event(struct claw_drvdata *drvdata,
+ struct claw_command_report *cmd_rep)
+{
+ if (cmd_rep->data[0] >= ARRAY_SIZE(claw_gamepad_mode_text) ||
+ !claw_gamepad_mode_text[cmd_rep->data[0]] ||
+ cmd_rep->data[1] >= ARRAY_SIZE(claw_mkeys_function_text))
+ return -EINVAL;
+
+ scoped_guard(spinlock, &drvdata->mode_lock) {
+ drvdata->gamepad_mode = cmd_rep->data[0];
+ drvdata->mkeys_function = cmd_rep->data[1];
+ }
+
+ return 0;
+}
+
+static int claw_raw_event(struct claw_drvdata *drvdata, struct hid_report *report,
+ u8 *data, int size)
+{
+ struct claw_command_report *cmd_rep;
+ int ret = 0;
+
+ if (size != CLAW_PACKET_SIZE)
+ return 0;
+
+ cmd_rep = (struct claw_command_report *)data;
+
+ if (cmd_rep->report_id != CLAW_INPUT_REPORT_ID || cmd_rep->header_tail != 0x3c)
+ return 0;
+
+ dev_dbg(&drvdata->hdev->dev, "Rx data as raw input report: [%*ph]\n",
+ CLAW_PACKET_SIZE, data);
+
+ switch (cmd_rep->cmd) {
+ case CLAW_COMMAND_TYPE_GAMEPAD_MODE_ACK:
+ ret = claw_gamepad_mode_event(drvdata, cmd_rep);
+
+ scoped_guard(spinlock, &drvdata->cmd_lock) {
+ if (drvdata->waiting_for_ack &&
+ drvdata->waiting_cmd == CLAW_COMMAND_TYPE_READ_GAMEPAD_MODE) {
+ drvdata->cmd_status = ret;
+ drvdata->waiting_for_ack = false;
+ complete(&drvdata->send_cmd_complete);
+ }
+ }
+
+ break;
+ case CLAW_COMMAND_TYPE_ACK:
+ scoped_guard(spinlock, &drvdata->cmd_lock) {
+ if (drvdata->waiting_for_ack) {
+ drvdata->cmd_status = 0;
+ drvdata->waiting_for_ack = false;
+ complete(&drvdata->send_cmd_complete);
+ }
+ dev_dbg(&drvdata->hdev->dev, "Waiting CMD: %x\n", drvdata->waiting_cmd);
+ }
+
+ break;
+ default:
+ dev_dbg(&drvdata->hdev->dev, "Unknown command: %x\n", cmd_rep->cmd);
+ return 0;
+ }
+
+ return ret;
+}
+
+static int msi_raw_event(struct hid_device *hdev, struct hid_report *report,
+ u8 *data, int size)
+{
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+
+ if (!drvdata || (drvdata->ep != CLAW_XINPUT_CFG_INTF_IN &&
+ drvdata->ep != CLAW_DINPUT_CFG_INTF_IN))
+ return 0;
+
+ return claw_raw_event(drvdata, report, data, size);
+}
+
+static int claw_hw_output_report(struct hid_device *hdev, u8 index, u8 *data,
+ size_t len, unsigned int timeout)
+{
+ unsigned char *dmabuf __free(kfree) = NULL;
+ u8 header[] = { CLAW_OUTPUT_REPORT_ID, 0, 0, 0x3c, index };
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+ size_t header_size = ARRAY_SIZE(header);
+ int ret;
+
+ if (header_size + len > CLAW_PACKET_SIZE)
+ return -EINVAL;
+
+ /* We can't use a devm_alloc reusable buffer without side effects during suspend */
+ dmabuf = kzalloc(CLAW_PACKET_SIZE, GFP_KERNEL);
+ if (!dmabuf)
+ return -ENOMEM;
+
+ memcpy(dmabuf, header, header_size);
+ if (data && len)
+ memcpy(dmabuf + header_size, data, len);
+
+ guard(mutex)(&drvdata->cfg_mutex);
+ if (timeout) {
+ reinit_completion(&drvdata->send_cmd_complete);
+ scoped_guard(spinlock_irqsave, &drvdata->cmd_lock) {
+ drvdata->waiting_cmd = index;
+ drvdata->waiting_for_ack = true;
+ drvdata->cmd_status = -ETIMEDOUT;
+ }
+ }
+
+ dev_dbg(&hdev->dev, "Send data as raw output report: [%*ph]\n",
+ CLAW_PACKET_SIZE, dmabuf);
+
+ ret = hid_hw_output_report(hdev, dmabuf, CLAW_PACKET_SIZE);
+ if (ret < 0)
+ return ret;
+
+ ret = ret == CLAW_PACKET_SIZE ? 0 : -EIO;
+ if (ret) {
+ scoped_guard(spinlock_irqsave, &drvdata->cmd_lock) {
+ drvdata->waiting_cmd = CLAW_COMMAND_TYPE_NONE;
+ drvdata->waiting_for_ack = false;
+ }
+ return ret;
+ }
+
+ if (timeout) {
+ ret = wait_for_completion_interruptible_timeout(&drvdata->send_cmd_complete,
+ msecs_to_jiffies(timeout));
+
+ dev_dbg(&hdev->dev, "Remaining timeout: %u\n", ret);
+ ret = ret > 0 ? drvdata->cmd_status : ret ?: -EBUSY;
+ scoped_guard(spinlock_irqsave, &drvdata->cmd_lock)
+ drvdata->waiting_for_ack = false;
+ }
+
+ return ret;
+}
+
+static ssize_t gamepad_mode_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct hid_device *hdev = to_hid_device(dev);
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+ int i, ret = -EINVAL;
+ u8 data[2];
+
+ /* Pairs with smp_store_release from cfg_setup_fn in system_wq context */
+ if (!smp_load_acquire(&drvdata->gamepad_registered))
+ return -ENODEV;
+
+ for (i = 0; i < ARRAY_SIZE(claw_gamepad_mode_text); i++) {
+ if (claw_gamepad_mode_text[i] && sysfs_streq(buf, claw_gamepad_mode_text[i])) {
+ ret = i;
+ break;
+ }
+ }
+ if (ret < 0)
+ return ret;
+
+ data[0] = ret;
+ scoped_guard(spinlock_irqsave, &drvdata->mode_lock)
+ data[1] = drvdata->mkeys_function;
+
+ ret = claw_hw_output_report(hdev, CLAW_COMMAND_TYPE_SWITCH_MODE, data, ARRAY_SIZE(data), 0);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+static ssize_t gamepad_mode_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct hid_device *hdev = to_hid_device(dev);
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+ int ret, i;
+
+ /* Pairs with smp_store_release from cfg_setup_fn in system_wq context */
+ if (!smp_load_acquire(&drvdata->gamepad_registered))
+ return -ENODEV;
+
+ ret = claw_hw_output_report(hdev, CLAW_COMMAND_TYPE_READ_GAMEPAD_MODE, NULL, 0, 25);
+ if (ret)
+ return ret;
+
+ scoped_guard(spinlock_irqsave, &drvdata->mode_lock)
+ i = drvdata->gamepad_mode;
+
+ if (!claw_gamepad_mode_text[i] || claw_gamepad_mode_text[i][0] == '\0')
+ return sysfs_emit(buf, "unsupported\n");
+
+ return sysfs_emit(buf, "%s\n", claw_gamepad_mode_text[i]);
+}
+static DEVICE_ATTR_RW(gamepad_mode);
+
+static ssize_t gamepad_mode_index_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ ssize_t count = 0;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(claw_gamepad_mode_text); i++) {
+ if (!claw_gamepad_mode_text[i] || claw_gamepad_mode_text[i][0] == '\0')
+ continue;
+ count += sysfs_emit_at(buf, count, "%s ", claw_gamepad_mode_text[i]);
+ }
+
+ if (count)
+ buf[count - 1] = '\n';
+
+ return count;
+}
+static DEVICE_ATTR_RO(gamepad_mode_index);
+
+static ssize_t mkeys_function_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct hid_device *hdev = to_hid_device(dev);
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+ int i, ret = -EINVAL;
+ u8 data[2];
+
+ /* Pairs with smp_store_release from cfg_setup_fn in system_wq context */
+ if (!smp_load_acquire(&drvdata->gamepad_registered))
+ return -ENODEV;
+
+ for (i = 0; i < ARRAY_SIZE(claw_mkeys_function_text); i++) {
+ if (claw_mkeys_function_text[i] && sysfs_streq(buf, claw_mkeys_function_text[i])) {
+ ret = i;
+ break;
+ }
+ }
+ if (ret < 0)
+ return ret;
+
+ scoped_guard(spinlock_irqsave, &drvdata->mode_lock)
+ data[0] = drvdata->gamepad_mode;
+ data[1] = ret;
+
+ ret = claw_hw_output_report(hdev, CLAW_COMMAND_TYPE_SWITCH_MODE, data, ARRAY_SIZE(data), 0);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+static ssize_t mkeys_function_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct hid_device *hdev = to_hid_device(dev);
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+ int ret, i;
+
+ /* Pairs with smp_store_release from cfg_setup_fn in system_wq context */
+ if (!smp_load_acquire(&drvdata->gamepad_registered))
+ return -ENODEV;
+
+ ret = claw_hw_output_report(hdev, CLAW_COMMAND_TYPE_READ_GAMEPAD_MODE, NULL, 0, 25);
+ if (ret)
+ return ret;
+
+ scoped_guard(spinlock_irqsave, &drvdata->mode_lock)
+ i = drvdata->mkeys_function;
+
+ if (i >= ARRAY_SIZE(claw_mkeys_function_text))
+ return sysfs_emit(buf, "unsupported\n");
+
+ return sysfs_emit(buf, "%s\n", claw_mkeys_function_text[i]);
+}
+static DEVICE_ATTR_RW(mkeys_function);
+
+static ssize_t mkeys_function_index_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int i, count = 0;
+
+ for (i = 0; i < ARRAY_SIZE(claw_mkeys_function_text); i++)
+ count += sysfs_emit_at(buf, count, "%s ", claw_mkeys_function_text[i]);
+
+ if (count)
+ buf[count - 1] = '\n';
+
+ return count;
+}
+static DEVICE_ATTR_RO(mkeys_function_index);
+
+static ssize_t reset_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct hid_device *hdev = to_hid_device(dev);
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+ bool val;
+ int ret;
+
+ /* Pairs with smp_store_release from cfg_setup_fn in system_wq context */
+ if (!smp_load_acquire(&drvdata->gamepad_registered))
+ return -ENODEV;
+
+ ret = kstrtobool(buf, &val);
+ if (ret)
+ return ret;
+
+ if (!val)
+ return -EINVAL;
+
+ ret = claw_hw_output_report(hdev, CLAW_COMMAND_TYPE_RESET_DEVICE, NULL, 0, 0);
+ if (ret)
+ return ret;
+
+ return count;
+}
+static DEVICE_ATTR_WO(reset);
+
+static umode_t claw_gamepad_attr_is_visible(struct kobject *kobj, struct attribute *attr,
+ int n)
+{
+ struct hid_device *hdev = to_hid_device(kobj_to_dev(kobj));
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+
+ if (!drvdata) {
+ dev_warn(&hdev->dev,
+ "Failed to get drvdata from kobj. Gamepad attributes are not available.\n");
+ return 0;
+ }
+
+ return attr->mode;
+}
+
+static struct attribute *claw_gamepad_attrs[] = {
+ &dev_attr_gamepad_mode.attr,
+ &dev_attr_gamepad_mode_index.attr,
+ &dev_attr_mkeys_function.attr,
+ &dev_attr_mkeys_function_index.attr,
+ &dev_attr_reset.attr,
+ NULL,
+};
+
+static const struct attribute_group claw_gamepad_attr_group = {
+ .attrs = claw_gamepad_attrs,
+ .is_visible = claw_gamepad_attr_is_visible,
+};
+
+static void cfg_setup_fn(struct work_struct *work)
+{
+ struct delayed_work *dwork = container_of(work, struct delayed_work, work);
+ struct claw_drvdata *drvdata = container_of(dwork, struct claw_drvdata, cfg_setup);
+ int ret;
+
+ ret = claw_hw_output_report(drvdata->hdev, CLAW_COMMAND_TYPE_READ_GAMEPAD_MODE,
+ NULL, 0, 25);
+ if (ret) {
+ dev_err(&drvdata->hdev->dev,
+ "Failed to setup device, can't read gamepad mode: %d\n", ret);
+ return;
+ }
+
+ /* Add sysfs attributes after we get the device state */
+ ret = devm_device_add_group(&drvdata->hdev->dev, &claw_gamepad_attr_group);
+ if (ret) {
+ dev_err(&drvdata->hdev->dev,
+ "Failed to setup device, can't create gamepad attrs: %d\n", ret);
+ return;
+ }
+ /* Pairs with smp_load_acquire in attribute show/store functions */
+ smp_store_release(&drvdata->gamepad_registered, true);
+
+ kobject_uevent(&drvdata->hdev->dev.kobj, KOBJ_CHANGE);
+}
+
+static void cfg_resume_fn(struct work_struct *work)
+{
+ struct delayed_work *dwork = container_of(work, struct delayed_work, work);
+ struct claw_drvdata *drvdata = container_of(dwork, struct claw_drvdata, cfg_resume);
+
+ /* Pairs with smp_store_release from cfg_setup_fn in system_wq context */
+ if (!smp_load_acquire(&drvdata->gamepad_registered))
+ schedule_delayed_work(&drvdata->cfg_setup, msecs_to_jiffies(500));
+}
+
+static int claw_probe(struct hid_device *hdev, u8 ep)
+{
+ struct claw_drvdata *drvdata;
+ int ret;
+
+ drvdata = devm_kzalloc(&hdev->dev, sizeof(*drvdata), GFP_KERNEL);
+ if (!drvdata)
+ return -ENOMEM;
+
+ drvdata->gamepad_mode = CLAW_GAMEPAD_MODE_XINPUT;
+ drvdata->hdev = hdev;
+ drvdata->ep = ep;
+
+ mutex_init(&drvdata->cfg_mutex);
+ spin_lock_init(&drvdata->cmd_lock);
+ spin_lock_init(&drvdata->mode_lock);
+ init_completion(&drvdata->send_cmd_complete);
+ INIT_DELAYED_WORK(&drvdata->cfg_resume, &cfg_resume_fn);
+ INIT_DELAYED_WORK(&drvdata->cfg_setup, &cfg_setup_fn);
+
+ /* For control interface: open the HID transport for sending commands. */
+ ret = hid_hw_open(hdev);
+ if (ret)
+ return ret;
+
+ hid_set_drvdata(hdev, drvdata);
+ schedule_delayed_work(&drvdata->cfg_setup, msecs_to_jiffies(500));
+
+ return 0;
+}
+
+static int msi_probe(struct hid_device *hdev, const struct hid_device_id *id)
+{
+ int ret;
+ u8 ep;
+
+ if (!hid_is_usb(hdev)) {
+ ret = -ENODEV;
+ goto err_probe;
+ }
+
+ ret = hid_parse(hdev);
+ if (ret)
+ goto err_probe;
+
+ /* Set quirk to create separate input devices per HID application */
+ hdev->quirks |= HID_QUIRK_INPUT_PER_APP | HID_QUIRK_MULTI_INPUT;
+ ret = hid_hw_start(hdev, HID_CONNECT_DEFAULT);
+ if (ret)
+ goto err_probe;
+
+ /* For non-control interfaces (keyboard/mouse), allow userspace to grab the devices. */
+ ret = get_endpoint_address(hdev);
+ if (ret < 0)
+ goto err_stop_hw;
+
+ ep = ret;
+ if (ep == CLAW_XINPUT_CFG_INTF_IN || ep == CLAW_DINPUT_CFG_INTF_IN) {
+ ret = claw_probe(hdev, ep);
+ if (ret)
+ goto err_stop_hw;
+ }
+
+ return 0;
+
+err_stop_hw:
+ hid_hw_stop(hdev);
+err_probe:
+ return dev_err_probe(&hdev->dev, ret, "Failed to init device\n");
+}
+
+static void claw_remove(struct hid_device *hdev)
+{
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+
+ if (!drvdata) {
+ hid_hw_close(hdev);
+ return;
+ }
+
+ cancel_delayed_work_sync(&drvdata->cfg_resume);
+ cancel_delayed_work_sync(&drvdata->cfg_setup);
+
+ /* Pairs with smp_load_acquire in attribute show/store functions */
+ smp_store_release(&drvdata->gamepad_registered, false);
+
+ hid_hw_close(hdev);
+}
+
+static void msi_remove(struct hid_device *hdev)
+{
+ int ret;
+ u8 ep;
+
+ /* Safe assumption. SET_INTERFACE ioctl can't be used while driver is bound */
+ ret = get_endpoint_address(hdev);
+ if (ret <= 0)
+ goto hw_stop;
+
+ ep = ret;
+ if (ep == CLAW_XINPUT_CFG_INTF_IN || ep == CLAW_DINPUT_CFG_INTF_IN)
+ claw_remove(hdev);
+
+hw_stop:
+ hid_hw_stop(hdev);
+}
+
+static int claw_resume(struct hid_device *hdev)
+{
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+
+ if (!drvdata)
+ return -ENODEV;
+
+ /* MCU can take up to 500ms to be ready after resume */
+ schedule_delayed_work(&drvdata->cfg_resume, msecs_to_jiffies(500));
+ return 0;
+}
+
+static int msi_resume(struct hid_device *hdev)
+{
+ int ret;
+ u8 ep;
+
+ /* Safe assumption. SET_INTERFACE ioctl can't be used while driver is bound */
+ ret = get_endpoint_address(hdev);
+ if (ret <= 0)
+ return 0;
+
+ ep = ret;
+ if (ep == CLAW_XINPUT_CFG_INTF_IN || ep == CLAW_DINPUT_CFG_INTF_IN)
+ return claw_resume(hdev);
+
+ return 0;
+}
+
+static int claw_suspend(struct hid_device *hdev)
+{
+ struct claw_drvdata *drvdata = hid_get_drvdata(hdev);
+
+ if (!drvdata)
+ return -ENODEV;
+
+ cancel_delayed_work_sync(&drvdata->cfg_resume);
+ cancel_delayed_work_sync(&drvdata->cfg_setup);
+
+ return 0;
+}
+
+static int msi_suspend(struct hid_device *hdev, pm_message_t msg)
+{
+ int ret;
+ u8 ep;
+
+ /* Safe assumption. SET_INTERFACE ioctl can't be used while driver is bound */
+ ret = get_endpoint_address(hdev);
+ if (ret <= 0)
+ return 0;
+
+ ep = ret;
+ if (ep == CLAW_XINPUT_CFG_INTF_IN || ep == CLAW_DINPUT_CFG_INTF_IN)
+ return claw_suspend(hdev);
+
+ return 0;
+}
+
+static const struct hid_device_id msi_devices[] = {
+ { HID_USB_DEVICE(USB_VENDOR_ID_MSI_2, USB_DEVICE_ID_MSI_CLAW_XINPUT) },
+ { HID_USB_DEVICE(USB_VENDOR_ID_MSI_2, USB_DEVICE_ID_MSI_CLAW_DINPUT) },
+ { HID_USB_DEVICE(USB_VENDOR_ID_MSI_2, USB_DEVICE_ID_MSI_CLAW_DESKTOP) },
+ { HID_USB_DEVICE(USB_VENDOR_ID_MSI_2, USB_DEVICE_ID_MSI_CLAW_BIOS) },
+ { }
+};
+MODULE_DEVICE_TABLE(hid, msi_devices);
+
+static struct hid_driver msi_driver = {
+ .name = "hid-msi",
+ .id_table = msi_devices,
+ .raw_event = msi_raw_event,
+ .probe = msi_probe,
+ .remove = msi_remove,
+ .resume = msi_resume,
+ .suspend = pm_ptr(msi_suspend),
+};
+module_hid_driver(msi_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Denis Benato <denis.benato@linux.dev>");
+MODULE_AUTHOR("Zhouwang Huang <honjow311@gmail.com>");
+MODULE_AUTHOR("Derek J. Clark <derekjohn.clark@gmail.com>");
+MODULE_DESCRIPTION("HID driver for MSI Claw Handheld PC gamepads");
--
2.53.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox