Linux Documentation
 help / color / mirror / Atom feed
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-23  5:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajnf5Z9nWZxoLS4x@google.com>

On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> On Mon, Jun 22, 2026, Yan Zhao wrote:
> > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> > > From: Ackerley Tng <ackerleytng@google.com>
> > > 
> > > Update tdx_gmem_post_populate() to handle cases where a source page is
> > > not explicitly provided. Instead of returning -EOPNOTSUPP when src_page
> > > is NULL, default to using the page associated with the destination PFN.
> > > 
> > > This change allows for in-place memory conversion where the data is
> > > already present in the target PFN, ensuring the TDX module has a valid
> > > source page reference for the TDH.MEM.PAGE.ADD operation.
> > > 
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  Documentation/virt/kvm/x86/intel-tdx.rst |  4 ++++
> > >  arch/x86/kvm/vmx/tdx.c                   | 11 ++++++++---
> > >  2 files changed, 12 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > index 6a222e9d09541..74357fe87f9ec 100644
> > > --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> > > +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > @@ -158,6 +158,10 @@ KVM_TDX_INIT_MEM_REGION
> > >  Initialize @nr_pages TDX guest private memory starting from @gpa with userspace
> > >  provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned.
> > >  
> > > +If guest_memfd in-place conversion is enabled, pass NULL for @source_addr to
> > > +initialize the memory region using memory contents already populated in
> > > +guest_memfd memory.
> > > +
> > >  Note, before calling this sub command, memory attribute of the range
> > >  [gpa, gpa + nr_pages] needs to be private.  Userspace can use
> > >  KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index ffe9d0db58c59..56d10333c61a7 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> > >  		return -EIO;
> > >  
> > > -	if (!src_page)
> > > -		return -EOPNOTSUPP;
> > > +	if (!src_page) {
> > > +		if (!gmem_in_place_conversion)
> > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> > without the MMAP flag, the absence of src_page should still be treated as an
> > error.
> 
> Why MMAP?
Hmm, I was showing a scenario that in-place conversion couldn't occur.
I didn't mean that with the MMAP flag, mmap() and user write must occur.

> Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
> because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> and written memory.  And when write() lands, MMAP wouldn't be necessary to
> initialize the memory.
Do you mean using up-to-date flag as below?

if (!src_page) {
	src_page = pfn_to_page(pfn);
	if (!folio_test_uptodate(page_folio(src_page)))
		return -EOPNOTSUPP;
}

One concern is that TDX now does not much care about the up-to-date flag since
TDX doesn't rely on the flag to clear pages on conversions.
I'm not sure if the flag can be reliably checked in this case. e.g.,
now the whole folio is marked up-to-date even if only part of it is faulted by
user access.
Ensuring that the up-to-date flag works correctly with huge page support seems
to have more effort than introducing a dedicated flag for TDX.

> > Additionally, to properly enable in-place copying for the TDX initial memory
> > region, userspace must not only specify source_addr to NULL, but also follow
> > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> > 1. create guest_memfd with MMAP flag
> > 2. mmap the guest_memfd.
> > 3. convert the initial memory range to shared.
> > 4. copy initial content to the source page.
> > 5. convert the initial memory range to private
> > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> > 7. do not unmap the source backend.
> > 
> > So, would it be reasonable to introduce a dedicated flag that allows userspace
> > to explicitly opt into the in-place copy functionality? e.g.,
> 
> Why?  It's userspace's responsibility to get the above right.  If userspace fails
> to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
I mean if userspace specifies a NULL source_addr by mistake, it's better for
kernel to detect this mistake, similar to how it validates whether source_addr
is PAGE_ALIGNED.
Since userspace already needs to perform additional steps to enable in-place
copy, specifying a dedicated flag to indicate that the NULL source_addr is
intentional seems like a reasonable burden.

^ permalink raw reply

* [PATCH 1/3] Documentation/hwmon: Add onsemi's FD5121 controllers' documentation
From: Selvamani Rajagopal via B4 Relay @ 2026-06-23  5:55 UTC (permalink / raw)
  To: Guenter Roeck, Jonathan Corbet, Shuah Khan, Selva Rajagopal,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley
  Cc: linux-hwmon, linux-doc, linux-kernel, devicetree,
	Selvamani Rajagopal
In-Reply-To: <20260622-support-fd5121-from-onsemi-v1-0-b31767689c65@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

Document the hardware monitoring support for the FD5121, FD5123, and
FD5125 devices.

Documentation describes the supported telemetry data exposed via
the sysfs for the hwmon subsystem, including voltage, current,
power and temperature measurements.

Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
 Documentation/hwmon/fd5121.rst | 93 ++++++++++++++++++++++++++++++++++++++++++
 Documentation/hwmon/index.rst  |  1 +
 2 files changed, 94 insertions(+)

diff --git a/Documentation/hwmon/fd5121.rst b/Documentation/hwmon/fd5121.rst
new file mode 100644
index 000000000000..c279db4641e4
--- /dev/null
+++ b/Documentation/hwmon/fd5121.rst
@@ -0,0 +1,93 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+Kernel driver fd5121
+=====================
+
+Supported chips:
+
+  * onsemi FD5121
+
+    Prefix: 'fd5121'
+
+    Datasheet: Datasheet is not publicly available.
+
+  * onsemi FD5123
+
+    Prefix: 'fd5121'
+
+    Datasheet: Datasheet is not publicly available.
+
+  * onsemi FD5125
+
+    Prefix: 'fd5121'
+
+    Datasheet: Datasheet is not publicly available.
+
+Author: Selva Rajagopal <selvamani.rajagopal@onsemi.com>
+
+Description
+-----------
+
+FD5121, FD5123, FD5125 are dual-rail, multi-phase controllers
+and compliant to
+
+  - AVS and AMD proprietary SVI3 protocol.
+  - PMBus rev 1.4.1 interface.
+
+Used as multi-phase voltage regulators for CPUs, high performance
+ASICs, SoCs or graphic cores.
+
+Gives full telemetry options including input/output voltage
+and current, as well as fault handling and identifications.
+
+Usage Notes
+-----------
+
+This driver does not probe for PMBus devices. You will have
+to instantiate devices explicitly.
+
+Example: the following commands will load the driver for the
+controller at address 0x50 on I2C bus #1::
+
+    # modprobe fd5121
+    # echo fd5121 0x50 > /sys/bus/i2c/devices/i2c-1/new_device
+
+It can also be instantiated by declaring in device tree
+
+Sysfs attributes
+----------------
+
+The following attributes are supported:
+
+======================  ====================================
+curr[1-2]_label		"iin[1-2]"
+curr[3-4]_label		"iout[1-2]"
+curr[1-2]_input		Measured input current.
+curr[3-4]_input		Measured output current.
+curr[1-4]_crit_alarm	Output current critical high alarm.
+curr[1-4]_max_alarm	Output current high alarm.
+
+in[1-2]_label		"vin[1-2]"
+in[3-4]_label		"vout[1-2]"
+in[1-4]_lcrit_alarm	Input voltage critical low alarm.
+in[1-4]_crit_alarm	Input voltage critical high alarm.
+in[1-2]_max_alarm	Input voltage high alarm.
+in[1-2]_input           Measured input voltage.
+in[3-4]_input           Measured output voltage.
+
+power[1-2]_label	"pin[1-2]"
+power[3-4]_label	"pout[1-2]"
+power[3-4]_crit_alarm	Output power critical high alarm.
+power[1-2]_max_alarm	Output power high alarm.
+power[1-4]_max          Power limit.
+power[1-4]_input        Measured input power.
+power[3-4]_crit         Critical maximum output power.
+
+temp[1-2]_crit_alarm	Chip temperature critical high alarm.
+temp[1-2]_max_alarm	Chip temperature high alarm.
+temp[1-2]_input         Measured temperature.
+temp[1-2]_max           Maximum temperature.
+temp[1-2]_crit          Critical high temperature.
+
+======================  ====================================
+
diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 4aa910569c31..451f5433fa60 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -79,6 +79,7 @@ Hardware Monitoring Kernel Drivers
    f71805f
    f71882fg
    fam15h_power
+   fd5121
    fsp-3y
    ftsteutates
    g760a

-- 
2.43.0



^ permalink raw reply related

* [PATCH 2/3] dt-bindings: hwmon: pmbus: Support for onsemi's FD5121
From: Selvamani Rajagopal via B4 Relay @ 2026-06-23  5:55 UTC (permalink / raw)
  To: Guenter Roeck, Jonathan Corbet, Shuah Khan, Selva Rajagopal,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley
  Cc: linux-hwmon, linux-doc, linux-kernel, devicetree,
	Selvamani Rajagopal
In-Reply-To: <20260622-support-fd5121-from-onsemi-v1-0-b31767689c65@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

Add devicetree schema for onsemi FD5121, FD5123, and
FD5125 dual rail, multi-phase digital controllers.

Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
 .../bindings/hwmon/pmbus/onnn,fd5121.yaml          | 41 ++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/Documentation/devicetree/bindings/hwmon/pmbus/onnn,fd5121.yaml b/Documentation/devicetree/bindings/hwmon/pmbus/onnn,fd5121.yaml
new file mode 100644
index 000000000000..b0453b0634f0
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/pmbus/onnn,fd5121.yaml
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/pmbus/onnn,fd5121.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: onsemi's multi-phase digital controllers
+
+maintainers:
+  - Selvamani Rajagopal <selvamani.rajagopal@onsemi.com>
+
+description:
+  onsemi multi-phase digital controllers with PMBus.
+
+properties:
+  compatible:
+    enum:
+      - onnn,fd5121
+      - onnn,fd5123
+      - onnn,fd5125
+
+  reg:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    i2c {
+      #address-cells = <1>;
+      #size-cells = <0>;
+
+      fd5121@50 {
+        compatible = "onnn,fd5121";
+        reg = <0x50>;
+      };
+    };

-- 
2.43.0



^ permalink raw reply related

* [PATCH 0/3] Support onsemi's FD5121 multiphase digital controller
From: Selvamani Rajagopal via B4 Relay @ 2026-06-23  5:55 UTC (permalink / raw)
  To: Guenter Roeck, Jonathan Corbet, Shuah Khan, Selva Rajagopal,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley
  Cc: linux-hwmon, linux-doc, linux-kernel, devicetree,
	Selvamani Rajagopal

FD5121 is a dual rail, multi-phase controller designed to
power CPU, ASIC or SoC with fully configurable rails.

This driver adds support for FD5121, FD5123 and FD5125. These
controllers configurability through PMBus 1.4.1. 

Added documents for these controllers.

Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
Selvamani Rajagopal (3):
      Documentation/hwmon: Add onsemi's FD5121 controllers' documentation
      dt-bindings: hwmon: pmbus: Support for onsemi's FD5121
      hwmon: (pmbus/fd5121): Add support FD5121, FD5123 and FD5125

 .../bindings/hwmon/pmbus/onnn,fd5121.yaml          |   41 +
 Documentation/hwmon/fd5121.rst                     |   93 ++
 Documentation/hwmon/index.rst                      |    1 +
 MAINTAINERS                                        |    8 +
 drivers/hwmon/pmbus/Kconfig                        |    9 +
 drivers/hwmon/pmbus/Makefile                       |    1 +
 drivers/hwmon/pmbus/fd5121.c                       | 1004 ++++++++++++++++++++
 7 files changed, 1157 insertions(+)
---
base-commit: 1a3746ccbb0a97bed3c06ccde6b880013b1dddc1
change-id: 20260622-support-fd5121-from-onsemi-6fa9f98b5bf0

Best regards,
-- 
Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>



^ permalink raw reply

* [PATCH 3/3] hwmon: (pmbus/fd5121): Add support FD5121, FD5123 and FD5125
From: Selvamani Rajagopal via B4 Relay @ 2026-06-23  5:55 UTC (permalink / raw)
  To: Guenter Roeck, Jonathan Corbet, Shuah Khan, Selva Rajagopal,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley
  Cc: linux-hwmon, linux-doc, linux-kernel, devicetree,
	Selvamani Rajagopal
In-Reply-To: <20260622-support-fd5121-from-onsemi-v1-0-b31767689c65@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

FD5121 is a dual-rail, multi-phase, digital controller that offers
full telemtry options including input/output voltage, current as
well as fault handling and identifications.

These controllers are compliant with PMBus specification.

Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
 MAINTAINERS                  |    8 +
 drivers/hwmon/pmbus/Kconfig  |    9 +
 drivers/hwmon/pmbus/Makefile |    1 +
 drivers/hwmon/pmbus/fd5121.c | 1004 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1022 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index d95d3ef77773..c0664c33324a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20135,6 +20135,14 @@ L:	linux-mips@vger.kernel.org
 S:	Maintained
 F:	arch/mips/boot/dts/ralink/omega2p.dts
 
+ONSEMI HARDWARE MONITOR DRIVER
+M:	Selva Rajagopal <selvamani.rajagopal@onsemi.com>
+L:	linux-hwmon@vger.kernel.org
+S:	Supported
+W:	https://www.onsemi.com
+F:	Documentation/devicetree/bindings/hwmon/pmbus/onnn,fd5121.yaml
+F:	drivers/hwmon/pmbus/fd5121.c
+
 ONSEMI ETHERNET PHY DRIVERS
 M:	Piergiorgio Beruto <piergiorgio.beruto@gmail.com>
 L:	netdev@vger.kernel.org
diff --git a/drivers/hwmon/pmbus/Kconfig b/drivers/hwmon/pmbus/Kconfig
index c8cda160b5f8..3a06ed83539e 100644
--- a/drivers/hwmon/pmbus/Kconfig
+++ b/drivers/hwmon/pmbus/Kconfig
@@ -179,6 +179,15 @@ config SENSORS_E50SN12051
 	  This driver can also be built as a module. If so, the module will
 	  be called e50sn12051.
 
+config SENSORS_FD5121
+	tristate "FD5121/FD5123/FD5125 controllers from onsemi"
+	help
+	  If you say yes here, you get support for onsemi
+	  controllers FD5121, FD5123, FD5125.
+
+	  This driver can also be built as a module. If so, the module will
+	  be called fd5121.
+
 config SENSORS_INA233
 	tristate "Texas Instruments INA233 and compatibles"
 	help
diff --git a/drivers/hwmon/pmbus/Makefile b/drivers/hwmon/pmbus/Makefile
index ffc05f493213..70f4afb41fe0 100644
--- a/drivers/hwmon/pmbus/Makefile
+++ b/drivers/hwmon/pmbus/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_SENSORS_APS_379)	+= aps-379.o
 obj-$(CONFIG_SENSORS_BEL_PFE)	+= bel-pfe.o
 obj-$(CONFIG_SENSORS_BPA_RS600)	+= bpa-rs600.o
 obj-$(CONFIG_SENSORS_DELTA_AHE50DC_FAN) += delta-ahe50dc-fan.o
+obj-$(CONFIG_SENSORS_FD5121)	+= fd5121.o
 obj-$(CONFIG_SENSORS_FSP_3Y)	+= fsp-3y.o
 obj-$(CONFIG_SENSORS_HAC300S)	+= hac300s.o
 obj-$(CONFIG_SENSORS_IBM_CFFPS)	+= ibm-cffps.o
diff --git a/drivers/hwmon/pmbus/fd5121.c b/drivers/hwmon/pmbus/fd5121.c
new file mode 100644
index 000000000000..e68c6d6cabbd
--- /dev/null
+++ b/drivers/hwmon/pmbus/fd5121.c
@@ -0,0 +1,1004 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2026 Semiconductor Components Industries, LLC ("onsemi").
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+#include <linux/i2c.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/unaligned.h>
+
+#include "pmbus.h"
+
+enum chips { chip_fd5121, chip_fd5123, chip_fd5125 };
+
+#define CTLR_ID_UNKNOWN				0
+#define CTLR_ID_FD5121				0xFD5121
+#define CTLR_ID_FD5123				0xFD5123
+#define CTLR_ID_FD5125				0xFD5125
+
+#define FD5121_NUM_PAGES			2
+
+/* Custom PMBUS commands */
+#define PMBUS_REG_VOUT_MIN			0x2B
+#define PMBUS_REG_POWER_MODE			0x34
+#define PMBUS_REG_VIN_ON			0x35
+#define PMBUS_REG_VIN_OFF			0x36
+#define PMBUS_REG_VIN_UV_FAULT_RESPONSE		0x5A
+#define PMBUS_REG_IIN_OC_FAULT_RESPONSE		0x5C
+#define PMBUS_REG_TON_DELAY			0x60
+#define PMBUS_REG_POUT_OP_FAULT_RESPONSE	0x69
+#define PMBUS_REG_READ_VAUX			0x85
+
+#define PMBUS_REG_IKNEE_SET			0x2D
+#define PMBUS_REG_PIN_COUNTER			0x2E
+#define PMBUS_REG_VMIN_AWARE			0x2F
+#define PMBUS_REG_VAUX_UV_FAULT_LIMIT		0x6C
+#define PMBUS_REG_VAUX_OV_FAULT_LIMIT		0x6D
+#define PMBUS_REG_VAUX_UV_FAULT_RESPONSE	0x6E
+#define PMBUS_REG_VAUX_OV_FAULT_RESPONSE	0x6F
+#define PMBUS_REG_VAUX_UV_WARNING		0x75
+#define PMBUS_REG_VAUX_OV_WARNING		0x76
+#define PMBUS_REG_MFR_FREE_USER_CONFIG_TABLES	0xCF
+#define PMBUS_REG_MFR_ADDRESS_TABLE		0xD0
+#define PMBUS_REG_MFR_STATUS_ONSEMI		0xD1
+#define PMBUS_REG_MFR_UNLOCK			0xD2
+#define PMBUS_REG_MFR_FAULTY_SPS		0xD3
+#define PMBUS_REG_TLVR_FAULTS			0xD4
+#define PMBUS_REG_MFR_USER_STORE_CONFIG_TAB	0xD5
+#define PMBUS_REG_MFR_USER_CONFIG_INDEX		0xD6
+#define PMBUS_REG_MFR_PWM_DISCONNECTION		0xD7
+#define PMBUS_REG_MFR_VR_DISCONNECTION		0xD8
+#define PMBUS_REG_MFR_TON_SLEW			0xD9
+#define PMBUS_REG_MFR_TOFF_SLEW			0xDA
+#define PMBUS_REG_MFR_RAIL_NAME			0xDB
+#define PMBUS_REG_MFR_VOUT_DROOP		0xDC
+#define PMBUS_REG_MFR_USER_RESTORE_CONFIG_TAB	0xDD
+#define PMBUS_REG_MFR_SVR_GO			0xDE
+#define PMBUS_REG_MFR_SET_PWD			0xDF
+#define PMBUS_REG_MFR_CONFIG_ACTIVATE		0xE0
+#define PMBUS_REG_MFR_CONFIG_RECOVER		0xE1
+#define PMBUS_REG_MFR_OTP_DUMP			0xE2
+#define PMBUS_REG_MFR_BBR_EN			0xE3
+#define PMBUS_REG_MFR_DPM_MIN			0xE4
+#define PMBUS_REG_MFR_VBOOT			0xE5
+#define PMBUS_REG_MFR_PRECLAMP_OFFSET		0xE6
+#define PMBUS_REG_MFR_TLVR_DIAGN		0xE7
+#define PMBUS_REG_MFR_READ_VSYS			0xE8
+#define PMBUS_REG_MFR_SPECIFIC_E9		0xE9
+#define PMBUS_REG_MFR_SPECIFIC_EA		0xEA
+#define PMBUS_REG_MFR_SS_CBC			0xEB
+#define PMBUS_REG_MFR_AMD_STATUS		0xEC
+#define PMBUS_REG_MFR_CHECKSUM			0xEE
+#define PMBUS_REG_CSE_INDEX			0xF0
+#define PMBUS_REG_COUT_MEASURE			0xF1
+#define PMBUS_REG_VR_COUT			0xF2
+#define PMBUS_REG_BBR_RAM			0xF3
+#define PMBUS_REG_BBR_OTP			0xF4
+#define PMBUS_REG_READ_PSYS			0xF5
+#define PMBUS_REG_POSTCLAMP_OFFSET		0xF6
+#define PMBUS_REG_PGOOD_DELAY			0xF7
+#define PMBUS_REG_MFR_SPECIFIC_F8		0xF8
+#define PMBUS_REG_MFR_SPECIFIC_F9		0xF9
+#define PMBUS_REG_MFR_PWD_PROGRAM_RAM		0xFA
+#define PMBUS_REG_MFR_PWD_PROGRAM_I2C		0xFB
+#define PMBUS_REG_MFR_PWD_ENABLE_OTP_STORE	0xFC
+
+/* List of recognized commands */
+static const u8 cc_list[] = {
+	PMBUS_PAGE,
+	PMBUS_OPERATION,
+	PMBUS_ON_OFF_CONFIG,
+	PMBUS_CLEAR_FAULTS,
+	PMBUS_WRITE_PROTECT,
+	PMBUS_CAPABILITY,
+	PMBUS_VOUT_MODE,
+	PMBUS_VOUT_COMMAND,
+	PMBUS_VOUT_MAX,
+	PMBUS_VOUT_MARGIN_HIGH,
+	PMBUS_VOUT_MARGIN_LOW,
+	PMBUS_VOUT_TRANSITION_RATE,
+	PMBUS_REG_VOUT_MIN,
+	PMBUS_REG_IKNEE_SET,
+	PMBUS_REG_PIN_COUNTER,
+	PMBUS_REG_VMIN_AWARE,
+	PMBUS_REG_POWER_MODE,
+	PMBUS_REG_VIN_ON,
+	PMBUS_REG_VIN_OFF,
+	PMBUS_VOUT_OV_FAULT_LIMIT,
+	PMBUS_VOUT_OV_FAULT_RESPONSE,
+	PMBUS_VOUT_UV_FAULT_LIMIT,
+	PMBUS_VOUT_UV_FAULT_RESPONSE,
+	PMBUS_IOUT_OC_FAULT_LIMIT,
+	PMBUS_IOUT_OC_FAULT_RESPONSE,
+	PMBUS_IOUT_OC_WARN_LIMIT,
+	PMBUS_OT_FAULT_LIMIT,
+	PMBUS_OT_FAULT_RESPONSE,
+	PMBUS_OT_WARN_LIMIT,
+	PMBUS_VIN_OV_FAULT_LIMIT,
+	PMBUS_VIN_OV_FAULT_RESPONSE,
+	PMBUS_VIN_OV_WARN_LIMIT,
+	PMBUS_VIN_UV_WARN_LIMIT,
+	PMBUS_VIN_UV_FAULT_LIMIT,
+	PMBUS_REG_VIN_UV_FAULT_RESPONSE,
+	PMBUS_IIN_OC_FAULT_LIMIT,
+	PMBUS_REG_IIN_OC_FAULT_RESPONSE,
+	PMBUS_IIN_OC_WARN_LIMIT,
+	PMBUS_REG_TON_DELAY,
+	PMBUS_POUT_OP_FAULT_LIMIT,
+	PMBUS_REG_POUT_OP_FAULT_RESPONSE,
+	PMBUS_POUT_OP_WARN_LIMIT,
+	PMBUS_PIN_OP_WARN_LIMIT,
+	PMBUS_REG_VAUX_UV_FAULT_LIMIT,
+	PMBUS_REG_VAUX_OV_FAULT_LIMIT,
+	PMBUS_REG_VAUX_UV_FAULT_RESPONSE,
+	PMBUS_REG_VAUX_OV_FAULT_RESPONSE,
+	PMBUS_REG_VAUX_UV_WARNING,
+	PMBUS_REG_VAUX_OV_WARNING,
+	PMBUS_STATUS_BYTE,
+	PMBUS_STATUS_WORD,
+	PMBUS_STATUS_VOUT,
+	PMBUS_STATUS_IOUT,
+	PMBUS_STATUS_INPUT,
+	PMBUS_STATUS_TEMPERATURE,
+	PMBUS_STATUS_CML,
+	PMBUS_STATUS_OTHER,
+	PMBUS_STATUS_MFR_SPECIFIC,
+	PMBUS_REG_READ_VAUX,
+	PMBUS_READ_VIN,
+	PMBUS_READ_IIN,
+	PMBUS_READ_VOUT,
+	PMBUS_READ_IOUT,
+	PMBUS_READ_TEMPERATURE_1,
+	PMBUS_READ_POUT,
+	PMBUS_READ_PIN,
+	PMBUS_REVISION,
+	PMBUS_MFR_ID,
+	PMBUS_MFR_MODEL,
+	PMBUS_MFR_REVISION,
+	PMBUS_IC_DEVICE_ID,
+	PMBUS_REG_MFR_FREE_USER_CONFIG_TABLES,
+	PMBUS_REG_MFR_ADDRESS_TABLE,
+	PMBUS_REG_MFR_STATUS_ONSEMI,
+	PMBUS_REG_MFR_UNLOCK,
+	PMBUS_REG_MFR_FAULTY_SPS,
+	PMBUS_REG_TLVR_FAULTS,
+	PMBUS_REG_MFR_USER_STORE_CONFIG_TAB,
+	PMBUS_REG_MFR_USER_CONFIG_INDEX,
+	PMBUS_REG_MFR_PWM_DISCONNECTION,
+	PMBUS_REG_MFR_VR_DISCONNECTION,
+	PMBUS_REG_MFR_TON_SLEW,
+	PMBUS_REG_MFR_TOFF_SLEW,
+	PMBUS_REG_MFR_RAIL_NAME,
+	PMBUS_REG_MFR_VOUT_DROOP,
+	PMBUS_REG_MFR_USER_RESTORE_CONFIG_TAB,
+	PMBUS_REG_MFR_SVR_GO,
+	PMBUS_REG_MFR_SET_PWD,
+	PMBUS_REG_MFR_CONFIG_ACTIVATE,
+	PMBUS_REG_MFR_CONFIG_RECOVER,
+	PMBUS_REG_MFR_OTP_DUMP,
+	PMBUS_REG_MFR_BBR_EN,
+	PMBUS_REG_MFR_DPM_MIN,
+	PMBUS_REG_MFR_VBOOT,
+	PMBUS_REG_MFR_PRECLAMP_OFFSET,
+	PMBUS_REG_MFR_TLVR_DIAGN,
+	PMBUS_REG_MFR_READ_VSYS,
+	PMBUS_REG_MFR_SPECIFIC_E9,
+	PMBUS_REG_MFR_SPECIFIC_EA,
+	PMBUS_REG_MFR_SS_CBC,
+	PMBUS_REG_MFR_AMD_STATUS,
+	PMBUS_REG_MFR_CHECKSUM,
+	PMBUS_REG_CSE_INDEX,
+	PMBUS_REG_COUT_MEASURE,
+	PMBUS_REG_VR_COUT,
+	PMBUS_REG_BBR_RAM,
+	PMBUS_REG_BBR_OTP,
+	PMBUS_REG_READ_PSYS,
+	PMBUS_REG_POSTCLAMP_OFFSET,
+	PMBUS_REG_PGOOD_DELAY,
+	PMBUS_REG_MFR_SPECIFIC_F8,
+	PMBUS_REG_MFR_SPECIFIC_F9,
+	PMBUS_REG_MFR_PWD_PROGRAM_RAM,
+	PMBUS_REG_MFR_PWD_PROGRAM_I2C,
+	PMBUS_REG_MFR_PWD_ENABLE_OTP_STORE,
+};
+
+/* Following registers expect block read */
+static const u8 blk_rd_cc[] = {
+	PMBUS_SMBALERT_MASK,
+	PMBUS_MFR_DATE,
+	PMBUS_IC_DEVICE_REV,
+};
+
+struct fd5121_data {
+	struct attribute_group *groups[3];
+	struct pmbus_driver_info info;
+	struct device *dev;
+	u32 id;
+};
+
+static s32 fd5121_read_block_data(const struct i2c_client *client,
+				  u8 cmd_code, u8 len, u8 *pbuf)
+{
+	s32 ret = 0;
+
+	if (!i2c_check_functionality(client->adapter,
+				     I2C_FUNC_SMBUS_READ_BLOCK_DATA)) {
+
+		/* Payload length is in the first byte. */
+		ret = i2c_smbus_read_i2c_block_data(client, cmd_code,
+						    len, pbuf);
+		if (ret < 0)
+			return ret;
+		ret = pbuf[0];
+		if (ret > len)
+			ret = len;
+		for (int idx = 0; idx < ret; idx++)
+			pbuf[idx] = pbuf[idx + 1];
+		return ret;
+	}
+	ret = i2c_smbus_read_block_data(client, cmd_code, pbuf);
+	if (ret < 0)
+		return ret;
+	return min_t(s32, ret, len);
+}
+
+/* Command code that expects block read, not word read */
+static bool fd5121_blk_rd_reg(u8 cmd_code)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_rd_cc); i++) {
+		if (cmd_code == blk_rd_cc[i])
+			return true;
+	}
+	return false;
+}
+
+static ssize_t fd5121_send_byte_store(struct device *dev,
+				      struct device_attribute *da,
+				      const char *buf, size_t count)
+{
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 val = 0;
+	int ret;
+
+	ret = kstrtou8(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+	ret = i2c_smbus_write_byte(client, val);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static int fd5121_config_activate(struct i2c_client *client)
+{
+	return i2c_smbus_write_byte_data(client,
+					 PMBUS_REG_MFR_CONFIG_ACTIVATE, 0xAA);
+}
+
+static ssize_t fd5121_byte_store(struct device *dev,
+				 struct device_attribute *da,
+				 const char *buf, size_t count)
+{
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 reg = attr->index;
+	int ret = 0;
+	u8 val = 0;
+
+	switch (reg) {
+	case PMBUS_REG_MFR_CONFIG_ACTIVATE:
+		ret = fd5121_config_activate(client);
+		if (ret < 0)
+			return ret;
+		return count;
+	default:
+		ret = kstrtou8(buf, 10, &val);
+		if (ret < 0)
+			return ret;
+		break;
+	}
+	if (reg == PMBUS_PAGE && ((val != 0 && val != 1 &&
+	    val != GENMASK(7, 0))))
+		return -EINVAL;
+	ret = i2c_smbus_write_byte_data(client, reg, val);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static ssize_t fd5121_byte_show(struct device *dev,
+				struct device_attribute *da, char *buf)
+{
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 reg = attr->index;
+	s32 ret;
+
+	ret = i2c_smbus_read_byte_data(client, reg);
+	if (ret < 0)
+		return ret;
+	return scnprintf(buf, PAGE_SIZE, "%d\n", ret & 0xFF);
+}
+
+static ssize_t fd5121_word_store(struct device *dev,
+				 struct device_attribute *da,
+				 const char *buf, size_t count)
+{
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 reg = attr->index;
+	s16 val = 0;
+	int ret = 0;
+
+	switch (reg) {
+	case PMBUS_REG_MFR_PWD_PROGRAM_RAM:
+		val = 0xC93F;
+		break;
+	default:
+		ret = kstrtos16(buf, 10, &val);
+		if (ret)
+			return ret;
+		break;
+	}
+	ret = i2c_smbus_write_word_data(client, reg, val);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static ssize_t fd5121_word_show(struct device *dev,
+				struct device_attribute *da, char *buf)
+{
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 data[I2C_SMBUS_BLOCK_MAX] = { 0 };
+	u8 reg = attr->index;
+	s32 ret = 0;
+
+	if (fd5121_blk_rd_reg(reg)) {
+		ret = fd5121_read_block_data(client, reg, 2, data);
+		if (ret >= 0)
+			ret = get_unaligned_le16(data);
+	} else
+		ret = i2c_smbus_read_word_data(client, reg);
+	if (ret < 0)
+		return ret;
+	return scnprintf(buf, PAGE_SIZE, "%d\n", ret & 0xFFFF);
+}
+
+static s32 fd5121_write_block_data(const struct i2c_client *client,
+				   u8 cmd_code, u8 len, u8 *pbuf)
+{
+	s32 ret = 0;
+
+	if (!i2c_check_functionality(client->adapter,
+				     I2C_FUNC_SMBUS_WRITE_BLOCK_DATA))
+		ret = i2c_smbus_write_i2c_block_data(client, cmd_code,
+						     len, pbuf);
+	else
+		ret = i2c_smbus_write_block_data(client, cmd_code,
+						 len, pbuf);
+	return ret;
+}
+
+static s32 fd5121_read_long(struct i2c_client *client, u8 cmd_code, u32 *pval)
+{
+	u8 buffer[I2C_SMBUS_BLOCK_MAX] = { 0 };
+	s32 ret;
+
+	ret = fd5121_read_block_data(client, cmd_code, 4, buffer);
+	if (ret < 0)
+		return ret;
+	if (ret < 4)
+		return -EIO;
+
+	*pval = get_unaligned_le32(buffer);
+	return 0;
+}
+
+static ssize_t fd5121_long_store(struct device *dev,
+				 struct device_attribute *da,
+				 const char *buf, size_t count)
+{
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 reg = attr->index;
+	u8 buffer[4];
+	u32 val = 0;
+	int ret = 0;
+	u8 len;
+
+	ret = kstrtou32(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	len = (u8) sizeof(buffer);
+	for (u8 i = 0; i < len; i++) {
+		buffer[i] = val & 0xFF;
+		val >>= 8;
+	}
+	ret = fd5121_write_block_data(client, reg, len, buffer);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static ssize_t fd5121_long_show(struct device *dev,
+				struct device_attribute *da, char *buf)
+{
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	u8 reg = attr->index;
+	u32 val = 0;
+	s32 ret = 0;
+
+	ret = fd5121_read_long(client, reg, &val);
+	if (ret < 0)
+		return ret;
+	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
+}
+
+static ssize_t fd5121_block_show(struct device *dev,
+				 struct device_attribute *da, char *buf)
+{
+	struct i2c_client *client = to_i2c_client(dev->parent);
+	struct sensor_device_attribute *attr = to_sensor_dev_attr(da);
+	u8 buffer[I2C_SMBUS_BLOCK_MAX] = { 0 };
+	u8 reg = attr->index;
+	int printed = 0;
+	s32 ret = 0;
+	u8 len = 0;
+	int i = 0;
+
+	if (reg == PMBUS_REG_MFR_FAULTY_SPS) {
+		int to_print = 0;
+
+		len = 7;
+		ret = fd5121_read_block_data(client, reg, len, buffer);
+		if (ret < 0)
+			return ret;
+		printed = 0;
+		to_print = (ret < len) ? ret : len;
+		for (i = 0; i < to_print; i++)
+			printed += scnprintf(buf + printed,
+					     PAGE_SIZE - printed,
+					     "%02x", buffer[i]);
+		printed += scnprintf(buf + printed,
+				     PAGE_SIZE - printed, "\n");
+		return printed;
+	} else if (reg == PMBUS_REG_BBR_RAM ||
+		   reg == PMBUS_REG_BBR_OTP) {
+		u32 len = (reg == PMBUS_REG_BBR_OTP) ? 165 : 164;
+
+		/* Extra byte may be needed in case we need to store
+		 * the length of the data
+		 */
+		u8 *tmp_in = kcalloc(len+1, sizeof(u8), GFP_KERNEL);
+
+		if (tmp_in == NULL)
+			return -ENOMEM;
+		ret = fd5121_read_block_data(client, reg, len, tmp_in);
+		if (ret < 0) {
+			kfree(tmp_in);
+			return ret;
+		}
+
+		printed = 0;
+		for (i = 0; i < ret; i++)
+			printed += scnprintf(buf + printed,
+					     PAGE_SIZE - printed, "%02x",
+					     tmp_in[i]);
+		printed += scnprintf(buf + printed,
+				     PAGE_SIZE - printed, "\n");
+
+		kfree(tmp_in);
+		return printed;
+	} else
+		return -ENODATA;
+}
+
+static SENSOR_DEVICE_ATTR_RW(page, fd5121_byte,
+			     PMBUS_PAGE);
+static SENSOR_DEVICE_ATTR_RO(vout_raw, fd5121_word,
+			     PMBUS_READ_VOUT);
+static SENSOR_DEVICE_ATTR_RW(operation, fd5121_byte,
+			     PMBUS_OPERATION);
+static SENSOR_DEVICE_ATTR_RW(on_off_config, fd5121_byte,
+			     PMBUS_ON_OFF_CONFIG);
+static SENSOR_DEVICE_ATTR_WO(clear_faults, fd5121_byte,
+			     PMBUS_CLEAR_FAULTS);
+static SENSOR_DEVICE_ATTR_RW(write_protect, fd5121_byte,
+			     PMBUS_WRITE_PROTECT);
+static SENSOR_DEVICE_ATTR_RO(capability, fd5121_byte,
+			     PMBUS_CAPABILITY);
+static SENSOR_DEVICE_ATTR_RW(smbalert_mask, fd5121_word,
+			     PMBUS_SMBALERT_MASK);
+static SENSOR_DEVICE_ATTR_RO(vout_mode, fd5121_byte,
+			     PMBUS_VOUT_MODE);
+static SENSOR_DEVICE_ATTR_RW(vout_command, fd5121_word,
+			     PMBUS_VOUT_COMMAND);
+static SENSOR_DEVICE_ATTR_RW(vout_max, fd5121_word,
+			     PMBUS_VOUT_MAX);
+static SENSOR_DEVICE_ATTR_RW(vout_margin_high, fd5121_word,
+			     PMBUS_VOUT_MARGIN_HIGH);
+static SENSOR_DEVICE_ATTR_RW(vout_margin_low, fd5121_word,
+			     PMBUS_VOUT_MARGIN_LOW);
+static SENSOR_DEVICE_ATTR_RW(vout_transition_rate, fd5121_word,
+			     PMBUS_VOUT_TRANSITION_RATE);
+static SENSOR_DEVICE_ATTR_RW(vout_min, fd5121_word,
+			     PMBUS_REG_VOUT_MIN);
+static SENSOR_DEVICE_ATTR_RW(power_mode, fd5121_byte,
+			     PMBUS_REG_POWER_MODE);
+static SENSOR_DEVICE_ATTR_RW(vin_on, fd5121_word,
+			     PMBUS_REG_VIN_ON);
+static SENSOR_DEVICE_ATTR_RW(vin_off, fd5121_word,
+			     PMBUS_REG_VIN_OFF);
+static SENSOR_DEVICE_ATTR_RW(vin_uv_fault_response, fd5121_byte,
+			     PMBUS_REG_VIN_UV_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(iin_oc_fault_response, fd5121_byte,
+			     PMBUS_REG_IIN_OC_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(ton_delay, fd5121_word,
+			     PMBUS_REG_TON_DELAY);
+static SENSOR_DEVICE_ATTR_RW(pout_op_fault_response, fd5121_byte,
+			     PMBUS_REG_POUT_OP_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RO(read_vaux, fd5121_word,
+			     PMBUS_REG_READ_VAUX);
+static SENSOR_DEVICE_ATTR_RW(iknee_set, fd5121_word,
+			     PMBUS_REG_IKNEE_SET);
+static SENSOR_DEVICE_ATTR_RW(pin_counter, fd5121_byte,
+			     PMBUS_REG_PIN_COUNTER);
+static SENSOR_DEVICE_ATTR_RW(vmin_aware, fd5121_word,
+			     PMBUS_REG_VMIN_AWARE);
+static SENSOR_DEVICE_ATTR_RW(vout_ov_fault_response, fd5121_byte,
+			     PMBUS_VOUT_OV_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(vout_uv_fault_response, fd5121_byte,
+			     PMBUS_VOUT_UV_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(iout_oc_fault_response, fd5121_byte,
+			     PMBUS_IOUT_OC_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(ot_fault_response, fd5121_byte,
+			     PMBUS_OT_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(vin_ov_fault_response, fd5121_byte,
+			     PMBUS_VIN_OV_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(vaux_uv_fault_limit, fd5121_word,
+			     PMBUS_REG_VAUX_UV_FAULT_LIMIT);
+static SENSOR_DEVICE_ATTR_RW(vaux_ov_fault_limit, fd5121_word,
+			     PMBUS_REG_VAUX_OV_FAULT_LIMIT);
+static SENSOR_DEVICE_ATTR_RW(vaux_uv_fault_response, fd5121_byte,
+			     PMBUS_REG_VAUX_UV_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(vaux_ov_fault_response, fd5121_byte,
+			     PMBUS_REG_VAUX_OV_FAULT_RESPONSE);
+static SENSOR_DEVICE_ATTR_RW(vaux_uv_warning, fd5121_word,
+			     PMBUS_REG_VAUX_UV_WARNING);
+static SENSOR_DEVICE_ATTR_RW(vaux_ov_warning, fd5121_word,
+			     PMBUS_REG_VAUX_OV_WARNING);
+static SENSOR_DEVICE_ATTR_RO(free_user_config_tables, fd5121_byte,
+			     PMBUS_REG_MFR_FREE_USER_CONFIG_TABLES);
+static SENSOR_DEVICE_ATTR_RW(address_table, fd5121_byte,
+			     PMBUS_REG_MFR_ADDRESS_TABLE);
+static SENSOR_DEVICE_ATTR_RW(status_onsemi, fd5121_word,
+			     PMBUS_REG_MFR_STATUS_ONSEMI);
+static SENSOR_DEVICE_ATTR_RO(status_byte, fd5121_byte,
+			     PMBUS_STATUS_BYTE);
+static SENSOR_DEVICE_ATTR_RO(status_cml, fd5121_byte,
+			     PMBUS_STATUS_CML);
+static SENSOR_DEVICE_ATTR_RO(status_other, fd5121_byte,
+			     PMBUS_STATUS_OTHER);
+static SENSOR_DEVICE_ATTR_RO(status_mfr_specific, fd5121_byte,
+			     PMBUS_STATUS_MFR_SPECIFIC);
+static SENSOR_DEVICE_ATTR_RO(revision, fd5121_byte,
+			     PMBUS_REVISION);
+static SENSOR_DEVICE_ATTR_RO(id, fd5121_long,
+			     PMBUS_MFR_ID);
+static SENSOR_DEVICE_ATTR_RO(model, fd5121_long,
+			     PMBUS_MFR_MODEL);
+static SENSOR_DEVICE_ATTR_RO(mfr_revision, fd5121_long,
+			     PMBUS_MFR_REVISION);
+static SENSOR_DEVICE_ATTR_RW(date, fd5121_word,
+			     PMBUS_MFR_DATE);
+static SENSOR_DEVICE_ATTR_RO(ic_device_id, fd5121_long,
+			     PMBUS_IC_DEVICE_ID);
+static SENSOR_DEVICE_ATTR_RO(ic_device_rev, fd5121_word,
+			     PMBUS_IC_DEVICE_REV);
+static SENSOR_DEVICE_ATTR_WO(unlock, fd5121_byte,
+			     PMBUS_REG_MFR_UNLOCK);
+static SENSOR_DEVICE_ATTR_RO(faulty_sps, fd5121_block,
+			     PMBUS_REG_MFR_FAULTY_SPS);
+static SENSOR_DEVICE_ATTR_RO(tlvr_faults, fd5121_word,
+			     PMBUS_REG_TLVR_FAULTS);
+static SENSOR_DEVICE_ATTR_RW(user_store_config_tab, fd5121_byte,
+			     PMBUS_REG_MFR_USER_STORE_CONFIG_TAB);
+static SENSOR_DEVICE_ATTR_RO(user_config_index, fd5121_byte,
+			     PMBUS_REG_MFR_USER_CONFIG_INDEX);
+static SENSOR_DEVICE_ATTR_RO(pwm_disconnection, fd5121_word,
+			     PMBUS_REG_MFR_PWM_DISCONNECTION);
+static SENSOR_DEVICE_ATTR_RO(vr_disconnection, fd5121_byte,
+			     PMBUS_REG_MFR_VR_DISCONNECTION);
+static SENSOR_DEVICE_ATTR_RW(ton_slew, fd5121_byte,
+			     PMBUS_REG_MFR_TON_SLEW);
+static SENSOR_DEVICE_ATTR_RW(toff_slew, fd5121_byte,
+			     PMBUS_REG_MFR_TOFF_SLEW);
+static SENSOR_DEVICE_ATTR_RW(rail_name, fd5121_word,
+			     PMBUS_REG_MFR_RAIL_NAME);
+static SENSOR_DEVICE_ATTR_RW(vout_droop, fd5121_byte,
+			     PMBUS_REG_MFR_VOUT_DROOP);
+static SENSOR_DEVICE_ATTR_WO(svr_go, fd5121_send_byte,
+			     PMBUS_REG_MFR_SVR_GO);
+static SENSOR_DEVICE_ATTR_RW(user_restore_config_tab, fd5121_byte,
+			     PMBUS_REG_MFR_USER_RESTORE_CONFIG_TAB);
+static SENSOR_DEVICE_ATTR_WO(set_pwd, fd5121_byte,
+			     PMBUS_REG_MFR_SET_PWD);
+static SENSOR_DEVICE_ATTR_RW(config_activate, fd5121_byte,
+			     PMBUS_REG_MFR_CONFIG_ACTIVATE);
+static SENSOR_DEVICE_ATTR_RW(config_recover, fd5121_byte,
+			     PMBUS_REG_MFR_CONFIG_RECOVER);
+static SENSOR_DEVICE_ATTR_RW(otp_dump, fd5121_byte,
+			     PMBUS_REG_MFR_OTP_DUMP);
+static SENSOR_DEVICE_ATTR_RW(bbr_en, fd5121_byte,
+			     PMBUS_REG_MFR_BBR_EN);
+static SENSOR_DEVICE_ATTR_RW(dpm_min, fd5121_byte,
+			     PMBUS_REG_MFR_DPM_MIN);
+static SENSOR_DEVICE_ATTR_RW(vboot, fd5121_word,
+			     PMBUS_REG_MFR_VBOOT);
+static SENSOR_DEVICE_ATTR_RW(preclamp_offset, fd5121_word,
+			     PMBUS_REG_MFR_PRECLAMP_OFFSET);
+static SENSOR_DEVICE_ATTR_RW(tlvr_diagn, fd5121_word,
+			     PMBUS_REG_MFR_TLVR_DIAGN);
+static SENSOR_DEVICE_ATTR_RO(vsys, fd5121_word,
+			     PMBUS_REG_MFR_READ_VSYS);
+static SENSOR_DEVICE_ATTR_RW(specific_e9, fd5121_word,
+			     PMBUS_REG_MFR_SPECIFIC_E9);
+static SENSOR_DEVICE_ATTR_RW(specific_ea, fd5121_long,
+			     PMBUS_REG_MFR_SPECIFIC_EA);
+static SENSOR_DEVICE_ATTR_RO(ss_cbc, fd5121_word,
+			     PMBUS_REG_MFR_SS_CBC);
+static SENSOR_DEVICE_ATTR_RO(amd_status, fd5121_byte,
+			     PMBUS_REG_MFR_AMD_STATUS);
+static SENSOR_DEVICE_ATTR_RO(checksum, fd5121_word,
+			     PMBUS_REG_MFR_CHECKSUM);
+static SENSOR_DEVICE_ATTR_RO(cse_index, fd5121_word,
+			     PMBUS_REG_CSE_INDEX);
+static SENSOR_DEVICE_ATTR_RW(cout_measure, fd5121_word,
+			     PMBUS_REG_COUT_MEASURE);
+static SENSOR_DEVICE_ATTR_RO(vr_cout, fd5121_word,
+			     PMBUS_REG_VR_COUT);
+static SENSOR_DEVICE_ATTR_RO(bbr_ram, fd5121_block,
+			     PMBUS_REG_BBR_RAM);
+static SENSOR_DEVICE_ATTR_RO(bbr_otp, fd5121_block,
+			     PMBUS_REG_BBR_OTP);
+static SENSOR_DEVICE_ATTR_RO(psys, fd5121_word,
+			     PMBUS_REG_READ_PSYS);
+static SENSOR_DEVICE_ATTR_RW(postclamp_offset, fd5121_word,
+			     PMBUS_REG_POSTCLAMP_OFFSET);
+static SENSOR_DEVICE_ATTR_RW(pgood_delay, fd5121_byte,
+			     PMBUS_REG_PGOOD_DELAY);
+static SENSOR_DEVICE_ATTR_RW(specific_f8, fd5121_word,
+			     PMBUS_REG_MFR_SPECIFIC_F8);
+static SENSOR_DEVICE_ATTR_RW(specific_f9, fd5121_long,
+			     PMBUS_REG_MFR_SPECIFIC_F9);
+static SENSOR_DEVICE_ATTR_RW(pwd_program_ram, fd5121_word,
+			     PMBUS_REG_MFR_PWD_PROGRAM_RAM);
+static SENSOR_DEVICE_ATTR_RW(pwd_program_i2c, fd5121_word,
+			     PMBUS_REG_MFR_PWD_PROGRAM_I2C);
+static SENSOR_DEVICE_ATTR_RW(pwd_enable_otp_store, fd5121_word,
+			     PMBUS_REG_MFR_PWD_ENABLE_OTP_STORE);
+
+static struct attribute *fd5121_non_paged_attrs[] = {
+	&sensor_dev_attr_page.dev_attr.attr,
+	&sensor_dev_attr_capability.dev_attr.attr,
+	&sensor_dev_attr_pin_counter.dev_attr.attr,
+	&sensor_dev_attr_vaux_uv_fault_limit.dev_attr.attr,
+	&sensor_dev_attr_vaux_ov_fault_limit.dev_attr.attr,
+	&sensor_dev_attr_vaux_uv_warning.dev_attr.attr,
+	&sensor_dev_attr_vaux_ov_warning.dev_attr.attr,
+	&sensor_dev_attr_free_user_config_tables.dev_attr.attr,
+	&sensor_dev_attr_address_table.dev_attr.attr,
+	&sensor_dev_attr_unlock.dev_attr.attr,
+	&sensor_dev_attr_faulty_sps.dev_attr.attr,
+	&sensor_dev_attr_tlvr_faults.dev_attr.attr,
+	&sensor_dev_attr_user_store_config_tab.dev_attr.attr,
+	&sensor_dev_attr_user_config_index.dev_attr.attr,
+	&sensor_dev_attr_pwm_disconnection.dev_attr.attr,
+	&sensor_dev_attr_vr_disconnection.dev_attr.attr,
+	&sensor_dev_attr_user_restore_config_tab.dev_attr.attr,
+	&sensor_dev_attr_svr_go.dev_attr.attr,
+	&sensor_dev_attr_set_pwd.dev_attr.attr,
+	&sensor_dev_attr_config_activate.dev_attr.attr,
+	&sensor_dev_attr_config_recover.dev_attr.attr,
+	&sensor_dev_attr_otp_dump.dev_attr.attr,
+	&sensor_dev_attr_bbr_en.dev_attr.attr,
+	&sensor_dev_attr_vboot.dev_attr.attr,
+	&sensor_dev_attr_vsys.dev_attr.attr,
+	&sensor_dev_attr_specific_e9.dev_attr.attr,
+	&sensor_dev_attr_specific_ea.dev_attr.attr,
+	&sensor_dev_attr_ss_cbc.dev_attr.attr,
+	&sensor_dev_attr_checksum.dev_attr.attr,
+	&sensor_dev_attr_cse_index.dev_attr.attr,
+	&sensor_dev_attr_cout_measure.dev_attr.attr,
+	&sensor_dev_attr_vr_cout.dev_attr.attr,
+	&sensor_dev_attr_bbr_ram.dev_attr.attr,
+	&sensor_dev_attr_bbr_otp.dev_attr.attr,
+	&sensor_dev_attr_psys.dev_attr.attr,
+	&sensor_dev_attr_specific_f8.dev_attr.attr,
+	&sensor_dev_attr_specific_f9.dev_attr.attr,
+	&sensor_dev_attr_pwd_program_ram.dev_attr.attr,
+	&sensor_dev_attr_pwd_program_i2c.dev_attr.attr,
+	&sensor_dev_attr_pwd_enable_otp_store.dev_attr.attr,
+	&sensor_dev_attr_revision.dev_attr.attr,
+	&sensor_dev_attr_id.dev_attr.attr,
+	&sensor_dev_attr_model.dev_attr.attr,
+	&sensor_dev_attr_mfr_revision.dev_attr.attr,
+	&sensor_dev_attr_date.dev_attr.attr,
+	&sensor_dev_attr_ic_device_id.dev_attr.attr,
+	&sensor_dev_attr_ic_device_rev.dev_attr.attr,
+	NULL
+};
+
+static struct attribute *fd5121_paged_attrs[] = {
+	&sensor_dev_attr_operation.dev_attr.attr,
+	&sensor_dev_attr_vout_raw.dev_attr.attr,
+	&sensor_dev_attr_on_off_config.dev_attr.attr,
+	&sensor_dev_attr_clear_faults.dev_attr.attr,
+	&sensor_dev_attr_write_protect.dev_attr.attr,
+	&sensor_dev_attr_smbalert_mask.dev_attr.attr,
+	&sensor_dev_attr_vout_mode.dev_attr.attr,
+	&sensor_dev_attr_vout_command.dev_attr.attr,
+	&sensor_dev_attr_vout_margin_high.dev_attr.attr,
+	&sensor_dev_attr_vout_margin_low.dev_attr.attr,
+	&sensor_dev_attr_vout_min.dev_attr.attr,
+	&sensor_dev_attr_vin_on.dev_attr.attr,
+	&sensor_dev_attr_vin_off.dev_attr.attr,
+	&sensor_dev_attr_vout_ov_fault_response.dev_attr.attr,
+	&sensor_dev_attr_vout_uv_fault_response.dev_attr.attr,
+	&sensor_dev_attr_iout_oc_fault_response.dev_attr.attr,
+	&sensor_dev_attr_ot_fault_response.dev_attr.attr,
+	&sensor_dev_attr_vin_ov_fault_response.dev_attr.attr,
+	&sensor_dev_attr_status_byte.dev_attr.attr,
+	&sensor_dev_attr_iknee_set.dev_attr.attr,
+	&sensor_dev_attr_vmin_aware.dev_attr.attr,
+	&sensor_dev_attr_power_mode.dev_attr.attr,
+	&sensor_dev_attr_vin_uv_fault_response.dev_attr.attr,
+	&sensor_dev_attr_iin_oc_fault_response.dev_attr.attr,
+	&sensor_dev_attr_ton_delay.dev_attr.attr,
+	&sensor_dev_attr_pout_op_fault_response.dev_attr.attr,
+	&sensor_dev_attr_vaux_uv_fault_response.dev_attr.attr,
+	&sensor_dev_attr_vaux_ov_fault_response.dev_attr.attr,
+	&sensor_dev_attr_status_onsemi.dev_attr.attr,
+	&sensor_dev_attr_status_cml.dev_attr.attr,
+	&sensor_dev_attr_status_other.dev_attr.attr,
+	&sensor_dev_attr_status_mfr_specific.dev_attr.attr,
+	&sensor_dev_attr_read_vaux.dev_attr.attr,
+	&sensor_dev_attr_ton_slew.dev_attr.attr,
+	&sensor_dev_attr_toff_slew.dev_attr.attr,
+	&sensor_dev_attr_rail_name.dev_attr.attr,
+	&sensor_dev_attr_vout_droop.dev_attr.attr,
+	&sensor_dev_attr_dpm_min.dev_attr.attr,
+	&sensor_dev_attr_preclamp_offset.dev_attr.attr,
+	&sensor_dev_attr_tlvr_diagn.dev_attr.attr,
+	&sensor_dev_attr_amd_status.dev_attr.attr,
+	&sensor_dev_attr_postclamp_offset.dev_attr.attr,
+	&sensor_dev_attr_pgood_delay.dev_attr.attr,
+	&sensor_dev_attr_vout_max.dev_attr.attr,
+	&sensor_dev_attr_vout_transition_rate.dev_attr.attr,
+	NULL
+};
+
+static struct attribute_group fd5121_groups[2] = {
+	{ .name = "global", .attrs = fd5121_non_paged_attrs },
+	{ .name = "paged", .attrs = fd5121_paged_attrs }
+};
+
+/* Regulator descriptors for VOUT rails (VID encoded) */
+static struct regulator_desc fd5121_reg_desc[] = {
+	PMBUS_REGULATOR_STEP_ONE("vout1", 3001, 1000, 200000),
+	PMBUS_REGULATOR_STEP_ONE("vout2", 3001, 1000, 200000),
+};
+
+static int fd5121_valid_reg(struct i2c_client *client, int reg)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(cc_list); i++) {
+		if (reg == cc_list[i])
+			return 0;
+	}
+
+	if (fd5121_blk_rd_reg(reg))
+		return 0;
+	return -ENXIO;
+}
+
+static int fd5121_read_word_data(struct i2c_client *client, int page,
+				 int phase, int reg)
+{
+	int ret;
+
+	ret = fd5121_valid_reg(client, reg);
+	if (ret < 0)
+		return ret;
+
+	ret = pmbus_read_word_data(client, page, phase, reg);
+	if (ret < 0)
+		return ret;
+
+	/* Chip reports VOUT_MODE as vid. But gives raw value 1mV per bit.
+	 * So, encode the READ_VOUT value so that it gets decoded and
+	 * reported correctly.
+	 */
+	if (reg == PMBUS_READ_VOUT)
+		ret = DIV_ROUND_CLOSEST(155000 - ret * 100, 625);
+	return ret;
+}
+
+static int fd5121_read_byte_data(struct i2c_client *client, int page, int reg)
+{
+	int ret;
+
+	ret = fd5121_valid_reg(client, reg);
+	if (ret < 0)
+		return ret;
+
+	return pmbus_read_byte_data(client, page, reg);
+}
+
+static int fd5121_write_byte_data(struct i2c_client *client, int page,
+				  int reg, u8 value)
+{
+	int ret;
+
+	ret = fd5121_valid_reg(client, reg);
+	if (ret < 0)
+		return ret;
+	return pmbus_write_byte_data(client, page, reg, value);
+}
+
+static int fd5121_write_byte(struct i2c_client *client, int page, u8 byte)
+{
+	return pmbus_write_byte(client, page, byte);
+}
+
+static int fd5121_write_word_data(struct i2c_client *client, int page,
+				    int reg, u16 word)
+{
+	int ret;
+
+	ret = fd5121_valid_reg(client, reg);
+	if (ret < 0)
+		return ret;
+	ret = pmbus_write_word_data(client, page, reg, word);
+	return ret;
+}
+
+static u32 fd5121_get_dev_id(struct i2c_client *client)
+{
+	u32 dev_id = 0;
+	s32 ret = 0;
+
+	ret = fd5121_read_long(client, PMBUS_IC_DEVICE_ID, &dev_id);
+	if (ret < 0)
+		return CTLR_ID_UNKNOWN;
+
+	switch (dev_id) {
+	case CTLR_ID_FD5121:
+	case CTLR_ID_FD5123:
+	case CTLR_ID_FD5125:
+		break;
+	default:
+		if (dev_id != 0)
+			dev_err(&client->dev, "Unknown device 0x%x",
+				dev_id);
+		return CTLR_ID_UNKNOWN;
+	}
+	return dev_id;
+}
+
+static int fd5121_probe(struct i2c_client *client)
+{
+	struct pmbus_driver_info *info;
+	struct fd5121_data *pdata;
+	u32 id;
+
+	if (!i2c_check_functionality(client->adapter, I2C_FUNC_I2C))
+		return -EOPNOTSUPP;
+
+	pdata = devm_kzalloc(&client->dev, sizeof(struct fd5121_data),
+			     GFP_KERNEL);
+	if (pdata == NULL)
+		return -ENOMEM;
+
+	pdata->dev = &client->dev;
+	pdata->groups[0] = &fd5121_groups[0];
+	pdata->groups[1] = &fd5121_groups[1];
+
+	id = fd5121_get_dev_id(client);
+	if (id == CTLR_ID_UNKNOWN)
+		return -ENODEV;
+
+	pdata->id = id;
+
+	switch (id) {
+	case CTLR_ID_FD5121:
+	case CTLR_ID_FD5123:
+	case CTLR_ID_FD5125:
+		break;
+	default:
+		dev_err(&client->dev, "Failed to read device ID");
+		return -ENODEV;
+	}
+
+	info = &pdata->info;
+	info->groups = (const struct attribute_group **)&pdata->groups[0];
+	info->write_word_data = fd5121_write_word_data;
+	info->write_byte = fd5121_write_byte;
+	info->write_byte_data = fd5121_write_byte_data;
+	info->read_word_data = fd5121_read_word_data;
+	info->read_byte_data = fd5121_read_byte_data;
+
+	info->pages = FD5121_NUM_PAGES;
+	info->format[PSC_VOLTAGE_IN] = linear;
+	info->format[PSC_VOLTAGE_OUT] = vid;
+
+	fd5121_reg_desc[0].id = 0;
+	fd5121_reg_desc[1].id = 1;
+
+	/* Device implements VID coding with 1 mV steps from 0.200 V
+	 * up to 3.200 V
+	 */
+	info->num_regulators = FD5121_NUM_PAGES;
+	info->reg_desc = fd5121_reg_desc;
+	info->format[PSC_CURRENT_IN] = linear;
+	info->format[PSC_CURRENT_OUT] = linear;
+	info->format[PSC_POWER] = linear;
+	info->format[PSC_TEMPERATURE] = linear;
+	for (u8 idx = 0; idx < info->pages; idx++) {
+		info->func[idx] = PMBUS_HAVE_IOUT | PMBUS_HAVE_STATUS_IOUT;
+		info->func[idx] |= PMBUS_HAVE_VOUT | PMBUS_HAVE_STATUS_VOUT;
+		info->func[idx] |= PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
+		info->func[idx] |= PMBUS_HAVE_PIN | PMBUS_HAVE_POUT;
+		info->func[idx] |= PMBUS_HAVE_VIN | PMBUS_HAVE_IIN;
+		info->func[idx] |= PMBUS_HAVE_STATUS_INPUT;
+		info->vrm_version[idx] = amd625mv;
+	}
+	return pmbus_do_probe(client, info);
+}
+
+#ifdef CONFIG_OF
+static const struct of_device_id fd5121_of_match[] = {
+	{ .compatible = "onnn,fd5121" },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, fd5121_of_match);
+#endif
+
+static const struct i2c_device_id fd5121_id[] = {
+	{ "fd5121", chip_fd5121 },
+	{ "fd5123", chip_fd5123 },
+	{ "fd5125", chip_fd5125 },
+	{ }
+};
+MODULE_DEVICE_TABLE(i2c, fd5121_id);
+
+static struct i2c_driver fd5121_driver = {
+	.driver = {
+		.name = "fd5121",
+		.of_match_table = of_match_ptr(fd5121_of_match),
+	},
+	.probe = fd5121_probe,
+	.id_table = fd5121_id,
+};
+
+module_i2c_driver(fd5121_driver);
+
+MODULE_AUTHOR("Selva Rajagopal <selvamani.rajagopal@onsemi.com>");
+MODULE_DESCRIPTION("PMBus driver for FD5121");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS("PMBUS");
+

-- 
2.43.0



^ permalink raw reply related

* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Binbin Wu @ 2026-06-23  5:25 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-9-9d2959357853@google.com>



On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Introduce function for KVM to check the private/shared status of guest
           ^
Nit:       a
 > memory at a given GFN.
> 
> This will be used in a later patch.

[...]

>  
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> +	struct inode *inode;
> +
> +	/*
> +	 * If this gfn has no associated memslot, there's no chance of the gfn
> +	 * being backed by private memory, since guest_memfd must be used for
> +	 * private memory,

"guest_memfd must be used for private memory" is a bit confusing to me.


> and guest_memfd must be associated with some memslot.
> +	 */
> +	if (!slot)
> +		return 0;
> +
> +	CLASS(gmem_get_file, file)(slot);
> +	if (!file)
> +		return 0;
> +
> +	inode = file_inode(file);
> +
> +	/*
> +	 * Rely on the maple tree's internal RCU lock to ensure a
> +	 * stable result. This result can become stale as soon as the
> +	 * lock is dropped, so the caller _must_ still protect
> +	 * consumption of private vs. shared by checking
> +	 * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> +	 * against ongoing attribute updates.
> +	 */
> +	return kvm_gmem_is_private_mem(inode, kvm_gmem_get_index(slot, gfn));
> +}
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_is_private);
> +
>  static struct file_operations kvm_gmem_fops = {
>  	.mmap		= kvm_gmem_mmap,
>  	.open		= generic_file_open,
> 


^ permalink raw reply

* Re: [RFC PATCH 0/2] kasan: hw_tags: Add option to tag only at allocation time
From: Dev Jain @ 2026-06-23  5:02 UTC (permalink / raw)
  To: Catalin Marinas, Harry Yoo
  Cc: ryabinin.a.a, akpm, corbet, glider, andreyknvl, dvyukov,
	vincenzo.frascino, kasan-dev, linux-mm, linux-kernel, skhan,
	workflows, linux-doc, linux-arm-kernel, ryan.roberts,
	anshuman.khandual, kaleshsingh, 21cnbao, david, will
In-Reply-To: <ajltLd6FQg1aMge_@arm.com>



On 22/06/26 10:43 pm, Catalin Marinas wrote:
> Hi Harry,
> 
> On Mon, Jun 22, 2026 at 09:42:10PM +0900, Harry Yoo wrote:
>> On 6/19/26 10:19 PM, Catalin Marinas wrote:
>>> On Thu, Jun 18, 2026 at 10:35:15PM +0900, Harry Yoo wrote:
>>>> On 6/12/26 1:44 PM, Dev Jain wrote:
>>>>> Now, when a memory object will be freed, it will retain the random tag it
>>>>> had at allocation time. This compromises on catching UAF bugs, till the
>>>>> time the object is not reallocated, at which point it will have a new
>>>>> random tag.
>>>>>
>>>>> Hence, not catching "use-after-free-before-reallocation" and not catching
>>>>> "double-free" will be the compromise for reduced KASAN overhead.
>>>>
>>>> I doubt users who care about security enough to enable HW_TAGS KASAN
>>>> are willing to compromise on security just to save a few instructions
>>>> to store tags in the free path.
>>>>
>>>> To me, it looks like too much of a compromise on security for little
>>>> performance gain.
>>>
>>> I don't think there's much compromise on security for use-after-free.
>>
>> I think it depends... OH, WAIT! I see what you mean.
>>
>> You mean use-after-free before reallocation does not lead to much
>> compromise on security because objects are initialized after allocation?
>>
>> You're probably right.
>>
>> Hmm, but stores to e.g.) free pointer, fields initialized by
>> constructor or accessed by SLAB_TYPESAFE_BY_RCU semantics after free
>> will be undiscovered if they happen before reallocation.
> 
> Even with SLAB_TYPESAFE_BY_RCU, the object isn't tagged on free either
> (or realloc, only if the actual slab page ends up freed). But we don't
> get type confusion for such slab.
> 
> However, without tagging on free, one could argue that it reduces
> security for cases where the page is re-allocated as untagged - e.g. all
> user pages mapped without PROT_MTE. Currently we have a deterministic
> tag check fault if the page is coloured as KASAN_TAG_INVALID. I think

So you are saying that a stale kernel pointer can continue to use the
reallocated page, because for non-PROT_MTE case the page does not get
a new tag. Makes sense.
> for this patch, it might be better to only do such skip on free in
> kasan_poison_slab() rather than kasan_poison(). Freed pages would then
> be tagged.

I think you mean to say, "skip tag on free when freeing pages into buddy"?
So that would mean, kasan_poison() will do the poisoning also in the
case of value == KASAN_PAGE_FREE.

> 
> An alternative would be tagging on free only with a new tag and skipping
> it on re-alloc. But we'd need to track when it's a completely new
> allocation or a reused object (I haven't looked I'm pretty sure it's
> doable).

That was our original approach, and IIRC we had concluded there was no
security compromise. However it is difficult to implement - it has cases
like, what happens when two differently tagged pages are coalesced by
buddy and someone gets that large page as an allocation.

> 


^ permalink raw reply

* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Binbin Wu @ 2026-06-23  4:55 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-7-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d370e834d619e..eb26d4ea8945a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2534,13 +2534,13 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
>  }
>  
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +static inline unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
>  	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
>  }
>  
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -				     unsigned long mask, unsigned long attrs);
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +					unsigned long mask, unsigned long attrs);
>  bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>  					struct kvm_gfn_range *range);
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> @@ -2548,7 +2548,14 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
> -	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +	return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> +					    gfn_t end)
> +{
> +	return kvm_range_has_vm_memory_attributes(kvm, start, end,
> +						  KVM_MEMORY_ATTRIBUTE_PRIVATE,
> +						  KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  }

This function is added, but never used in this patch series.
Is it intended to be called only when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is
enabled?



>  #else
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)

^ permalink raw reply

* Re: [PATCH v8 06/46] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
From: Binbin Wu @ 2026-06-23  3:10 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-6-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
> on kvm_arch_has_private_mem being #defined in anticipation of decoupling
> kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> guest_memfd support for memory attributes will be unconditional to avoid
> yet more macros (all architectures that support guest_memfd are expected to
> use per-gmem attributes at some point), at which point enumerating support
> KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes being
> supported _somewhere_ would result in KVM over-reporting support on arm64.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>  virt/kvm/kvm_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1ccc4895a4c26..7b989b659cf82 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2421,8 +2421,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  {
> +#ifdef kvm_arch_has_private_mem
>  	if (!kvm || kvm_arch_has_private_mem(kvm))
>  		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
>  
>  	return 0;
>  }
> 


^ permalink raw reply

* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Binbin Wu @ 2026-06-23  2:51 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-4-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> When memory attributes become trackable in guest_memfd, the concept of
> having private memory is no longer dependent on
> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> 
> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
> in.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

One nit below.

> ---
>  arch/x86/include/asm/kvm_host.h | 4 +++-
>  include/linux/kvm_host.h        | 2 +-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>  		       int tdp_max_root_level, int tdp_huge_page_level);
>  
>  
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) ||	\
> +	defined(CONFIG_KVM_INTEL_TDX) ||	\
> +	defined(CONFIG_KVM_AMD_SEV)

Nit:
Vertically align the defined(XXX) statements for better readability?


>  #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>  #endif
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 201d0f2143976..d370e834d619e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>  
> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifndef kvm_arch_has_private_mem
>  static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>  {
>  	return false;
> 


^ permalink raw reply

* Re: [PATCH v8 03/46] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
From: Binbin Wu @ 2026-06-23  2:48 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-3-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Bury KVM_VM_MEMORY_ATTRIBUTES in x86 to discourage other architectures
> from adding support for per-VM memory attributes, because tracking private
> vs. shared memory on a per-VM basis is now deprecated in favor of tracking
> on a per-guest_memfd basis, and while RWX memory attributes are on the
> horizon, they too are expected to be x86-only.
> 
> This will also allow modifying KVM_VM_MEMORY_ATTRIBUTES to be
> user-selectable (in x86) without creating weirdness in KVM's Kconfigs.
> Now that guest_memfd supports in-place conversions, it's entirely possible
> to run x86 CoCo VMs without support for KVM_VM_MEMORY_ATTRIBUTES.
> 
> Leave the code itself in common KVM so that it's trivial to undo this
> change if new per-VM attributes do come along.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>  arch/x86/kvm/Kconfig | 3 +++
>  virt/kvm/Kconfig     | 3 ---
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 26f6afd51bbdc..24f96396cfa1c 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -80,6 +80,9 @@ config KVM_WERROR
>  
>  	  If in doubt, say "N".
>  
> +config KVM_VM_MEMORY_ATTRIBUTES
> +	bool
> +
>  config KVM_SW_PROTECTED_VM
>  	bool "Enable support for KVM software-protected VMs"
>  	depends on EXPERT
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 5119cb37145fc..297e4399fbd49 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,9 +100,6 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
>  config KVM_MMU_LOCKLESS_AGING
>         bool
>  
> -config KVM_VM_MEMORY_ATTRIBUTES
> -       bool
> -
>  config KVM_GUEST_MEMFD
>         select XARRAY_MULTI
>         bool
> 


^ permalink raw reply

* Re: [PATCH v8 02/46] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
From: Binbin Wu @ 2026-06-23  2:48 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-2-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Rename the per-VM memory attributes Kconfig to make it explicitly about
> per-VM attributes in anticipation of adding memory attributes support to
> guest_memfd, at which point it will be possible (and desirable) to have
> memory attributes without the per-VM support, even in x86.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>


^ permalink raw reply

* Re: [PATCH v8 00/46] guest_memfd: In-place conversion support
From: Xiaoyao Li @ 2026-06-23  2:39 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> TODOs
> 
> + Retest with TDX selftests. v7 was tested with TDX [12], but the setup there was
>    wrong. Conversions were successful (no errors), but the shared memory being
>    tested is actually in a completely different host physical page.

Glad to see you knew it already (I was going to report this to the 
original POC TDX patch)

^ permalink raw reply

* Re: [PATCH v8 01/46] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Binbin Wu @ 2026-06-23  2:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajnjTJdQKD1Kz3tf@google.com>



On 6/23/2026 9:37 AM, Sean Christopherson wrote:
> On Mon, Jun 22, 2026, Binbin Wu wrote:
>> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>>
>> [...]
>>
>>>  
>>> +static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
>>> +{
>>> +	struct maple_tree *mt = &GMEM_I(inode)->attributes;
>>> +	void *entry = mtree_load(mt, index);
>>> +
>>> +	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
>>
>> If the entry is unexpectedly missing, returning 0 means the attribute would
>> be treated as shared.  And then in kvm_gmem_fault_user_mapping(), it would
>> allow the userspace to fault in the folio.
>>
>> Should gmem deny such edge case?
> 
> After several bugs this year where a WARN_ON_ONCE() fired, but was entirely
> insufficient to prevent true badness, I'm definitely senstive to making the "bad"
> behavior as harmless as possible.
> 
> However, in this case I think we're just hosed.  If KVM treats the memory as
> private, KVM will incorrectly do prepare(), incorrectly allow populate(), and
> will caused missed invalidations (though I suppose __kvm_gmem_set_attributes()
> "only" lies to userspace in that case).
> 
> That said, assuming SHARED is definitely odd for cases where guest_memfd *can't*
> hold shared memory.  Ditto for assuming PRIVATE.  

Indeed.

> What if we instead fall back to
> the "init" state, e.g.?
LGTM.

> 
> static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
> {
> 	struct maple_tree *mt = &GMEM_I(inode)->attributes;
> 	void *entry = mtree_load(mt, index);
> 
> 	if (WARN_ON_ONCE(!entry)) {
> 		bool shared = GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED;
> 
> 		return shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
> 	}
> 
> 	return xa_to_value(entry);
> }
> 


^ permalink raw reply

* [PATCH v7 10/10] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-23  1:45 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.

Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.

Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.

Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.

Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Update testcase according to changes.
 Changes in v5:
  - Add more syntax test cases.
 Changes in v4:
  - Fix uprobe $current test.
 Changes in v3:
  - Add syntax test case.
  - Update testcase to use this_cpu_read()
 Changes in v2:
  - Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
---
 samples/trace_events/trace-events-sample.c         |   40 +++++++++++++++-
 samples/trace_events/trace-events-sample.h         |   34 ++++++++++++-
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++++++++++++++++++++
 .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |   11 ++++
 .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc   |   11 ++++
 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc   |    5 ++
 6 files changed, 147 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc

diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 0b7a6efdb247..ca5d98c360cb 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -94,6 +94,20 @@ static int simple_thread_fn(void *arg)
 static DEFINE_MUTEX(thread_mutex);
 static int simple_thread_cnt;
 
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+	struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+	get_cpu();
+	trace_foo_timer_fn(data);
+	(*this_cpu_ptr(data->counter))++;
+	put_cpu();
+
+	mod_timer(t, jiffies + HZ);
+}
+
 int foo_bar_reg(void)
 {
 	mutex_lock(&thread_mutex);
@@ -132,9 +146,27 @@ void foo_bar_unreg(void)
 
 static int __init trace_event_init(void)
 {
+	foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+	if (!foo_timer_data)
+		return -ENOMEM;
+
+	foo_timer_data->name = "sample_timer_counter";
+	foo_timer_data->counter = alloc_percpu(int);
+	if (!foo_timer_data->counter) {
+		kfree(foo_timer_data);
+		return -ENOMEM;
+	}
+
+	timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+	mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
 	simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
-	if (IS_ERR(simple_tsk))
-		return -1;
+	if (IS_ERR(simple_tsk)) {
+		timer_shutdown_sync(&foo_timer_data->timer);
+		free_percpu(foo_timer_data->counter);
+		kfree(foo_timer_data);
+		return PTR_ERR(simple_tsk);
+	}
 
 	return 0;
 }
@@ -147,6 +179,10 @@ static void __exit trace_event_exit(void)
 		kthread_stop(simple_tsk_fn);
 	simple_tsk_fn = NULL;
 	mutex_unlock(&thread_mutex);
+
+	timer_shutdown_sync(&foo_timer_data->timer);
+	free_percpu(foo_timer_data->counter);
+	kfree(foo_timer_data);
 }
 
 module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
  */
 
 /*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
  */
 #ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
 #define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
 static inline int __length_of(const int *list)
 {
 	int i;
@@ -270,6 +272,13 @@ enum {
 	TRACE_SAMPLE_BAR = 4,
 	TRACE_SAMPLE_ZOO = 8,
 };
+
+struct foo_timer_data {
+	const char		*name;
+	struct timer_list	timer;
+	int __percpu		*counter;
+};
+
 #endif
 
 /*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
 		  __get_rel_bitmask(bitmask),
 		  __get_rel_cpumask(cpumask))
 );
+
+TRACE_EVENT(foo_timer_fn,
+
+	TP_PROTO(struct foo_timer_data *data),
+
+	TP_ARGS(data),
+
+	TP_STRUCT__entry(
+		__string(	name,			data->name	)
+		__field(	int,			count		)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->count	= *this_cpu_ptr(data->counter);
+	),
+
+	TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
 #endif
 
 /***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..96791e120b7d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+  modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+  if echo $line | grep -q "foo_timer_fn:"; then
+    NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+    COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+    if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+       MATCH=$((MATCH+1))
+    fi
+  fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+  echo "No matching events found"
+  exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index fee479295e2f..e111d426a984 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -112,6 +112,17 @@ check_error 'f vfs_read%return $retval->^foo'	# NO_PTR_STRCT
 check_error 'f vfs_read file->^foo'		# NO_BTF_FIELD
 check_error 'f vfs_read file^-.foo'		# BAD_HYPHEN
 check_error 'f vfs_read ^file:string'		# BAD_TYPE4STR
+if grep -qF "[(structname" README ; then
+check_error 'f vfs_read arg1=(task_struct)file^'		# TYPECAST_REQ_FIELD
+check_error 'f vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a'	# TOO_MANY_NESTED
+check_error 'f vfs_read arg1=(task_struct,^in_execve)file->comm'	# TYPECAST_NOT_ALIGNED
+check_error 'f vfs_read arg1=(task_struct,^foo_bar)file->pid'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(^task_struct1234)file->pid'	# NO_PTR_STRCT
+check_error 'f vfs_read arg1=(task_struct,se^->group_node)file->comm'	# TYPECAST_BAD_ARROW
+check_error 'f vfs_read arg1=(task_struct,^->pid)file->comm'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.pid)file->comm'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.)file->comm'	# NO_BTF_FIELD
+fi
 fi
 
 else
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index 8f1c58f0c239..626adeb2e840 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -115,6 +115,17 @@ check_error 'p vfs_read+20 ^$arg*'		# NOFENTRY_ARGS
 check_error 'p vfs_read ^hoge'			# NO_BTFARG
 check_error 'p kfree ^$arg10'			# NO_BTFARG (exceed the number of parameters)
 check_error 'r kfree ^$retval'			# NO_RETVAL
+if grep -qF "[(structname" README ; then
+check_error 'p vfs_read arg1=(task_struct)file^'		# TYPECAST_REQ_FIELD
+check_error 'p vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a'	# TOO_MANY_NESTED
+check_error 'p vfs_read arg1=(task_struct,^in_execve)file->comm'	# TYPECAST_NOT_ALIGNED
+check_error 'p vfs_read arg1=(task_struct,^foo_bar)file->pid'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(^task_struct1234)file->pid'		# NO_PTR_STRCT
+check_error 'p vfs_read arg1=(task_struct,se^->group_node)file->comm'	# TYPECAST_BAD_ARROW
+check_error 'p vfs_read arg1=(task_struct,^->pid)file->comm'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.pid)file->comm'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.)file->comm'	# NO_BTF_FIELD
+fi
 else
 check_error 'p vfs_read ^$arg*'			# NOSUP_BTFARG
 fi
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
index c817158b99db..e12dc967ec76 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
@@ -28,4 +28,9 @@ if grep -q ".*symstr.*" README; then
 check_error 'p /bin/sh:10 $stack0:^symstr'	# BAD_TYPE
 fi
 
+# $current is not supported by uprobe
+if grep -q "\$current.*" README; then
+check_error 'p /bin/sh:10 ^$current:u8'	# BAD_VAR
+fi
+
 exit 0


^ permalink raw reply related

* [PATCH v7 09/10] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-23  1:45 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When tracing the kernel local variables, sometimes we need to get the
CPU local variables. To access it, current simple dereference is not
enough.

Thus, introduce a special this_cpu_read() dereference to access per-cpu
variable for the current CPU (accessing other CPU variable may race with
updates on other CPUs). Also this_cpu_ptr() is for accessing per-cpu
pointer.

Those are working as same as the kernel percpu macro.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Rebased on dump fetcharg patch.
  - Fix to fetch static percpu variable with @SYM correctly.
 Changes in v5:
  - Simplify this_cpu_read() into +0(this_cpu_ptr()).
 Changes in v3:
  - Remove NULL check for percpu var because it is just an offset, could be 0.
  - Simplify process_fetch_insn_bottom() code.
  - If the last operation is this_cpu_read(), read only memory of the specific
    size (of type).
 Changes in v2:
  - Drop +CPU/+PCPU and introduce this_cpu_read() and this_cpu_ptr().
  - Support these method with BTF typecast.
  - Just check the base address is NOT NULL instead of is_kernel_percpu_address().
---
 Documentation/trace/eprobetrace.rst |    2 
 Documentation/trace/fprobetrace.rst |    2 
 Documentation/trace/kprobetrace.rst |    2 
 kernel/trace/trace.c                |    1 
 kernel/trace/trace_probe.c          |  143 ++++++++++++++++++++++++++---------
 kernel/trace/trace_probe.h          |    3 -
 kernel/trace/trace_probe_tmpl.h     |   22 ++++-
 7 files changed, 130 insertions(+), 45 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 680e0af43d5d..279396951b34 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -39,6 +39,8 @@ Synopsis of eprobe_events
   @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
   $comm		: Fetch current task comm.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+  this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..3439bc9bd351 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,8 @@ Synopsis of fprobe-events
   $comm         : Fetch current task comm.
   $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+  this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+  this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
   \IMM          : Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..9ae330eb0a52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,8 @@ Synopsis of kprobe_events
   $comm		: Fetch current task comm.
   $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+  this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7a5676524f1a..d4121acc2938 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4332,6 +4332,7 @@ static const char readme_msg[] =
 	"\t           $stack<index>, $stack, $retval, $comm, $current\n"
 #endif
 	"\t           +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+	"\t           this_cpu_read(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
 	"\t     kernel return probes support: $retval, $arg<N>, $comm\n"
 	"\t     type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
 	"\t           b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index ddcbc8c12ad9..85aed7327530 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -349,6 +349,100 @@ static int parse_trace_event(char *arg, struct fetch_insn *code,
 	return -EINVAL;
 }
 
+/* this_cpu_* parser */
+#define THIS_CPU_PTR_PREFIX "this_cpu_ptr("
+#define THIS_CPU_READ_PREFIX "this_cpu_read("
+#define THIS_CPU_PTR_LEN (sizeof(THIS_CPU_PTR_PREFIX) - 1)
+#define THIS_CPU_READ_LEN (sizeof(THIS_CPU_READ_PREFIX) - 1)
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+		struct fetch_insn **pcode, struct fetch_insn *end,
+		struct traceprobe_parse_context *ctx);
+
+/* handle dereference nested call */
+static inline int handle_dereference(char *arg, struct fetch_insn **pcode,
+	struct fetch_insn *end, struct traceprobe_parse_context *ctx,
+	int deref, long offset)
+{
+	const struct fetch_type *type = find_fetch_type(NULL, ctx->flags);
+	struct fetch_insn *code = *pcode;
+	int cur_offs = ctx->offset;
+	char *tmp;
+	int ret;
+
+	tmp = strrchr(arg, ')');
+	if (!tmp) {
+		trace_probe_log_err(ctx->offset + strlen(arg),
+					DEREF_OPEN_BRACE);
+		return -EINVAL;
+	}
+
+	*tmp = '\0';
+	ret = parse_probe_arg(arg, type, &code, end, ctx);
+	if (ret)
+		return ret;
+	ctx->offset = cur_offs;
+	if (code->op == FETCH_OP_COMM || code->op == FETCH_OP_IMMSTR) {
+		trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
+		return -EINVAL;
+	}
+
+	/*
+	 * this_cpu_ptr(@SYM) does not use SYM value, but use SYM address.
+	 * So we overwrite the last FETCH_OP_DEREF with FETCH_OP_CPU_PTR.
+	 */
+	if (!(deref == FETCH_OP_CPU_PTR && *arg == '@')) {
+		code++;
+		if (code == end) {
+			trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+			return -EINVAL;
+		}
+	}
+	*pcode = code;
+
+	code->op = deref;
+	code->offset = offset;
+	/* Reset the last type if used */
+	ctx->last_type = NULL;
+	return 0;
+}
+
+static int parse_this_cpu(char *arg, struct fetch_insn **pcode,
+			  struct fetch_insn *end,
+			  struct traceprobe_parse_context *ctx)
+{
+	struct fetch_insn *code;
+	bool is_ptr = false;
+	int ret;
+
+	if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX)) {
+		arg += THIS_CPU_PTR_LEN;
+		ctx->offset += THIS_CPU_PTR_LEN;
+		is_ptr = true;
+	} else if (str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+		arg += THIS_CPU_READ_LEN;
+		ctx->offset += THIS_CPU_READ_LEN;
+	} else
+		return -EINVAL;
+
+	ret = handle_dereference(arg, pcode, end, ctx, FETCH_OP_CPU_PTR, 0);
+	if (ret || is_ptr)
+		return ret;
+
+	/* this_cpu_read(VAR) -> +0(this_cpu_ptr(VAR)) */
+	code = *pcode;
+	code++;
+	if (code == end) {
+		trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+		return -EINVAL;
+	}
+	code->op = FETCH_OP_DEREF;
+	code->offset = 0;
+	*pcode = code;
+	return 0;
+}
+
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 
 static u32 btf_type_int(const struct btf_type *t)
@@ -913,11 +1007,6 @@ static char *find_matched_close_paren(char *s)
 	return NULL;
 }
 
-static int
-parse_probe_arg(char *arg, const struct fetch_type *type,
-		struct fetch_insn **pcode, struct fetch_insn *end,
-		struct traceprobe_parse_context *ctx);
-
 static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			   struct fetch_insn *end,
 			   struct traceprobe_parse_context *ctx)
@@ -970,7 +1059,9 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		/* Skip '(' */
 		ctx->offset += 1;
 		tmp++;
-	} else if (*tmp == '+' || *tmp == '-') {
+	} else if (*tmp == '+' || *tmp == '-' ||
+		   str_has_prefix(tmp, THIS_CPU_PTR_PREFIX) ||
+		   str_has_prefix(tmp, THIS_CPU_READ_PREFIX)) {
 		/* Dereference can have another field access inside it. */
 		char *open = strchr(tmp + 1, '(');
 
@@ -1483,36 +1574,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 		}
 		ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
 		arg = tmp + 1;
-		tmp = strrchr(arg, ')');
-		if (!tmp) {
-			trace_probe_log_err(ctx->offset + strlen(arg),
-					    DEREF_OPEN_BRACE);
-			return -EINVAL;
-		} else {
-			const struct fetch_type *t2 = find_fetch_type(NULL, ctx->flags);
-			int cur_offs = ctx->offset;
-
-			*tmp = '\0';
-			ret = parse_probe_arg(arg, t2, &code, end, ctx);
-			if (ret)
-				break;
-			ctx->offset = cur_offs;
-			if (code->op == FETCH_OP_COMM ||
-			    code->op == FETCH_OP_IMMSTR) {
-				trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
-				return -EINVAL;
-			}
-			if (++code == end) {
-				trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
-				return -EINVAL;
-			}
-			*pcode = code;
-
-			code->op = deref;
-			code->offset = offset;
-			/* Reset the last type if used */
-			ctx->last_type = NULL;
-		}
+		ret = handle_dereference(arg, pcode, end, ctx, deref, offset);
+		if (ret < 0)
+			return ret;
 		break;
 	case '\\':	/* Immediate value */
 		if (arg[1] == '"') {	/* Immediate string */
@@ -1533,15 +1597,18 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 		ret = handle_typecast(arg, pcode, end, ctx);
 		break;
 	default:
-		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
+		if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX) ||
+		    str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+			ret = parse_this_cpu(arg, pcode, end, ctx);
+		} else if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
 			    !tparg_is_function_return(ctx->flags)) {
 				trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
 				return -EINVAL;
 			}
 			ret = parse_btf_arg(arg, pcode, end, ctx);
-			break;
 		}
+		break;
 	}
 	if (!ret && code->op == FETCH_OP_NOP) {
 		/* Parsed, but do not find fetch method */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 37f065b82a32..e6c925817640 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -101,6 +101,7 @@ typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
 	/* Stage 2 (dereference) ops */					\
 	FETCH_OP(DEREF, offset),	/* Dereference: .offset */	\
 	FETCH_OP(UDEREF, offset),	/* User-space dereference: .offset */\
+	FETCH_OP(CPU_PTR, none),	/* Per-CPU pointer: .offset */	\
 	/* Stage 3 (store) ops */					\
 	FETCH_OP(ST_RAW, store),	/* Raw value: .size */		\
 	FETCH_OP(ST_MEM, store),	/* Memory: .offset, .size */	\
@@ -595,7 +596,7 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
 	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
 	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"), \
-	C(TYPECAST_SYM_OFFSET,	"@SYM+/-OFFSET with typecast needs parentheses") \
+	C(TYPECAST_SYM_OFFSET,	"@SYM+/-OFFSET with typecast needs parentheses"), \
 	C(TYPECAST_NOT_ALIGNED,	"Typecast field option is not byte-aligned"), \
 	C(TYPECAST_BAD_ARROW,	"Typecast field option does not support -> operator"),
 
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index d0e9662cde00..8db12f758fda 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,35 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
 	struct fetch_insn *s3 = NULL;
 	int total = 0, ret = 0, i = 0;
 	u32 loc = 0;
-	unsigned long lval = val;
+	unsigned long lval, llval = val;
 
 stage2:
 	/* 2nd stage: dereference memory if needed */
 	do {
-		if (code->op == FETCH_OP_DEREF) {
-			lval = val;
+		lval = val;
+		switch (code->op) {
+		case FETCH_OP_DEREF:
 			ret = probe_mem_read(&val, (void *)val + code->offset,
 					     sizeof(val));
-		} else if (code->op == FETCH_OP_UDEREF) {
-			lval = val;
+			break;
+		case FETCH_OP_UDEREF:
 			ret = probe_mem_read_user(&val,
 				 (void *)val + code->offset, sizeof(val));
-		} else
 			break;
+		case FETCH_OP_CPU_PTR:
+			val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+			ret = 0;
+			break;
+		default:
+			lval = llval;
+			goto out;
+		}
 		if (ret)
 			return ret;
+		llval = lval;
 		code++;
 	} while (1);
+out:
 
 	s3 = code;
 stage3:


^ permalink raw reply related

* [PATCH v7 08/10] tracing/probes: Add $current variable support
From: Masami Hiramatsu (Google) @ 2026-06-23  1:45 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since we can use the BTF to cast value to a structure pointer type,
it is useful to introduce "$current" special variable support to
fetcharg.

User can define a fetcharg to access current task_struct properties
using BTF info. e.g.

  $current->cpus_ptr

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Fix to use force-typecast for task_struct implicitly.
 Changes in v6:
  - Rebased on dump fetcharg patch.
  - Remove function name/eprobe requirement for $current.
 Changes in v5:
  - Use s32 for bof_find_btf_id().
 Changes in v4:
  - Add $current in README when CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y case.
  - Fix to prohibit using $current in eprobes and address based kprobes.
 Changes in v3:
  - Remove $current support from eprobes (because eprobes is only for event)
  - Prohibit uprobes to use $current.
 Changes in v2:
   - Support to parse $current in parse_btf_arg().
   - If no typecast on $current, it automatically casted to task_struct.
   - Check error case if $current follows something except for "-".
---
 Documentation/trace/fprobetrace.rst |    1 +
 Documentation/trace/kprobetrace.rst |    1 +
 kernel/trace/trace.c                |    4 ++--
 kernel/trace/trace_probe.c          |   38 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_probe.h          |    1 +
 kernel/trace/trace_probe_tmpl.h     |    3 +++
 6 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 290a9e6f7491..3392cab016b3 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -50,6 +50,7 @@ Synopsis of fprobe-events
   $argN         : Fetch the Nth function argument. (N >= 1) (\*2)
   $retval       : Fetch return value.(\*3)
   $comm         : Fetch current task comm.
+  $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
   \IMM          : Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a62707e6a9f2..81e4fe38791d 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -53,6 +53,7 @@ Synopsis of kprobe_events
   $argN		: Fetch the Nth function argument. (N >= 1) (\*1)
   $retval	: Fetch return value.(\*2)
   $comm		: Fetch current task comm.
+  $current      : Fetch the address of the current task_struct.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0e36af853199..7a5676524f1a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4323,13 +4323,13 @@ static const char readme_msg[] =
 	"\t     args: <name>=fetcharg[:type]\n"
 	"\t fetcharg: (%<register>|$<efield>), @<address>, @<symbol>[+|-<offset>],\n"
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
-	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
+	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>, $current\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 	"\t           [(structname[,field])]<argname>[->field[->field|.field...]],\n"
 	"\t           [(structname[,field])](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
-	"\t           $stack<index>, $stack, $retval, $comm,\n"
+	"\t           $stack<index>, $stack, $retval, $comm, $current\n"
 #endif
 	"\t           +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
 	"\t     kernel return probes support: $retval, $arg<N>, $comm\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 755c0f94a54b..ddcbc8c12ad9 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -700,7 +700,9 @@ static int parse_btf_arg(char *varname,
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT))
+	/* Note: field is not separated at this point, so check prefix. */
+	if (!str_has_prefix(varname, "$current") &&
+	    !ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);
@@ -713,6 +715,22 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
+	if (!strcmp(varname, "$current")) {
+		s32 stid;
+
+		code->op = FETCH_OP_CURRENT;
+		/* If no typecast is specified for $current, use task_struct by default */
+		stid = bpf_find_btf_id("task_struct", BTF_KIND_STRUCT, &ctx->struct_btf);
+		if (stid < 0) {
+			trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
+			return -ENOENT;
+		}
+		/* btf_type_skip_modifier() requires u32 for type id. */
+		tid = stid;
+		ctx->last_struct = btf_type_skip_modifiers(ctx->struct_btf, tid, &tid);
+		goto found;
+	}
+
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
 		/* Check whether the function return type is not void, even with typecast. */
@@ -1271,6 +1289,24 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 		return 0;
 	}
 
+	/* $current returns the address of the current task_struct. */
+	if (str_has_prefix(arg, "current")) {
+		/* $current is only supported by kernel probe. */
+		if (!(ctx->flags & TPARG_FL_KERNEL)) {
+			err = TP_ERR_BAD_VAR;
+			goto inval;
+		}
+		arg += strlen("current");
+		if (*arg == '-' && IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS))
+			return parse_btf_arg(orig_arg, pcode, end, ctx);
+
+		if (*arg != '\0')
+			goto inval;
+
+		code->op = FETCH_OP_CURRENT;
+		return 0;
+	}
+
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	len = str_has_prefix(arg, "arg");
 	if (len) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 213d2aea4f06..37f065b82a32 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -92,6 +92,7 @@ typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
 	FETCH_OP(RETVAL, none),		/* Return value */		\
 	FETCH_OP(IMM, imm),		/* Immediate: .immediate */	\
 	FETCH_OP(COMM, none),		/* Current comm */		\
+	FETCH_OP(CURRENT, none),	/* Current task_struct address */\
 	FETCH_OP(ARG, param),		/* Argument: .param = index */	\
 	FETCH_OP(FOFFS, imm),		/* File offset: .immediate */	\
 	FETCH_OP(IMMSTR, string),	/* Allocated string: .data */	\
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index 51436f19083b..d0e9662cde00 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -112,6 +112,9 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
 	case FETCH_OP_IMMSTR:
 		*val = (unsigned long)code->data;
 		break;
+	case FETCH_OP_CURRENT:
+		*val = (unsigned long)current;
+		break;
 	default:
 		return -EILSEQ;
 	}


^ permalink raw reply related

* [PATCH v7 07/10] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-23  1:45 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a field specifier option for the typecast. This works like
container_of() macro.

    (STRUCT[,FIELD[.FIELD2...]])VAR

This is equivalent to :

    container_of(VAR, struct STRUCT, FIELD[.FIELD2...])

For example:

 echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events

This will trace tick_nohz_handler() with its tick_sched::next_tick which
is converted from @timer by contianer_of(tick, struct tick_sched, sched_timer).
So, if you enabkle both fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, we will see something like:


          <idle>-0       [002] d.h1.  3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 f
unction=tick_nohz_handler now=3777450051040
          <idle>-0       [002] d.h1.  3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4
/0x140) next_tick=3777450000000


Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Update according to the allways nested patch.
 Changes in v3:
  - Fix error caret position.
 Changes in v2:
  - Use byteoffset for typecast field offset instead of bitoffset. This fixes negative modulo calculation.
  - Check whether a field is specified after typecast.
  - Reject if typecast field option  has arrow operator.
---
 Documentation/trace/eprobetrace.rst |    5 +
 Documentation/trace/fprobetrace.rst |    8 +-
 Documentation/trace/kprobetrace.rst |    8 +-
 kernel/trace/trace.c                |    4 -
 kernel/trace/trace_probe.c          |  170 ++++++++++++++++++++++++-----------
 kernel/trace/trace_probe.h          |    5 +
 6 files changed, 136 insertions(+), 64 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
   (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that when this is used, the FIELD name does not
-                  need to be prefixed with a '$'.
+                  need to be prefixed with a '$'. ASGN can be specified optionally.
+		  If ASGN is specified, FIELD will be cast to the same offset
+		  position as the ASGN member, rather than to the beginning of
+		  the STRUCT.
   (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
 		  also be used with another FETCHARG instead of FIELD.
 
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                   (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                   and bitfield are supported.
-  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+  (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
-                  ->MEMBER.
-  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                  ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+		  FIELD will be cast to the same offset position as the ASGN member,
+		  rather than to the beginning of the STRUCT.
+  (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
                  also be used with another FETCHARG instead of FIELD.
 
   (\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and bitfield are
                   supported.
-  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+  (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that this is available only when the probe is
-		   on function entry.
-  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+		   on function entry. ASGN can be specified optionally. If ASGN
+		   is specified, FIELD will be cast to the same offset position
+		   as the ASGN member, rather than to the beginning of the STRUCT.
+  (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
                  also be used with another FETCHARG instead of FIELD.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4f70318918c2..0e36af853199 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,8 +4325,8 @@ static const char readme_msg[] =
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
-	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
-	"\t           [(structname)](fetcharg)->field[->field|.field...],\n"
+	"\t           [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+	"\t           [(structname[,field])](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 2a50a9188c0c..755c0f94a54b 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -574,6 +574,65 @@ static int split_next_field(char *varname, char **next_field,
 	return ret;
 }
 
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+				  struct traceprobe_parse_context *ctx)
+{
+	const struct btf_type *type = *ptype;
+	const struct btf_member *field;
+	struct btf *btf = ctx_btf(ctx);
+	char *fieldname = *pfieldname;
+	int bitoffs = 0;
+	u32 anon_offs;
+	char *next;
+	int is_ptr;
+	s32 tid;
+
+	do {
+		next = NULL;
+		is_ptr = split_next_field(fieldname, &next, ctx);
+		if (is_ptr < 0)
+			return is_ptr;
+
+		anon_offs = 0;
+		field = btf_find_struct_member(btf, type, fieldname,
+						&anon_offs);
+		if (IS_ERR(field)) {
+			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+			return PTR_ERR(field);
+		}
+		if (!field) {
+			trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+			return -ENOENT;
+		}
+		/* Add anonymous structure/union offset */
+		bitoffs += anon_offs;
+
+		/* Accumulate the bit-offsets of the dot-connected fields */
+		if (btf_type_kflag(type)) {
+			bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+			ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+		} else {
+			bitoffs += field->offset;
+			ctx->last_bitsize = 0;
+		}
+
+		type = btf_type_skip_modifiers(btf, field->type, &tid);
+		if (!type) {
+			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+			return -EINVAL;
+		}
+
+		if (next)
+			ctx->offset += next - fieldname;
+		fieldname = next;
+	} while (!is_ptr && fieldname);
+
+	*pfieldname = fieldname;
+	*ptype = type;
+
+	return bitoffs;
+}
 /*
  * Parse the field of data structure. The @type must be a pointer type
  * pointing the target data structure type.
@@ -583,16 +642,14 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 			   struct traceprobe_parse_context *ctx)
 {
 	struct fetch_insn *code = *pcode;
-	const struct btf_member *field;
-	u32 bitoffs, anon_offs;
-	bool is_struct = ctx->struct_btf != NULL;
 	struct btf *btf = ctx_btf(ctx);
-	char *next;
-	int is_ptr;
+	bool is_first_field = true;
+	int bitoffs;
 	s32 tid;
 
 	do {
-		if (!is_struct) {
+		/* For the first field of typecast, @type will be the target structure type. */
+		if (!(is_first_field && ctx->struct_btf)) {
 			/* Outer loop for solving arrow operator ('->') */
 			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
 				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -606,60 +663,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return -EINVAL;
 			}
 		}
-		/* Only the first type can skip being a pointer */
-		is_struct = false;
-
-		bitoffs = 0;
-		do {
-			/* Inner loop for solving dot operator ('.') */
-			next = NULL;
-			is_ptr = split_next_field(fieldname, &next, ctx);
-			if (is_ptr < 0)
-				return is_ptr;
-
-			anon_offs = 0;
-			field = btf_find_struct_member(btf, type, fieldname,
-						       &anon_offs);
-			if (IS_ERR(field)) {
-				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-				return PTR_ERR(field);
-			}
-			if (!field) {
-				trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
-				return -ENOENT;
-			}
-			/* Add anonymous structure/union offset */
-			bitoffs += anon_offs;
-
-			/* Accumulate the bit-offsets of the dot-connected fields */
-			if (btf_type_kflag(type)) {
-				bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
-				ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
-			} else {
-				bitoffs += field->offset;
-				ctx->last_bitsize = 0;
-			}
-
-			type = btf_type_skip_modifiers(btf, field->type, &tid);
-			if (!type) {
-				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-				return -EINVAL;
-			}
-
-			ctx->offset += next - fieldname;
-			fieldname = next;
-		} while (!is_ptr && fieldname);
 
+		bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+		if (bitoffs < 0)
+			return bitoffs;
 		if (++code == end) {
 			trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
 			return -EINVAL;
 		}
 		code->op = FETCH_OP_DEREF;	/* TODO: user deref support */
 		code->offset = bitoffs / 8;
+		if (is_first_field && ctx->struct_btf) {
+			/* The first field can be typecasted with field option. */
+			code->offset -= ctx->prefix_byteoffs;
+		}
 		*pcode = code;
 
 		ctx->last_bitoffs = bitoffs % 8;
 		ctx->last_type = type;
+		is_first_field = false;
 	} while (fieldname);
 
 	return 0;
@@ -815,6 +837,46 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
 	return 0;
 }
 
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+	char *field;
+	int ret;
+
+	/* Field option - evaluated later. */
+	field = strchr(casttype, ',');
+	if (field)
+		*field++ = '\0';
+
+	ret = query_btf_struct(casttype, ctx);
+	if (ret < 0) {
+		trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+		return -EINVAL;
+	}
+
+	if (field) {
+		struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+		ctx->offset += field - casttype;
+		ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+		if (ret < 0)
+			return ret;
+		if (ret % 8) {
+			trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+			return -EINVAL;
+		}
+		if (field != NULL) {
+			/* this means @field skips an arrow operator ("->"). */
+			trace_probe_log_err(ctx->offset - 2, TYPECAST_BAD_ARROW);
+			return -EINVAL;
+		}
+		ctx->prefix_byteoffs = ret / 8;
+		/* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+		ctx->last_struct = type;
+	}
+
+	return ret;
+}
+
 /* Find the matching closing parenthesis for a given opening parenthesis. */
 static char *find_matched_close_paren(char *s)
 {
@@ -946,14 +1008,14 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		tmp = close + 2; /* Skip ">" after inner variable name */
 
 	/* resolve the typecast struct name */
-	ret = query_btf_struct(arg + 1, ctx);
-	if (ret < 0) {
-		trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
-		return -EINVAL;
-	}
+	ctx->offset = orig_offset + 1; /* for the '(' */
+	ret = parse_btf_casttype(arg + 1, ctx);
+	if (ret < 0)
+		return ret;
 
 	ctx->offset = orig_offset + tmp - arg;
 	ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+	ctx->prefix_byteoffs = 0;
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index bc487b366da6..213d2aea4f06 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -451,6 +451,7 @@ struct traceprobe_parse_context {
 	unsigned int flags;
 	int offset;
 	int nested_level;
+	int prefix_byteoffs;	/* The byte offset of the prefix field of typecast */
 };
 
 /* Each typecast consumes nested level. So the max number of typecast is 3. */
@@ -593,7 +594,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
 	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
 	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"), \
-	C(TYPECAST_SYM_OFFSET,	"@SYM+/-OFFSET with typecast needs parentheses")
+	C(TYPECAST_SYM_OFFSET,	"@SYM+/-OFFSET with typecast needs parentheses") \
+	C(TYPECAST_NOT_ALIGNED,	"Typecast field option is not byte-aligned"), \
+	C(TYPECAST_BAD_ARROW,	"Typecast field option does not support -> operator"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [PATCH v7 06/10] tracing/probes: Type casting always involves nested calls
From: Masami Hiramatsu (Google) @ 2026-06-23  1:45 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

This allows type casting to various fetchargs without parentheses
by recursively calling parse_probe_arg on the target when type
casting is used.

For example, this allows the following expressions:
 - (STRUCT)%REG->FIELD
 - (STRUCT)$stackN->FIELD
 - (STRUCT)@SYM->FIELD

Note that @SYM+/-OFFSET with typecast needs parentheses like:
  - (STRUCT)(@SYM-8)->FIELD

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Prohibit using @SYM+/-OFFSET without parentheses.
  - Cleanup parse_btf_arg() since ctx->struct_btf is always NULL now.
 Changes in v6:
  - Newly added.
---
 kernel/trace/trace_probe.c |  122 ++++++++++++++++++++++++++------------------
 kernel/trace/trace_probe.h |    4 +
 2 files changed, 74 insertions(+), 52 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e8eae3aab652..2a50a9188c0c 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -691,19 +691,6 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
-	if (ctx->flags & TPARG_FL_TEVENT) {
-		ret = parse_trace_event(varname, code, ctx);
-		if (ret < 0) {
-			trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
-			return ret;
-		}
-		/* TEVENT is only here via a typecast */
-		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
-			return -EINVAL;
-		type = ctx->last_struct;
-		goto found_type;
-	}
-
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
 		/* Check whether the function return type is not void, even with typecast. */
@@ -715,13 +702,6 @@ static int parse_btf_arg(char *varname,
 			tid = ctx->proto->type;
 			goto found;
 		}
-		/*
-		 * Even if we can not find appropriate BTF info, we can still access
-		 * the field via typecast.
-		 */
-		if (ctx->struct_btf)
-			goto found;
-
 		if (field) {
 			trace_probe_log_err(ctx->offset + field - varname,
 					    NO_BTF_ENTRY);
@@ -766,11 +746,7 @@ static int parse_btf_arg(char *varname,
 	return -ENOENT;
 
 found:
-	if (ctx->struct_btf)
-		type = ctx->last_struct;
-	else
-		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
-found_type:
+	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -867,7 +843,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			   struct traceprobe_parse_context *ctx)
 {
 	int orig_offset = ctx->offset;
-	bool nested = false;
+	char *close;
 	char *tmp;
 	int ret;
 
@@ -878,6 +854,17 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		return -EOPNOTSUPP;
 	}
 
+	/*
+	 * Always consider the token after typecast as a nested call
+	 * For example: (STRUCT)VAR->FIELD and (STRUCT)(VAR)->FIELD are same.
+	 * VAR is solved in the nested call.
+	 */
+	ctx->nested_level++;
+	if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+		trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+		return -E2BIG;
+	}
+
 	tmp = strchr(arg, ')');
 	if (!tmp) {
 		trace_probe_log_err(ctx->offset + strlen(arg),
@@ -886,11 +873,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	}
 	*tmp++ = '\0';
 
-	/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+	ctx->offset += tmp - arg;
 	if (*tmp == '(') {
-		char *close = find_matched_close_paren(tmp);
+		close = find_matched_close_paren(tmp);
 
-		ctx->offset += tmp - arg;
 		if (!close) {
 			trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
 			return -EINVAL;
@@ -901,27 +887,65 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 					    TYPECAST_REQ_FIELD);
 			return -EINVAL;
 		}
-
-		ctx->nested_level++;
-		if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
-			trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
-			return -E2BIG;
+		/* Skip '(' */
+		ctx->offset += 1;
+		tmp++;
+	} else if (*tmp == '+' || *tmp == '-') {
+		/* Dereference can have another field access inside it. */
+		char *open = strchr(tmp + 1, '(');
+
+		if (!open) {
+			trace_probe_log_err(ctx->offset,
+					    DEREF_NEED_BRACE);
+			return -EINVAL;
 		}
-		*close = '\0';
+		close = find_matched_close_paren(open);
+		if (!close) {
+			trace_probe_log_err(ctx->offset + strlen(tmp),
+					    DEREF_OPEN_BRACE);
+			return -EINVAL;
+		}
+		close++;
+		/* We expect a field access for typecast */
+		if (close[0] != '-' || close[1] != '>') {
+			trace_probe_log_err(ctx->offset + close - tmp + 1,
+					    TYPECAST_REQ_FIELD);
+			return -EINVAL;
+		}
+	} else {
+		if (tmp[0] == '@') {
+			close = strpbrk(tmp, "+-");
+			if (close && isdigit(close[1])) {
+				trace_probe_log_err(ctx->offset,
+						    TYPECAST_SYM_OFFSET);
+				return -EINVAL;
+			}
+		}
+		/* Inner variable name */
+		close = strchr(tmp, '-');
+		if (!close || close[1] != '>') {
+			trace_probe_log_err(ctx->offset + strlen(tmp),
+					    TYPECAST_REQ_FIELD);
+			return -EINVAL;
+		}
+	}
+	*close = '\0';
 
-		ctx->offset += 1;	/* for the '(' */
-		/* We need to parse the nested one */
-		ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
-				pcode, end, ctx);
-		if (ret < 0)
-			return ret;
-		ctx->nested_level--;
-		clear_struct_btf(ctx);
+	/* We need to parse the nested one */
+	ret = parse_probe_arg(tmp, find_fetch_type(NULL, ctx->flags),
+			      pcode, end, ctx);
+	if (ret < 0)
+		return ret;
+	ctx->nested_level--;
+	clear_struct_btf(ctx);
 
-		tmp = close + 3;/* Skip "->" after closing parenthesis */
-		nested = true;
-	}
+	/* Let tmp point the field name. */
+	if (close[1] == '-')
+		tmp = close + 3; /* Skip "->" after closing parenthesis */
+	else
+		tmp = close + 2; /* Skip ">" after inner variable name */
 
+	/* resolve the typecast struct name */
 	ret = query_btf_struct(arg + 1, ctx);
 	if (ret < 0) {
 		trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
@@ -929,11 +953,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	}
 
 	ctx->offset = orig_offset + tmp - arg;
-	/* If it is nested, tmp points to the field name. */
-	if (nested)
-		ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
-	else
-		ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 122d31b1cb14..bc487b366da6 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -453,6 +453,7 @@ struct traceprobe_parse_context {
 	int nested_level;
 };
 
+/* Each typecast consumes nested level. So the max number of typecast is 3. */
 #define TRACEPROBE_MAX_NESTED_LEVEL 3
 
 extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
@@ -591,7 +592,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
 	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
 	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
-	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"),
+	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"), \
+	C(TYPECAST_SYM_OFFSET,	"@SYM+/-OFFSET with typecast needs parentheses")
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [PATCH v7 05/10] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-23  1:44 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When we hit an open parenthesis right after typecast closing
parenthesis, it means we have nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.

For example, to cast a DATA_MEMBER of VAR structure to STRUCT pointer
and get MEMBER value.

  (STRUCT)(VAR->DATA_MEMBER)->MEMBER

Also, we can nest typecast.

  (STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1

Currently the max nest level is limited to 3.

This also allows user to use typecasting for registers or stacks on
kprobe events. e.g.

  (STRUCT)(%ax)->MEMBER

  (STRUCT)($stack0)->MEMBER


Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Add a WARN_ON_ONCE check for leaking nested_level (it must not happen.)
 Changes in v4:
  - Use orig_offset for reporting NO_PTR_STRCT error.
 Changes in v2:
  - Fix to skip "->" after closing parenthetsis.
---
 Documentation/trace/eprobetrace.rst |    2 +
 Documentation/trace/fprobetrace.rst |    2 +
 Documentation/trace/kprobetrace.rst |    2 +
 kernel/trace/trace.c                |    1 
 kernel/trace/trace_probe.c          |   81 ++++++++++++++++++++++++++++++++---
 kernel/trace/trace_probe.h          |    7 +++
 6 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that when this is used, the FIELD name does not
                   need to be prefixed with a '$'.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+		  also be used with another FETCHARG instead of FIELD.
 
 Types
 -----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
   (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                 also be used with another FETCHARG instead of FIELD.
 
   (\*1) This is available only when BTF is enabled.
   (\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that this is available only when the probe is
 		   on function entry.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                 also be used with another FETCHARG instead of FIELD.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
         is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa93e7b01146..4f70318918c2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4326,6 +4326,7 @@ static const char readme_msg[] =
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
+	"\t           [(structname)](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index f318f5f88b7e..e8eae3aab652 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -839,10 +839,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
 	return 0;
 }
 
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+	char *p = s;
+	int count = 0;
+
+	while (*p) {
+		if (*p == '(')
+			count++;
+		else if (*p == ')') {
+			if (--count == 0)
+				return p;
+		}
+		p++;
+	}
+	return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+		struct fetch_insn **pcode, struct fetch_insn *end,
+		struct traceprobe_parse_context *ctx);
+
 static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			   struct fetch_insn *end,
 			   struct traceprobe_parse_context *ctx)
 {
+	int orig_offset = ctx->offset;
+	bool nested = false;
 	char *tmp;
 	int ret;
 
@@ -859,19 +884,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 				    DEREF_OPEN_BRACE);
 		return -EINVAL;
 	}
-	*tmp = '\0';
-	ret = query_btf_struct(arg + 1, ctx);
-	*tmp = ')';
+	*tmp++ = '\0';
+
+	/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+	if (*tmp == '(') {
+		char *close = find_matched_close_paren(tmp);
 
+		ctx->offset += tmp - arg;
+		if (!close) {
+			trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+			return -EINVAL;
+		}
+		/* We expect a field access for typecast */
+		if (close[1] != '-' || close[2] != '>') {
+			trace_probe_log_err(ctx->offset + close - tmp + 1,
+					    TYPECAST_REQ_FIELD);
+			return -EINVAL;
+		}
+
+		ctx->nested_level++;
+		if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+			trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+			return -E2BIG;
+		}
+		*close = '\0';
+
+		ctx->offset += 1;	/* for the '(' */
+		/* We need to parse the nested one */
+		ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+				pcode, end, ctx);
+		if (ret < 0)
+			return ret;
+		ctx->nested_level--;
+		clear_struct_btf(ctx);
+
+		tmp = close + 3;/* Skip "->" after closing parenthesis */
+		nested = true;
+	}
+
+	ret = query_btf_struct(arg + 1, ctx);
 	if (ret < 0) {
-		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+		trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
 		return -EINVAL;
 	}
 
-	tmp++;
-
-	ctx->offset += tmp - arg;
-	ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ctx->offset = orig_offset + tmp - arg;
+	/* If it is nested, tmp points to the field name. */
+	if (nested)
+		ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+	else
+		ret = parse_btf_arg(tmp, pcode, end, ctx);
 	return ret;
 }
 
@@ -1629,6 +1691,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 			      ctx);
 	if (ret < 0)
 		goto fail;
+	/* nested_level must be 0 here, otherwise there is a bug. */
+	if (WARN_ON_ONCE(ctx->nested_level))
+		goto fail;
 
 	/* Update storing type if BTF is available */
 	if (IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS) &&
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 93f75bc384c7..122d31b1cb14 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -450,8 +450,11 @@ struct traceprobe_parse_context {
 	struct trace_probe *tp;
 	unsigned int flags;
 	int offset;
+	int nested_level;
 };
 
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
 extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
 				      const char *argv,
 				      struct traceprobe_parse_context *ctx);
@@ -586,7 +589,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
 	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
-	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
+	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
+	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
+	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [PATCH v7 04/10] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-23  1:44 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Support BTF typecast feature on other probe events, but only if it is
kernel function entry or return, and must use function parameter name
or $retval. This means you can do:

  (STRUCT)PARAM->MEMBER

Note: you can not use other variables like $stackN, %reg etc. That
needs nesting support.

To support other probe events, we just need to use last_struct type
when we find a function parameter in parse_btf_arg().

This also updates <tracefs>/README file to show struct typecast.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v5:
  - Add comments about $retval with typecast.
  - Even if the type of retvalue is not known, if user specifies typecast,
    use it for its type.
 Changes in v3:
  - Clarify the limitation.
 Changes in v2:
  - Fix to re-enable typecast on eprobe.
---
 Documentation/trace/fprobetrace.rst |    3 +++
 Documentation/trace/kprobetrace.rst |    4 ++++
 kernel/trace/trace.c                |    2 +-
 kernel/trace/trace_probe.c          |   23 +++++++++++++++++------
 kernel/trace/trace_probe.h          |    5 +++++
 5 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                   (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                   and bitfield are supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER.
 
   (\*1) This is available only when BTF is enabled.
   (\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and bitfield are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER. Note that this is available only when the probe is
+		   on function entry.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
         is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..aa93e7b01146 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,7 +4325,7 @@ static const char readme_msg[] =
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
-	"\t           <argname>[->field[->field|.field...]],\n"
+	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 5e6a680b2ce7..f318f5f88b7e 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -706,7 +706,7 @@ static int parse_btf_arg(char *varname,
 
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
-		/* Check whether the function return type is not void */
+		/* Check whether the function return type is not void, even with typecast. */
 		if (query_btf_context(ctx) == 0) {
 			if (ctx->proto->type == 0) {
 				trace_probe_log_err(ctx->offset, NO_RETVAL);
@@ -715,6 +715,13 @@ static int parse_btf_arg(char *varname,
 			tid = ctx->proto->type;
 			goto found;
 		}
+		/*
+		 * Even if we can not find appropriate BTF info, we can still access
+		 * the field via typecast.
+		 */
+		if (ctx->struct_btf)
+			goto found;
+
 		if (field) {
 			trace_probe_log_err(ctx->offset + field - varname,
 					    NO_BTF_ENTRY);
@@ -759,7 +766,10 @@ static int parse_btf_arg(char *varname,
 	return -ENOENT;
 
 found:
-	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+	if (ctx->struct_btf)
+		type = ctx->last_struct;
+	else
+		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
 found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -836,10 +846,11 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	char *tmp;
 	int ret;
 
-	/* Currently this only works for eprobes */
-	if (!(ctx->flags & TPARG_FL_TEVENT)) {
-		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
-		return -EINVAL;
+	if (!(tparg_is_event_probe(ctx->flags) ||
+	      tparg_is_function_entry(ctx->flags) ||
+	      tparg_is_function_return(ctx->flags))) {
+		trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+		return -EOPNOTSUPP;
 	}
 
 	tmp = strchr(arg, ')');
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index a9a7e0fc04ca..93f75bc384c7 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -429,6 +429,11 @@ static inline bool tparg_is_function_return(unsigned int flags)
 	return (flags & TPARG_FL_LOC_MASK) == (TPARG_FL_KERNEL | TPARG_FL_RETURN);
 }
 
+static inline bool tparg_is_event_probe(unsigned int flags)
+{
+	return !!(flags & TPARG_FL_TEVENT);
+}
+
 struct traceprobe_parse_context {
 	struct trace_event_call *event;
 	/* BTF related parameters */


^ permalink raw reply related

* [PATCH v7 03/10] tracing/probes: Support dumping fetcharg program for debugging dynamic events
From: Masami Hiramatsu (Google) @ 2026-06-23  1:44 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

For debugging probe events, it is helpful to verify the compiled
fetch instructions for each probe argument. This introduces a new
kernel config CONFIG_PROBE_EVENTS_DUMP_FETCHARG to decode the
instruction sequence of each argument and display it under a
commented line starting with '#' immediately following the dynamic
event definition (such as in dynamic_events, kprobe_events,
uprobe_events, etc.).

For example:
 /sys/kernel/tracing # cat dynamic_events
 p:kprobes/p_vfs_read_0 vfs_read arg1=+0(file):ustring arg2=%ax:x16
 #  arg1: ARG(0) -> ST_USTRING(offset=0,size=4) -> END
 #  arg2: REG(80) -> ST_RAW(size=2) -> END

Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
   - Show trace event field name for FETCH_OP_TP_ARG.
   - Show immediate string value for FETCH_OP_IMMSTR.
   - Fix style issues warned by checkpatch.pl.
 Changes in v6:
   - Newly added.
---
 kernel/trace/Kconfig        |   11 +++++
 kernel/trace/trace_eprobe.c |    2 +
 kernel/trace/trace_fprobe.c |    2 +
 kernel/trace/trace_kprobe.c |    2 +
 kernel/trace/trace_probe.c  |   96 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |   79 +++++++++++++++++++++--------------
 kernel/trace/trace_uprobe.c |    3 +
 7 files changed, 163 insertions(+), 32 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..ed83fbfb4b7c 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -779,6 +779,17 @@ config PROBE_EVENTS_BTF_ARGS
 	  kernel function entry or a tracepoint.
 	  This is available only if BTF (BPF Type Format) support is enabled.
 
+config PROBE_EVENTS_DUMP_FETCHARG
+	depends on PROBE_EVENTS
+	bool "Dump of dynamic probe event fetch-arguments"
+	default n
+	help
+	  This shows the dump of fetch-arguments of dynamic probe events
+	  alongside their event definitions in the dynamic_events file
+	  as comment lines. This is useful to debug the probe events.
+
+	  If unsure, say N.
+
 config KPROBE_EVENTS
 	depends on KPROBES
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index 50518b071414..462c31145733 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -87,6 +87,8 @@ static int eprobe_dyn_event_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", ep->tp.args[i].name, ep->tp.args[i].comm);
 	seq_putc(m, '\n');
 
+	trace_probe_dump_args(m, &ep->tp);
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 4d1abbf66229..536781cd4c47 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -1449,6 +1449,8 @@ static int trace_fprobe_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", tf->tp.args[i].name, tf->tp.args[i].comm);
 	seq_putc(m, '\n');
 
+	trace_probe_dump_args(m, &tf->tp);
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a8420e6abb56..cfa807d8e760 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1320,6 +1320,8 @@ static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", tk->tp.args[i].name, tk->tp.args[i].comm);
 	seq_putc(m, '\n');
 
+	trace_probe_dump_args(m, &tk->tp);
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index d1c55596725b..5e6a680b2ce7 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -2394,3 +2394,99 @@ int trace_probe_print_args(struct trace_seq *s, struct probe_arg *args, int nr_a
 	}
 	return 0;
 }
+
+#ifdef CONFIG_PROBE_EVENTS_DUMP_FETCHARG
+
+struct fetch_op_decode {
+	const char *name;
+	void (*decode)(struct seq_file *m, struct fetch_insn *insn);
+};
+
+static const struct fetch_op_decode fetch_op_decode[];
+
+static void fetcharg_decode_none(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_puts(m, fetch_op_decode[insn->op].name);
+}
+
+static void fetcharg_decode_param(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(%u)", fetch_op_decode[insn->op].name, insn->param);
+}
+
+static void fetcharg_decode_imm(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(0x%lx)", fetch_op_decode[insn->op].name, insn->immediate);
+}
+
+static void fetcharg_decode_string(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, (char *)insn->data);
+}
+
+static void fetcharg_decode_symbol(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, (char *)insn->data);
+}
+
+static void fetcharg_decode_offset(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(offset=%d)", fetch_op_decode[insn->op].name, insn->offset);
+}
+
+static void fetcharg_decode_store(struct seq_file *m, struct fetch_insn *insn)
+{
+	if (insn->op == FETCH_OP_ST_RAW)
+		seq_printf(m, "%s(size=%u)", fetch_op_decode[insn->op].name, insn->size);
+	else
+		seq_printf(m, "%s(offset=%d,size=%u)", fetch_op_decode[insn->op].name,
+			  insn->offset, insn->size);
+}
+
+static void fetcharg_decode_bf(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(basesize=%u,lshift=%u,rshift=%u)",
+		   fetch_op_decode[insn->op].name, insn->basesize, insn->lshift, insn->rshift);
+}
+
+static void fetcharg_decode_tp_arg(struct seq_file *m, struct fetch_insn *insn)
+{
+	struct ftrace_event_field *field = insn->data;
+
+	seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, field->name);
+}
+
+#define FETCH_OP(opname, decode_fn) \
+	[FETCH_OP_##opname] = { .name = #opname, .decode = fetcharg_decode_##decode_fn }
+
+static const struct fetch_op_decode fetch_op_decode[] = FETCH_OP_LIST;
+#undef FETCH_OP
+
+static void trace_probe_dump_arg(struct seq_file *m, struct probe_arg *parg)
+{
+	int i;
+
+	seq_printf(m, "#  %s: ", parg->name);
+	for (i = 0; i < FETCH_INSN_MAX; i++) {
+		struct fetch_insn *insn = parg->code + i;
+
+		if (insn->op >= ARRAY_SIZE(fetch_op_decode) || !fetch_op_decode[insn->op].decode)
+			seq_printf(m, "unknown(%d)", insn->op);
+		else
+			fetch_op_decode[insn->op].decode(m, insn);
+
+		if (insn->op == FETCH_OP_END)
+			break;
+		seq_puts(m, " -> ");
+	}
+	seq_putc(m, '\n');
+}
+
+void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp)
+{
+	int i;
+
+	for (i = 0; i < tp->nr_args; i++)
+		trace_probe_dump_arg(m, &tp->args[i]);
+}
+#endif /* CONFIG_PROBE_EVENTS_DUMP_FETCHARG */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index cd586e67b21a..a9a7e0fc04ca 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -83,38 +83,46 @@ static nokprobe_inline u32 update_data_loc(u32 loc, int consumed)
 /* Printing function type */
 typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
 
-enum fetch_op {
-	FETCH_OP_NOP = 0,
-	// Stage 1 (load) ops
-	FETCH_OP_REG,		/* Register : .param = offset */
-	FETCH_OP_STACK,		/* Stack : .param = index */
-	FETCH_OP_STACKP,	/* Stack pointer */
-	FETCH_OP_RETVAL,	/* Return value */
-	FETCH_OP_IMM,		/* Immediate : .immediate */
-	FETCH_OP_COMM,		/* Current comm */
-	FETCH_OP_ARG,		/* Function argument : .param */
-	FETCH_OP_FOFFS,		/* File offset: .immediate */
-	FETCH_OP_IMMSTR,	/* Allocated string: .data */
-	FETCH_OP_EDATA,		/* Entry data: .offset */
-	// Stage 2 (dereference) op
-	FETCH_OP_DEREF,		/* Dereference: .offset */
-	FETCH_OP_UDEREF,	/* User-space Dereference: .offset */
-	// Stage 3 (store) ops
-	FETCH_OP_ST_RAW,	/* Raw: .size */
-	FETCH_OP_ST_MEM,	/* Mem: .offset, .size */
-	FETCH_OP_ST_UMEM,	/* Mem: .offset, .size */
-	FETCH_OP_ST_STRING,	/* String: .offset, .size */
-	FETCH_OP_ST_USTRING,	/* User String: .offset, .size */
-	FETCH_OP_ST_SYMSTR,	/* Kernel Symbol String: .offset, .size */
-	FETCH_OP_ST_EDATA,	/* Store Entry Data: .offset */
-	// Stage 4 (modify) op
-	FETCH_OP_MOD_BF,	/* Bitfield: .basesize, .lshift, .rshift */
-	// Stage 5 (loop) op
-	FETCH_OP_LP_ARRAY,	/* Array: .param = loop count */
-	FETCH_OP_TP_ARG,	/* Trace Point argument */
-	FETCH_OP_END,
-	FETCH_NOP_SYMBOL,	/* Unresolved Symbol holder */
-};
+#define FETCH_OP_LIST	{						\
+	/* Stage 1 (load) ops */					\
+	FETCH_OP(NOP, none),		/* NOP */			\
+	FETCH_OP(REG, param),		/* Register: .param = offset */	\
+	FETCH_OP(STACK, param),		/* Stack: .param = index */	\
+	FETCH_OP(STACKP, none),		/* Stack pointer */		\
+	FETCH_OP(RETVAL, none),		/* Return value */		\
+	FETCH_OP(IMM, imm),		/* Immediate: .immediate */	\
+	FETCH_OP(COMM, none),		/* Current comm */		\
+	FETCH_OP(ARG, param),		/* Argument: .param = index */	\
+	FETCH_OP(FOFFS, imm),		/* File offset: .immediate */	\
+	FETCH_OP(IMMSTR, string),	/* Allocated string: .data */	\
+	FETCH_OP(EDATA, offset),	/* Entry data: .offset */	\
+	FETCH_OP(TP_ARG, tp_arg),	/* Tracepoint argument: .data */\
+	/* Stage 2 (dereference) ops */					\
+	FETCH_OP(DEREF, offset),	/* Dereference: .offset */	\
+	FETCH_OP(UDEREF, offset),	/* User-space dereference: .offset */\
+	/* Stage 3 (store) ops */					\
+	FETCH_OP(ST_RAW, store),	/* Raw value: .size */		\
+	FETCH_OP(ST_MEM, store),	/* Memory: .offset, .size */	\
+	FETCH_OP(ST_UMEM, store),	/* User memory: .offset, .size */\
+	FETCH_OP(ST_STRING, store),	/* String: .offset, .size */	\
+	FETCH_OP(ST_USTRING, store),	/* User string: .offset, .size */\
+	FETCH_OP(ST_SYMSTR, store),	/* Symbol name: .offset, .size */\
+	FETCH_OP(ST_EDATA, offset),	/* Entry data: .offset */	\
+	/* Stage 4 (modify) op */					\
+	FETCH_OP(MOD_BF, bf),		/* Bitfield: .basesize, .lshift, .rshift*/\
+	/* Stage 5 (loop) op */						\
+	FETCH_OP(LP_ARRAY, param),	/* Loop array: .param = count */\
+	/* End */							\
+	FETCH_OP(END, none),						\
+	/* Unresolved Symbol holder */					\
+	FETCH_OP(NOP_SYMBOL, symbol),	/* Non loaded symbol: .data = symbol name */\
+}
+
+#define FETCH_OP(opname, decode_fn) FETCH_OP_##opname
+enum fetch_op FETCH_OP_LIST;
+#undef FETCH_OP
+
+#define FETCH_NOP_SYMBOL FETCH_OP_NOP_SYMBOL
 
 struct fetch_insn {
 	enum fetch_op op;
@@ -370,6 +378,13 @@ bool trace_probe_match_command_args(struct trace_probe *tp,
 int trace_probe_create(const char *raw_command, int (*createfn)(int, const char **));
 int trace_probe_print_args(struct trace_seq *s, struct probe_arg *args, int nr_args,
 		 u8 *data, void *field);
+#ifdef CONFIG_PROBE_EVENTS_DUMP_FETCHARG
+void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp);
+#else
+static inline void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp)
+{
+}
+#endif
 
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 int traceprobe_get_entry_data_size(struct trace_probe *tp);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c274346853d1..b2e264a4b96c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -765,6 +765,9 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
 
 	seq_putc(m, '\n');
+
+	trace_probe_dump_args(m, &tu->tp);
+
 	return 0;
 }
 


^ permalink raw reply related

* [PATCH v7 02/10] tracing/probes: Rename FETCH_OP_DATA to FETCH_OP_IMMSTR
From: Masami Hiramatsu (Google) @ 2026-06-23  1:44 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since FETCH_OP_DATA is used solely to store immediate string
values, rename it to the more specific FETCH_OP_IMMSTR.

No behavior change, just rename it.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/trace_probe.c      |   12 ++++++------
 kernel/trace/trace_probe.h      |    2 +-
 kernel/trace/trace_probe_tmpl.h |    2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 502fa6da5949..d1c55596725b 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1307,7 +1307,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 				break;
 			ctx->offset = cur_offs;
 			if (code->op == FETCH_OP_COMM ||
-			    code->op == FETCH_OP_DATA) {
+			    code->op == FETCH_OP_IMMSTR) {
 				trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
 				return -EINVAL;
 			}
@@ -1328,7 +1328,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 			ret = __parse_imm_string(arg + 2, &tmp, ctx->offset + 2);
 			if (ret)
 				break;
-			code->op = FETCH_OP_DATA;
+			code->op = FETCH_OP_IMMSTR;
 			code->data = tmp;
 		} else {
 			ret = str_to_immediate(arg + 1, &code->immediate);
@@ -1483,7 +1483,7 @@ static int finalize_fetch_insn(struct fetch_insn *code,
 		} else {
 			if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_UDEREF &&
 			    code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM &&
-			    code->op != FETCH_OP_DATA && code->op != FETCH_OP_TP_ARG) {
+			    code->op != FETCH_OP_IMMSTR && code->op != FETCH_OP_TP_ARG) {
 				trace_probe_log_err(ctx->offset + type_offset,
 						    BAD_STRING);
 				return -EINVAL;
@@ -1492,7 +1492,7 @@ static int finalize_fetch_insn(struct fetch_insn *code,
 
 		if (!strcmp(parg->type->name, "symstr") ||
 		    (code->op == FETCH_OP_IMM || code->op == FETCH_OP_COMM ||
-		     code->op == FETCH_OP_DATA) || code->op == FETCH_OP_TP_ARG ||
+		     code->op == FETCH_OP_IMMSTR) || code->op == FETCH_OP_TP_ARG ||
 		     parg->count) {
 			/*
 			 * IMM, DATA and COMM is pointing actual address, those
@@ -1668,7 +1668,7 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 	if (ret < 0) {
 		for (code = tmp; code < tmp + FETCH_INSN_MAX; code++)
 			if (code->op == FETCH_NOP_SYMBOL ||
-			    code->op == FETCH_OP_DATA)
+			    code->op == FETCH_OP_IMMSTR)
 				kfree(code->data);
 	}
 	kfree(tmp);
@@ -1767,7 +1767,7 @@ void traceprobe_free_probe_arg(struct probe_arg *arg)
 
 	while (code && code->op != FETCH_OP_END) {
 		if (code->op == FETCH_NOP_SYMBOL ||
-		    code->op == FETCH_OP_DATA)
+		    code->op == FETCH_OP_IMMSTR)
 			kfree(code->data);
 		code++;
 	}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 0f09f7aaf93f..cd586e67b21a 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -94,7 +94,7 @@ enum fetch_op {
 	FETCH_OP_COMM,		/* Current comm */
 	FETCH_OP_ARG,		/* Function argument : .param */
 	FETCH_OP_FOFFS,		/* File offset: .immediate */
-	FETCH_OP_DATA,		/* Allocated data: .data */
+	FETCH_OP_IMMSTR,	/* Allocated string: .data */
 	FETCH_OP_EDATA,		/* Entry data: .offset */
 	// Stage 2 (dereference) op
 	FETCH_OP_DEREF,		/* Dereference: .offset */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f39b37fcdb3b..51436f19083b 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -109,7 +109,7 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
 	case FETCH_OP_COMM:
 		*val = (unsigned long)current->comm;
 		break;
-	case FETCH_OP_DATA:
+	case FETCH_OP_IMMSTR:
 		*val = (unsigned long)code->data;
 		break;
 	default:


^ permalink raw reply related

* [PATCH v7 01/10] tracing/probes: Fix double addition of offset for @+FOFFSET
From: Masami Hiramatsu (Google) @ 2026-06-23  1:44 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since commit 533059281ee5 ("tracing: probeevent: Introduce new argument
 fetching code") wrongly use @offset local variable during the parsing,
the offset value is added twice when dereferencing.
Reset the @offset after setting it in FETCH_OP_FOFFS.

Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
---
 kernel/trace/trace_probe.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 98532c503d02..502fa6da5949 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1241,6 +1241,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 
 			code->op = FETCH_OP_FOFFS;
 			code->immediate = (unsigned long)offset;  // imm64?
+			offset = 0;
 		} else {
 			/* uprobes don't support symbols */
 			if (!(ctx->flags & TPARG_FL_KERNEL)) {


^ permalink raw reply related

* [PATCH v7 00/10] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-23  1:44 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest

Hi,

Here is the 7th version of series to introduce more typecast features
to probe events. The previous version is here:

 https://lore.kernel.org/all/178201238795.570818.15573963115625446598.stgit@devnote2/

In this version, I added 2 new fix and cleanup patches and update
according to Sashiko's review. [1/10] is a long-lived issue about
@+FOFFS, which was wrongly adding offset twice. [2/10] is a clean
up patch for renaming fetch_op name (good to dump it). 
This is applicable against probes/core branch on linux-trace tree.

Steve introduced BTF typecast feature for eprobe[1].
This series extends it and add more options:

1. Expanding BTF typecast to kprobe and fprobe.
   (currently only function entry/exit)

2. Introduce container_of like typecast. This adds a "assigned
   member" option to the typecast.

   (STRUCT,MEMBER)VAR->ANOTHER_MEMBER

   This casts VAR to STRUCT type but the VAR is as the address
   of STRUCT.MEMBER. In C, it is:

   container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER

3. Support nested typecast, e.g.

   (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER

   the nest level must be smaller than 3.

4. Add $current variable to point "current" task_struct.
   This is useful with typecast, e.g.

   (task_struct)$current->pid

5. per-cpu dereference support.

   Intrdouce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
   access per-cpu data on the current CPU (accessing other CPU
   data is not stable, because it can be changed.)

   You can access the member of per-cpu data structure using
   typecast like:

   (STRUCT)this_cpu_ptr(VAR)->MEMBER

And added fetcharg dump feature (for debug) and updated test scripts
to test part of them.

Thanks,

---
base-commit: 3ec75d0067f30eb5e0730f033766d6ab2feca7ae

Masami Hiramatsu (Google) (10):
      tracing/probes: Fix double addition of offset for @+FOFFSET
      tracing/probes: Rename FETCH_OP_DATA to FETCH_OP_IMMSTR
      tracing/probes: Support dumping fetcharg program for debugging dynamic events
      tracing/probes: Support typecast for various probe events
      tracing/probes: Support nested typecast
      tracing/probes: Type casting always involves nested calls
      tracing/probes: Support field specifier option for typecast
      tracing/probes: Add $current variable support
      tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
      tracing/probes: Add a new testcase for BTF typecasts


 Documentation/trace/eprobetrace.rst                |    9 
 Documentation/trace/fprobetrace.rst                |   10 
 Documentation/trace/kprobetrace.rst                |   11 
 kernel/trace/Kconfig                               |   11 
 kernel/trace/trace.c                               |    8 
 kernel/trace/trace_eprobe.c                        |    2 
 kernel/trace/trace_fprobe.c                        |    2 
 kernel/trace/trace_kprobe.c                        |    2 
 kernel/trace/trace_probe.c                         |  582 ++++++++++++++++----
 kernel/trace/trace_probe.h                         |   98 ++-
 kernel/trace/trace_probe_tmpl.h                    |   27 +
 kernel/trace/trace_uprobe.c                        |    3 
 samples/trace_events/trace-events-sample.c         |   40 +
 samples/trace_events/trace-events-sample.h         |   34 +
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++
 .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |   11 
 .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc   |   11 
 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc   |    5 
 18 files changed, 756 insertions(+), 161 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox