Linux Documentation


* [PATCH RFC v5 0/6] iio: add Open Sensor Fusion IIO driver
From: Jinseob Kim @ 2026-06-16  7:22 UTC (permalink / raw)
  To: Jonathan Cameron, Rob Herring, Krzysztof Kozlowski, Conor Dooley
  Cc: David Lechner, Nuno Sá, Andy Shevchenko, Jonathan Corbet,
	Shuah Khan, linux-iio, devicetree, linux-doc, linux-kernel,
	Jinseob Kim

Open Sensor Fusion is a sensor aggregation hub interface.  The Linux IIO
driver receives OSF protocol frames from a serdev-attached device,
discovers supported sensor streams from capability reports, and exposes
the supported raw sensor data through IIO devices.

The initial driver supports protocol major version 0 and the receive path
for accelerometer, gyroscope, magnetometer, and temperature samples.  The
current wire magic is OSF0, but OSF0 is a wire-format detail and not the
Linux driver identity.  Protocol compatibility is carried by the
protocol_major and protocol_minor fields in the fixed OSF frame header.
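
As a rough illustration of the compatibility rule described above, a receiver
could gate frames on the magic and on protocol_major before looking at
anything else. The field offsets, the minimum header length, and the policy of
accepting any protocol_minor are assumptions made for this sketch; the cover
letter does not spell out the header layout, and the compatibility rules are
still under review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical layout: only the magic, protocol_major and protocol_minor
 * are named in the cover letter. Offsets and widths are assumptions.
 */
#define OSF_MAGIC		"OSF0"
#define OSF_MAGIC_LEN		4
#define OSF_OFF_MAJOR		4	/* assumed offset of protocol_major */
#define OSF_OFF_MINOR		5	/* assumed offset of protocol_minor */
#define OSF_MIN_HEADER_LEN	6	/* assumed minimum header size */
#define OSF_SUPPORTED_MAJOR	0	/* initial driver supports major 0 */

enum osf_compat {
	OSF_COMPAT_OK,
	OSF_COMPAT_TRUNCATED,
	OSF_COMPAT_BAD_MAGIC,
	OSF_COMPAT_UNSUPPORTED_MAJOR,
};

static enum osf_compat osf_check_header(const uint8_t *buf, size_t len)
{
	assert(buf != NULL);

	if (len < OSF_MIN_HEADER_LEN)
		return OSF_COMPAT_TRUNCATED;

	/* The magic identifies the wire format, not the driver. */
	if (memcmp(buf, OSF_MAGIC, OSF_MAGIC_LEN) != 0)
		return OSF_COMPAT_BAD_MAGIC;

	/* Compatibility is decided by the version fields. */
	if (buf[OSF_OFF_MAJOR] != OSF_SUPPORTED_MAJOR)
		return OSF_COMPAT_UNSUPPORTED_MAJOR;

	/* Assumption: minor revisions within a major stay compatible. */
	return OSF_COMPAT_OK;
}
```

With this shape, a future wire magic could be added without changing how the
version fields decide compatibility.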

This is still RFC because the driver-facing OSF protocol subset, the
compatible binding, and future protocol compatibility rules are being
reviewed.

Runtime testing was done with an OSF GREEN prototype connected to a
Raspberry Pi over UART.  The driver registered osf-accel, osf-gyro,
osf-magn, and osf-temp IIO devices.  Direct raw reads and software kfifo
buffer reads were tested.

Changes since v4:
- Regenerated the series as a full standalone replacement series from a
  clean upstream base.
- Removed previous-version add/delete churn from the generated series.
- Clarified OSF0, protocol_major, and protocol_minor compatibility
  handling.
- Added required vcc-supply support to the binding.
- Added probe-time regulator enablement with devm_regulator_get_enable().
- Added the opensensorfusion vendor prefix.
- Fixed checkpatch cleanup issues in commit messages and driver style.

Jinseob Kim (6):
  dt-bindings: iio: add Open Sensor Fusion device
  Documentation: iio: add Open Sensor Fusion driver overview
  iio: osf: add protocol decoding
  iio: osf: add stream parser
  iio: osf: add UART transport
  iio: osf: register IIO devices from capabilities

 .../bindings/iio/opensensorfusion,osf.yaml    |  59 ++++
 .../devicetree/bindings/vendor-prefixes.yaml  |   2 +
 Documentation/iio/index.rst                   |   1 +
 Documentation/iio/open-sensor-fusion.rst      |  71 ++++
 MAINTAINERS                                   |  13 +
 drivers/iio/Kconfig                           |   1 +
 drivers/iio/Makefile                          |   1 +
 drivers/iio/opensensorfusion/Kconfig          |  14 +
 drivers/iio/opensensorfusion/Makefile         |   6 +
 drivers/iio/opensensorfusion/osf_core.c       | 306 ++++++++++++++++++
 drivers/iio/opensensorfusion/osf_core.h       |  70 ++++
 drivers/iio/opensensorfusion/osf_iio.c        | 275 ++++++++++++++++
 drivers/iio/opensensorfusion/osf_iio.h        |  22 ++
 drivers/iio/opensensorfusion/osf_protocol.c   | 249 ++++++++++++++
 drivers/iio/opensensorfusion/osf_protocol.h   |  97 ++++++
 drivers/iio/opensensorfusion/osf_serdev.c     | 117 +++++++
 drivers/iio/opensensorfusion/osf_stream.c     | 187 +++++++++++
 drivers/iio/opensensorfusion/osf_stream.h     |  31 ++
 18 files changed, 1522 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iio/opensensorfusion,osf.yaml
 create mode 100644 Documentation/iio/open-sensor-fusion.rst
 create mode 100644 drivers/iio/opensensorfusion/Kconfig
 create mode 100644 drivers/iio/opensensorfusion/Makefile
 create mode 100644 drivers/iio/opensensorfusion/osf_core.c
 create mode 100644 drivers/iio/opensensorfusion/osf_core.h
 create mode 100644 drivers/iio/opensensorfusion/osf_iio.c
 create mode 100644 drivers/iio/opensensorfusion/osf_iio.h
 create mode 100644 drivers/iio/opensensorfusion/osf_protocol.c
 create mode 100644 drivers/iio/opensensorfusion/osf_protocol.h
 create mode 100644 drivers/iio/opensensorfusion/osf_serdev.c
 create mode 100644 drivers/iio/opensensorfusion/osf_stream.c
 create mode 100644 drivers/iio/opensensorfusion/osf_stream.h

-- 
2.43.0


* Re: [PATCH v3] Documentation/process: Add Researcher Guidelines
From: Andy Shevchenko @ 2026-06-16  7:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dan Carpenter, Jonathan Corbet, Greg Kroah-Hartman,
	Stefano Zacchiroli, Steven Rostedt, Laura Abbott, Julia Lawall,
	Wenwen Wang, Gustavo A . R . Silva, Thorsten Leemhuis,
	linux-kernel, linux-doc, linux-hardening, Dawei Feng
In-Reply-To: <202606151547.DB4095584@keescook>

On Mon, Jun 15, 2026 at 03:48:02PM -0700, Kees Cook wrote:
> On Mon, Jun 15, 2026 at 03:48:22PM +0200, Andy Shevchenko wrote:
> > On Thu, May 28, 2026 at 01:34:34PM +0300, Dan Carpenter wrote:
> > > On Fri, Mar 04, 2022 at 10:14:18AM -0800, Kees Cook wrote:

...

> > > > +  x86_64 and arm64 defconfig builds with CONFIG_FOO_BAR=y using GCC
> > > > +  11.2 show no new warnings, and LeakMagic no longer warns about this
> > > > +  code path. As we don't have a FooBar device to test with, no runtime
> > > > +  testing was able to be performed.
> > > 
> > > People have started sending commit messages in this exact template and
> > > normally I would ask them to resend with the meta commentary from this
> > > paragraph below the --- cut off line.
> > > 
> > > Do we really want this "Compile tested only" stuff in the permanent git
> > > log?
> > 
> > +1 here, can we rather avoid flooding commit messages with the meta, that
> > anyways is available in lore.kernel.org archives?
> 
> Hm, I have gotten a lot of push-back from maintainers (reasonably)
> wanting to know the specific level of testing patches get. In the case
> of lacking hardware, this seems like useful information still.

I am *not* against providing this information (actually, I am on the same
page); I'm just against putting it into the commit message! We have a comment
block below the '---' cut-off line; put it there, please.

-- 
With Best Regards,
Andy Shevchenko



* Re: [PATCH v5 04/34] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
From: Dongli Zhang @ 2026-06-16  6:47 UTC (permalink / raw)
  To: David Woodhouse, x86, kvm, linux-doc, linux-kernel, xen-devel,
	linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	Paul Durrant, Jonathan Cameron, Marc Zyngier, Sascha Bischoff,
	Jack Allister, joe.jin, Joey Gouly
In-Reply-To: <20260608145455.89187-5-dwmw2@infradead.org>

I tested patches 02, 03, 04, and 26 by customizing QEMU to support kexec live
updates (LUO and KHO), preserving the memfd across kexec.

For my use case, I used KVM_[GS]ET_CLOCK_GUEST instead of the existing
KVM_[GS]ET_CLOCK. I didn't account for the downtime in my QEMU code, although host
TSC never resets across kexec.

Clock drift was zero, and I did not observe any unnecessary master clock updates
after KVM_SET_CLOCK_GUEST completed.


Another interesting observation from my experiments is that tsc_khz changes
across kexec. Since the TSC value itself does not reset across kexec, I'm
wondering whether there is any reason to switch to the new tsc_khz value after
the kexec.

I previously sent a QEMU patch that takes advantage of your KVM commit
ffbb61d09fc5 ("KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl.").

[PATCH 1/1] target/i386/kvm: set VM ioctl KVM_SET_TSC_KHZ to maintain TSC
synchronization
https://lore.kernel.org/qemu-devel/20260210202041.153736-1-dongli.zhang@oracle.com


While live migration involves two different machines, kexec is performed on the
same machine. Given that the TSC value itself is preserved across kexec, would
it make sense to reuse the pre-kexec tsc_khz value instead of using the new
tsc_khz after kexec?

I tested this by using LUO to preserve tsc_khz across kexec, and the results
looked good.

Thank you very much!

Dongli Zhang

On 2026-06-08 7:47 AM, David Woodhouse wrote:
> From: Jack Allister <jalliste@amazon.com>
> 
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based
> on a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
> 
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which
> is necessarily imprecise.
> 
> So on live update, the guest TSC can remain cycle accurate at precisely
> the same offset from the host TSC, but there is no way for userspace to
> restore the KVM clock accurately.
> 
> Even on live migration to a new host, where the accuracy of the guest
> time-keeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest
> TSC, and not suffer an *additional* discontinuity.
> 
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls
> which save and restore the actual PV clock info in
> pvclock_vcpu_time_info.
> 
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed
> through the user-provided pvclock structure to generate the *intended*
> KVM clock value at that point in time, and through the *actual* KVM
> clock calculation. Then kvm->arch.kvmclock_offset is adjusted to
> eliminate the difference.
> 
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
> 
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to
> use *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results
> with the migration data (unless the intent is to refuse to resume on a
> host with bad TSC).
> 
> Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> Reviewed-by: Paul Durrant <paul@xen.org>
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  Documentation/virt/kvm/api.rst |  37 ++++++++
>  arch/x86/kvm/x86.c             | 164 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |   3 +
>  3 files changed, 204 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 52bbbb553ce1..2268b4442df6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6553,6 +6553,43 @@ KVM_S390_KEYOP_SSKE
>    Sets the storage key for the guest address ``guest_addr`` to the key
>    specified in ``key``, returning the previous value in ``key``.
>  
> +4.145 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock is not in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.146 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
>  .. _kvm_run:
>  
>  5. The kvm_run structure
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d9ef165df6a1..b7e5f6e3dc6c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6205,6 +6205,162 @@ static int kvm_get_reg_list(struct kvm_vcpu *vcpu,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_X86_64
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +	struct pvclock_vcpu_time_info hv_clock = {};
> +	struct kvm_vcpu_arch *vcpu = &v->arch;
> +	struct kvm_arch *ka = &v->kvm->arch;
> +	unsigned int seq;
> +
> +	/*
> +	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the pvclock
> +	 * has never been generated at all, call kvm_guest_time_update().
> +	 */
> +	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) || !vcpu->hw_tsc_hz) {
> +		int idx = srcu_read_lock(&v->kvm->srcu);
> +		int ret = kvm_guest_time_update(v);
> +
> +		srcu_read_unlock(&v->kvm->srcu, idx);
> +		if (ret)
> +			return -EINVAL;
> +	}
> +
> +	/*
> +	 * Reconstruct the pvclock from the master clock state, matching
> +	 * exactly what kvm_guest_time_update() writes to the guest.
> +	 */
> +	do {
> +		seq = read_seqcount_begin(&ka->pvclock_sc);
> +
> +		if (!ka->use_master_clock)
> +			return -EINVAL;
> +
> +		hv_clock.tsc_timestamp = kvm_read_l1_tsc(v, ka->master_cycle_now);
> +		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
> +
> +	hv_clock.tsc_shift = vcpu->pvclock_tsc_shift;
> +	hv_clock.tsc_to_system_mul = vcpu->pvclock_tsc_mul;
> +	hv_clock.flags = PVCLOCK_TSC_STABLE_BIT;
> +
> +	if (copy_to_user(argp, &hv_clock, sizeof(hv_clock)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +/*
> + * Reverse the calculation in the hv_clock definition.
> + *
> + * time_ns = ( (cycles << shift) * mul ) >> 32;
> + * (although shift can be negative, so that's bad C)
> + *
> + * So for a single second,
> + * NSEC_PER_SEC = ( ( FREQ_HZ << shift) * mul ) >> 32
> + * NSEC_PER_SEC << 32 = ( FREQ_HZ << shift ) * mul
> + * ( NSEC_PER_SEC << 32 ) / mul = FREQ_HZ << shift
> + * ( NSEC_PER_SEC << 32 ) / mul ) >> shift = FREQ_HZ
> + */
> +static u64 hvclock_to_hz(u32 mul, s8 shift)
> +{
> +	u64 tm = NSEC_PER_SEC << 32;
> +
> +	/* Maximise precision. Shift right until the top bit is set */
> +	tm <<= 2;
> +	shift += 2;
> +
> +	/* While 'mul' is even, increase the shift *after* the division */
> +	while (!(mul & 1)) {
> +		shift++;
> +		mul >>= 1;
> +	}
> +
> +	tm /= mul;
> +
> +	if (shift > 0)
> +		return tm >> shift;
> +	else
> +		return tm << -shift;
> +}
> +
> +static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +	struct pvclock_vcpu_time_info user_hv_clock;
> +	struct kvm *kvm = v->kvm;
> +	struct kvm_arch *ka = &kvm->arch;
> +	u64 curr_tsc_hz, user_tsc_hz;
> +	u64 user_clk_ns;
> +	u64 guest_tsc;
> +	int rc = 0;
> +
> +	if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
> +		return -EFAULT;
> +
> +	if (user_hv_clock.pad0 || user_hv_clock.pad[0] || user_hv_clock.pad[1])
> +		return -EINVAL;
> +
> +	if (!user_hv_clock.tsc_to_system_mul)
> +		return -EINVAL;
> +
> +	if (user_hv_clock.tsc_shift < -32 || user_hv_clock.tsc_shift > 32)
> +		return -EINVAL;
> +
> +	user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
> +				    user_hv_clock.tsc_shift);
> +
> +	kvm_hv_request_tsc_page_update(kvm);
> +
> +	/*
> +	 * kvm_start_pvclock_update() takes tsc_write_lock and opens
> +	 * the pvclock seqcount; kvm_end_pvclock_update() closes both.
> +	 * All clock state modifications between them are atomic with
> +	 * respect to readers in kvm_guest_time_update().
> +	 */
> +	kvm_start_pvclock_update(kvm);
> +	pvclock_update_vm_gtod_copy(kvm);
> +
> +	if (!ka->use_master_clock) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	curr_tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
> +	if (unlikely(curr_tsc_hz == 0)) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (kvm_caps.has_tsc_control)
> +		curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
> +					    v->arch.l1_tsc_scaling_ratio);
> +
> +	/*
> +	 * Allow for a discrepancy of 1 kHz either way between the TSC
> +	 * frequency used to generate the user's pvclock and the current
> +	 * host's measured frequency, since they may not precisely match.
> +	 */
> +	if (user_tsc_hz < curr_tsc_hz - 1000 ||
> +	    user_tsc_hz > curr_tsc_hz + 1000) {
> +		rc = -ERANGE;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Calculate the guest TSC at the new reference point, and the
> +	 * corresponding KVM clock value according to user_hv_clock.
> +	 * Adjust kvmclock_offset so both definitions agree.
> +	 */
> +	guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
> +	user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
> +	ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
> +
> +out:
> +	kvm_end_pvclock_update(kvm);
> +	return rc;
> +}
> +#endif
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>  			 unsigned int ioctl, unsigned long arg)
>  {
> @@ -6605,6 +6761,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  		break;
>  	}
> +#ifdef CONFIG_X86_64
> +	case KVM_SET_CLOCK_GUEST:
> +		r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
> +		break;
> +	case KVM_GET_CLOCK_GUEST:
> +		r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
> +		break;
> +#endif
>  #ifdef CONFIG_KVM_HYPERV
>  	case KVM_GET_SUPPORTED_HV_CPUID:
>  		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c8afa2047bf..9b50191b859c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1669,4 +1669,7 @@ struct kvm_pre_fault_memory {
>  	__u64 padding[5];
>  };
>  
> +#define KVM_SET_CLOCK_GUEST	_IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST	_IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */
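
To make the arithmetic in the quoted patch easier to follow, here is a small
userspace sketch of the forward pvclock conversion, the reverse frequency
derivation, and the 1 kHz tolerance check. hvclock_to_hz() mirrors the quoted
function; pvclock_cycles_to_ns() and tsc_hz_within_tolerance() are
hypothetical helper names, and the guard against shifts of 64 or more is an
addition of this sketch. It relies on the GCC/Clang unsigned __int128
extension.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	UINT64_C(1000000000)

/* ns = ((cycles << shift) * mul) >> 32; a negative shift shifts right. */
static uint64_t pvclock_cycles_to_ns(uint64_t cycles, uint32_t mul, int shift)
{
	if (shift < 0)
		cycles >>= -shift;
	else
		cycles <<= shift;

	return (uint64_t)(((unsigned __int128)cycles * mul) >> 32);
}

/* Reverse of the above for one second: recover the TSC frequency in Hz. */
static uint64_t hvclock_to_hz(uint32_t mul, int shift)
{
	uint64_t tm = NSEC_PER_SEC << 32;

	assert(mul != 0);	/* the ioctl rejects a zero multiplier */

	/* Use the spare top bits of the dividend for extra precision. */
	tm <<= 2;
	shift += 2;

	/* While 'mul' is even, move the factor of two into the shift. */
	while (!(mul & 1)) {
		shift++;
		mul >>= 1;
	}

	tm /= mul;

	/* Sketch-only guard: shifting a u64 by 64 or more is undefined. */
	if (shift >= 64)
		return 0;

	return shift > 0 ? tm >> shift : tm << -shift;
}

/* Accept a difference of up to 1 kHz in either direction. */
static int tsc_hz_within_tolerance(uint64_t user_hz, uint64_t curr_hz)
{
	uint64_t diff = user_hz > curr_hz ? user_hz - curr_hz
					  : curr_hz - user_hz;

	return diff <= 1000;
}
```

For example, mul = 0x80000000 with shift = 1 describes a 1 GHz TSC: each
cycle is doubled and then halved again, giving one nanosecond per cycle.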


* [PATCH 7/7] hwmon: adm1275: Support module auto-loading
From: Matti Vaittinen @ 2026-06-16  6:47 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

From: Matti Vaittinen <mazziesaccount@gmail.com>

Populating the i2c_device_id table is not enough to make the driver
module load automatically when the device-tree node for the bd12780
is parsed at boot.

Adding the of_device_id table causes the driver module to be loaded
automatically at boot. Testing has been done with a rather old Debian
system.

When inspecting the generated module aliases with modinfo, the following
entries seem to be the difference:

alias:          of:N*T*Crohm,bd12780C*
alias:          of:N*T*Crohm,bd12780

I suspect these are required for the module loading to work.
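
As a sanity check of that suspicion, the sketch below matches example MODALIAS
strings against the two generated alias globs using fnmatch(3), which is, to my
understanding, the kind of glob matching that module loading tools apply to
aliases. The node name "hot-swap", the "(null)" type field and the extra
"adi,adm1272" compatible are made-up example values, and modalias_matches() is
a hypothetical helper.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* The two alias globs reported for the module after adding the OF table. */
static const char *const adm1275_of_aliases[] = {
	"of:N*T*Crohm,bd12780",
	"of:N*T*Crohm,bd12780C*",
};

/* Return 1 if any alias glob matches the device's MODALIAS string. */
static int modalias_matches(const char *modalias)
{
	size_t i;

	assert(modalias != NULL);

	for (i = 0; i < sizeof(adm1275_of_aliases) /
			sizeof(adm1275_of_aliases[0]); i++) {
		if (fnmatch(adm1275_of_aliases[i], modalias, 0) == 0)
			return 1;
	}

	return 0;
}
```

The first glob covers a node whose only compatible is "rohm,bd12780"; the
second covers nodes that list further compatibles after it. An "i2c:bd12780"
style string matches neither, which would be consistent with the I2C ID table
alone not triggering the load for a device described in the device tree.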

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>

---

I did not add of_device_ids for the other supported ICs as I can't verify
that doing so causes no side effects. Please let me know if you think those
IDs should be added as well. I would be glad to get a more educated opinion
on adding the of-IDs :) (I can squash this into 3/7 and 6/7 in the next
revision, and add a separate patch adding of-IDs for the other ICs if
required).

---
 drivers/hwmon/pmbus/adm1275.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/hwmon/pmbus/adm1275.c b/drivers/hwmon/pmbus/adm1275.c
index 9e21dd4083e9..c27bb0e49354 100644
--- a/drivers/hwmon/pmbus/adm1275.c
+++ b/drivers/hwmon/pmbus/adm1275.c
@@ -927,9 +927,17 @@ static int adm1275_probe(struct i2c_client *client)
 	return pmbus_do_probe(client, info);
 }

+static const struct of_device_id adm1275_of_match[] = {
+	{ .compatible = "rohm,bd12780", },
+	{ .compatible = "rohm,bd12790", },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, adm1275_of_match);
+
 static struct i2c_driver adm1275_driver = {
 	.driver = {
 		   .name = "adm1275",
+		   .of_match_table = adm1275_of_match,
 		   },
 	.probe = adm1275_probe,
 	.id_table = adm1275_id,
-- 
2.54.0

* Re: [PATCH v18 net-next 07/11] net/nebula-matrix: add intr resource implementation
From: Illusion Wang @ 2026-06-16  6:45 UTC (permalink / raw)
  To: Illusion Wang, Dimon, Alvin, Sam, netdev
  Cc: andrew+netdev, corbet, kuba, horms, linux-doc, pabeni,
	vadim.fedorenko, lukas.bulwahn, edumazet, enelsonmoore, skhan,
	hkallweit1, open list
In-Reply-To: <20260611044916.2383-8-illusion.wang@nebula-matrix.com>

>commit 7a32ca8335c9041ec03b80d30716c474e81723ad
>Author: illusion.wang <illusion.wang@nebula-matrix.com>
>net/nebula-matrix: add intr resource implementation
>This patch adds the nbl_interrupt module for management of driver-private
>MSI-X indices, plus hw_ops callbacks to program the function MSI-X mapping
>registers, configure MSI-X info, and enable mailbox interrupts.
>> diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_interrupt.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_interrupt.c
>> new file mode 100644
>> index 0000000000000..6d6924e7d9a82
>> --- /dev/null
>> +++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_interrupt.c
>[ ... ]
>> +int nbl_res_intr_destroy_msix_map(struct nbl_resource_mgt *res_mgt,
>> +				  u16 func_id)
>> +{
>[ ... ]
>> +	/* use ctrl dev bdf */
>> +	hw_ops->configure_msix_map(res_mgt->hw_ops_tbl->priv, func_id, false, 0,
>> +				   0, 0, 0);
>> +
>> +	intr_mgt->func_intr_res[func_id].interrupts = NULL;
>> +	intr_mgt->func_intr_res[func_id].num_interrupts = 0;
>> +	intr_mgt->func_intr_res[func_id].num_net_interrupts = 0;
>> +	hw_ops->flush_write(res_mgt->hw_ops_tbl->priv);
>> +
>> +	return 0;
>> +}
>Does this code leak the device-managed memory? nbl_res_intr_configure_msix_map()
>allocates interrupts via devm_kcalloc() and msix_map_table->base_addr via
>dmam_alloc_coherent(). Setting the pointer to NULL here without calling
>devm_kfree() and dmam_free_coherent() means the memory won't be freed until
>the device is detached. If the MSI-X map is repeatedly reconfigured, could
>this exhaust memory?
>[ ... ]
>> +int nbl_res_intr_configure_msix_map(struct nbl_resource_mgt *res_mgt,
>> +				    u16 func_id, u16 num_net_msix,
>> +				    u16 num_others_msix,
>> +				    bool net_msix_mask_en)
>> +{
>[ ... ]
>> +	msix_map_table = &intr_mgt->func_intr_res[func_id].msix_map_table;
>> +	WARN_ON(msix_map_table->base_addr);
>> +	msix_map_table->size =
>> +		sizeof(struct nbl_msix_map) * NBL_MSIX_MAP_TABLE_MAX_ENTRIES;
>> +	/* HW indexes the MSI-X map table with a fixed stride of 1024 entries,
>> +	 * so we must allocate the full table even if fewer vectors are used.
>> +	 */
>> +	msix_map_table->base_addr = dmam_alloc_coherent(dev,
>> +							msix_map_table->size,
>> +							&msix_map_table->dma,
>> +							GFP_KERNEL);
>> +	if (!msix_map_table->base_addr) {
>> +		dev_err(dev, "Allocate DMA memory for function msix map table failed\n");
>> +		msix_map_table->size = 0;
>> +		return -ENOMEM;
>> +	}
>> +
>> +	interrupts = devm_kcalloc(dev, requested, sizeof(interrupts[0]),
>> +				  GFP_KERNEL);
>> +	if (!interrupts) {
>> +		ret = -ENOMEM;
>> +		goto alloc_interrupts_err;
>> +	}
>[ ... ]
>> +get_interrupt_err:
>> +	while (i--) {
>> +		intr_index = interrupts[i];
>> +		if (intr_index >= NBL_NET_INTR_BASE)
>> +			clear_bit(intr_index - NBL_NET_INTR_BASE,
>> +				  intr_mgt->interrupt_net_bitmap);
>> +		else
>> +			clear_bit(intr_index,
>> +				  intr_mgt->interrupt_others_bitmap);
>> +	}
>> +
>> +	intr_mgt->func_intr_res[func_id].num_interrupts = 0;
>> +	intr_mgt->func_intr_res[func_id].interrupts = NULL;
>> +	intr_mgt->func_intr_res[func_id].num_net_interrupts = 0;
>> +alloc_interrupts_err:
>> +
>> +	return ret;
>> +}
>If an error occurs after allocating the memory (e.g., jumping to
>get_interrupt_err or alloc_interrupts_err), does this error path also leak
>the devm_kcalloc() and dmam_alloc_coherent() memory by simply returning
>without freeing it?

Both use devm-managed resources, so I think no manual free is needed.

destroy_msix_map: The interrupts array is allocated with devm_kcalloc(), and
the MSI-X map table is allocated with dmam_alloc_coherent() on the device's
DMA pool. Both are tied to the device lifecycle, so they are automatically
reclaimed on driver detach. Setting the pointer to NULL is defensive;
the devm framework handles the actual free.

configure_msix_map error path: Same reasoning. interrupts is allocated with
devm_kcalloc(), and the DMA table is device-managed. On any error, the
function returns to the caller, which will eventually trigger driver remove
and devm cleanup. There is no repeated-reconfig loop in the current design:
configure_msix_map is only called during probe/reinit, not in a hot path, so
exhaustion is not a practical concern. Adding manual free calls would
duplicate what devm already does and risk a double free.

* [PATCH 6/7] hwmon: adm1275: Support ROHM BD12790
From: Matti Vaittinen @ 2026-06-16  6:44 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

From: Matti Vaittinen <mazziesaccount@gmail.com>

Add support for the ROHM BD12790 hot-swap controller, which is largely
similar to the Analog Devices ADM1272.

The BD12790 uses the same selectable 60V/100V voltage ranges and
15mV/30mV current-sense ranges as the ADM1272, and the same VRANGE
(bit 5) and IRANGE (bit 0) layout in PMON_CONFIG. It therefore uses
a dedicated coefficient table that mirrors adm1272_coefficients, with
the following differences derived from BD12790 datasheet Table 1 (p.18):
- power 60V/30mV: m=17560 (vs. 17561)
- power 100V/30mV: m=10536 (vs. 10535)
- temperature: b=31880 (vs. 31871, reflecting T[11:0] = 4.2*T + 3188)
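
As a worked check of the temperature coefficients, the sketch below applies
the PMBus direct-format relationship X = (Y * 10^(-R) - b) / m that the
patch's table comment refers to. pmbus_direct_to_real() and pow10_int() are
hypothetical userspace helpers written for this illustration; they are not
part of the driver or of the pmbus core.

```c
#include <assert.h>

/* Same field order as the driver's coefficient tables: m, b, R. */
struct coefficients {
	int m;
	int b;
	int R;
};

/* Integer power of ten, avoiding a dependency on libm. */
static double pow10_int(int e)
{
	double v = 1.0;

	for (; e > 0; e--)
		v *= 10.0;
	for (; e < 0; e++)
		v /= 10.0;

	return v;
}

/* PMBus direct format: X = (Y * 10^(-R) - b) / m. */
static double pmbus_direct_to_real(int code, struct coefficients c)
{
	assert(c.m != 0);

	return ((double)code * pow10_int(-c.R) - c.b) / c.m;
}
```

With T[11:0] = 4.2*T + 3188, a die temperature of 25 degrees C gives a code
of 3293. Feeding that code through { 42, 31880, -1 } yields
(32930 - 31880) / 42 = 25, so m and b are simply the datasheet slope and
offset scaled by ten.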

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
Assisted-by: GitHub Copilot:claude-sonnet-4.6

---
Originally this patch was AI-generated. I did pretty much re-write the
probe changes by hand, and also fixed some of the coefficient math
afterwards :/ But yeah, this one was AI "assisted". :)

 drivers/hwmon/pmbus/Kconfig   |  4 +--
 drivers/hwmon/pmbus/adm1275.c | 53 +++++++++++++++++++++++++++++------
 2 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/drivers/hwmon/pmbus/Kconfig b/drivers/hwmon/pmbus/Kconfig
index b3c27f3b2712..6ebc01e26db3 100644
--- a/drivers/hwmon/pmbus/Kconfig
+++ b/drivers/hwmon/pmbus/Kconfig
@@ -52,8 +52,8 @@ config SENSORS_ADM1275
 	help
 	  If you say yes here you get hardware monitoring support for Analog
 	  Devices ADM1075, ADM1272, ADM1273, ADM1275, ADM1276, ADM1278, ADM1281,
-	  ADM1293, ADM1294, ROHM BD12780, and SQ24905C Hot-Swap Controller and
-	  Digital Power Monitors.
+	  ADM1293, ADM1294, ROHM BD12780, ROHM BD12790, and SQ24905C
+	  Hot-Swap Controller and Digital Power Monitors.
 
 	  This driver can also be built as a module. If so, the module will
 	  be called adm1275.
diff --git a/drivers/hwmon/pmbus/adm1275.c b/drivers/hwmon/pmbus/adm1275.c
index 838b8827eb76..9e21dd4083e9 100644
--- a/drivers/hwmon/pmbus/adm1275.c
+++ b/drivers/hwmon/pmbus/adm1275.c
@@ -19,7 +19,7 @@
 #include "pmbus.h"
 
 enum chips { adm1075, adm1272, adm1273, adm1275, adm1276, adm1278, adm1281,
-	 adm1293, adm1294, bd12780, sq24905c };
+	 adm1293, adm1294, bd12780, bd12790, sq24905c };
 
 #define ADM1275_MFR_STATUS_IOUT_WARN2	BIT(0)
 #define ADM1293_MFR_STATUS_VAUX_UV_WARN	BIT(5)
@@ -47,8 +47,8 @@ enum chips { adm1075, adm1272, adm1273, adm1275, adm1276, adm1278, adm1281,
 #define ADM1278_VOUT_EN			BIT(1)
 
 #define ADM1278_PMON_DEFCONFIG		(ADM1278_VOUT_EN | ADM1278_TEMP1_EN | ADM1278_TSFILT)
-/* The BD12780 data sheets mark TSFILT bit as reserved. */
-#define BD12780_PMON_DEFCONFIG		(ADM1278_VOUT_EN | ADM1278_TEMP1_EN)
+/* The BD127x0 data sheets mark TSFILT bit as reserved. */
+#define BD127X0_PMON_DEFCONFIG		(ADM1278_VOUT_EN | ADM1278_TEMP1_EN)
 
 #define ADM1293_IRANGE_25		0
 #define ADM1293_IRANGE_50		BIT(6)
@@ -136,6 +136,30 @@ static const struct coefficients adm1272_coefficients[] = {
 
 };
 
+/*
+ * BD12790 coefficients derived from preliminary datasheet, Table 1 (p.18)
+ * and the PMBus direct-format relationship X = (Y * 10^(-R) - b) / m.
+ *
+ * Voltage: V[V] = 14.77e-3 * code (60V) / 24.62e-3 * code (100V)
+ *   -> m = 6770, R=-2 / m = 4062, R=-2
+ * Current: code = I[A] * RS * 132802.1 + 2048 (15mV) / * 66401.06 + 2048 (30mV)
+ *   -> m = 1328, b = 2048 * 10^(-R) = 20480, R=-1 / m = 664, same b and R
+ * Power: code = k * RS * PIN, k = 35119.94 / 17559.97 / 21071.44 / 10535.72
+ *   -> m = round(k / 10^(-R)), R=-2 for 60V/15mV, R=-3 for the other three
+ * Temperature: code = 4.2 * T + 3188 -> m = 42, b = 3188 * 10 = 31880, R=-1
+ */
+static const struct coefficients bd12790_coefficients[] = {
+	[0] = { 6770, 0, -2 },		/* voltage, vrange 60V */
+	[1] = { 4062, 0, -2 },		/* voltage, vrange 100V */
+	[2] = { 1328, 20480, -1 },	/* current, vsense range 15mV */
+	[3] = { 664, 20480, -1 },	/* current, vsense range 30mV */
+	[4] = { 3512, 0, -2 },		/* power, vrange 60V, irange 15mV */
+	[5] = { 21071, 0, -3 },		/* power, vrange 100V, irange 15mV */
+	[6] = { 17560, 0, -3 },		/* power, vrange 60V, irange 30mV */
+	[7] = { 10536, 0, -3 },		/* power, vrange 100V, irange 30mV */
+	[8] = { 42, 31880, -1 },	/* temperature */
+};
+
 static const struct coefficients adm1275_coefficients[] = {
 	[0] = { 19199, 0, -2 },		/* voltage, vrange set */
 	[1] = { 6720, 0, -1 },		/* voltage, vrange not set */
@@ -504,6 +528,7 @@ static const struct i2c_device_id adm1275_id[] = {
 	 */
 	{ "bd12780", bd12780 },
 	{ "bd12780a", /* driver data unused, see --^ */ },
+	{ "bd12790", bd12790 },
 	{ "mc09c", sq24905c },
 	{ }
 };
@@ -581,7 +606,8 @@ static int adm1275_probe(struct i2c_client *client)
 	if (mid->driver_data == adm1272 || mid->driver_data == adm1273 ||
 	    mid->driver_data == adm1278 || mid->driver_data == adm1281 ||
 	    mid->driver_data == adm1293 || mid->driver_data == adm1294 ||
-	    mid->driver_data == bd12780 || mid->driver_data == sq24905c)
+	    mid->driver_data == bd12780 || mid->driver_data == bd12790 ||
+	    mid->driver_data == sq24905c)
 		config_read_fn = i2c_smbus_read_word_data;
 	else
 		config_read_fn = i2c_smbus_read_byte_data;
@@ -655,12 +681,23 @@ static int adm1275_probe(struct i2c_client *client)
 		break;
 	case adm1272:
 	case adm1273:
+	case bd12790:
+	{
+		u16 defconfig;
+
 		data->have_vout = true;
 		data->have_pin_max = true;
 		data->have_temp_max = true;
 		data->have_power_sampling = true;
 
-		coefficients = adm1272_coefficients;
+		if (data->id == bd12790) {
+			coefficients = bd12790_coefficients;
+			defconfig = BD127X0_PMON_DEFCONFIG;
+		} else {
+			coefficients = adm1272_coefficients;
+			defconfig = ADM1278_PMON_DEFCONFIG;
+		}
+
 		vindex = (config & ADM1275_VRANGE) ? 1 : 0;
 		cindex = (config & ADM1272_IRANGE) ? 3 : 2;
 		/* pindex depends on the combination of the above */
@@ -685,14 +722,14 @@ static int adm1275_probe(struct i2c_client *client)
 			PMBUS_HAVE_VOUT | PMBUS_HAVE_STATUS_VOUT |
 			PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
 
-		ret = adm1275_enable_vout_temp(data, client, config,
-					       ADM1278_PMON_DEFCONFIG);
+		ret = adm1275_enable_vout_temp(data, client, config, defconfig);
 		if (ret)
 			return ret;
 
 		if (config & ADM1278_VIN_EN)
 			info->func[0] |= PMBUS_HAVE_VIN;
 		break;
+	}
 	case adm1275:
 		if (device_config & ADM1275_IOUT_WARN2_SELECT)
 			data->have_oc_fault = true;
@@ -738,7 +775,7 @@ static int adm1275_probe(struct i2c_client *client)
 		u16 defconfig;
 
 		if (data->id == bd12780)
-			defconfig = BD12780_PMON_DEFCONFIG;
+			defconfig = BD127X0_PMON_DEFCONFIG;
 		else
 			defconfig = ADM1278_PMON_DEFCONFIG;
 
-- 
2.54.0
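
The derivation in the bd12790_coefficients comment above can be checked
numerically. The following is a minimal sketch, not part of the patch. It
assumes the PMBus direct format (code = (m * X + b) * 10^R) and assumes RS
is expressed in milliohms, so slopes that multiply RS are divided by 1000.
The helper names direct_m() and direct_b() are hypothetical, and the
pairing of each power constant k with a voltage/current range is inferred
by matching against the table.

```python
# Re-derive the bd12790 direct-format coefficients (m, b, R) from the
# slopes quoted in the comment. Assumption: RS is expressed in milliohms,
# so slopes that multiply RS (current, power) are divided by 1000.

def direct_m(slope, r, scales_with_rs=False):
    """Return m = round(slope * 10^(-R)), rescaled to milliohms if needed."""
    if scales_with_rs:
        slope = slope / 1000.0
    return round(slope * 10 ** (-r))


def direct_b(offset, r):
    """Return b = offset * 10^(-R) for the same exponent R."""
    return round(offset * 10 ** (-r))


derived = [
    (direct_m(1 / 14.77e-3, -2), 0, -2),                      # voltage, 60V
    (direct_m(1 / 24.62e-3, -2), 0, -2),                      # voltage, 100V
    (direct_m(132802.1, -1, True), direct_b(2048, -1), -1),   # current, 15mV
    (direct_m(66401.06, -1, True), direct_b(2048, -1), -1),   # current, 30mV
    (direct_m(35119.94, -2, True), 0, -2),                    # power, 60V/15mV
    (direct_m(21071.44, -3, True), 0, -3),                    # power, 100V/15mV
    (direct_m(17559.97, -3, True), 0, -3),                    # power, 60V/30mV
    (direct_m(10535.72, -3, True), 0, -3),                    # power, 100V/30mV
    (direct_m(4.2, -1), direct_b(3188, -1), -1),              # temperature
]

# Print the derived rows in the same index order as the table.
for index, coeff in enumerate(derived):
    print(index, coeff)
```

Under these assumptions each derived row can be compared entry by entry
with bd12790_coefficients[0..8] in the patch.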



* [RFC PATCH v4 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-06-16  6:41 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	lipengfei28, zhangbo56, kernel test robot
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Documentation covering design, usage, tracefs interface, binary
  format, and performance characteristics. Added to the 'Core Tracing
  Frameworks' toctree in Documentation/trace/index.rst. Documents:
  - Reset is destructive: it requires tracing to be stopped and also
    clears the ring buffer so no stale <stack_id N> survives
  - Boot-time activation via trace_options=stackmap
  - bits parameter range [10, 18] and worst-case memory usage
  - tracefs file modes (0640 / 0440)
  - Best-effort snapshot semantics for stack_map_bin, serialized
    against reset via the reader_sem
  - Counter naming: successes (events served), drops, success_rate;
    successes/drops are best-effort and saturate on long runs
  - Gravestone amplification when the pool is exhausted

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Functional selftest verifying:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero successes (a nonzero drops count is
    a legitimate by-design fallback and is not treated as failure;
    only zero successes alongside nonzero drops is fatal)
  - reset clears entries when tracing is stopped
  - reset is rejected (-EBUSY) while tracing is active
  Test reads trace contents BEFORE switching back to the nop tracer
  (tracer_init() unconditionally resets the ring buffer). The
  function:tracer dependency is declared in '# requires:' so
  ftracetest skips on kernels without CONFIG_FUNCTION_TRACER instead
  of failing spuriously.

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc:
  Verifies the destructive-reset semantics and the binary ABI header:
  - after 'echo 0 > stack_map', the trace buffer no longer contains
    any stale <stack_id N>
  - stack_map_bin begins with the expected magic and version

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc:
  Verifies the option is gated to the top-level instance: a secondary
  instance neither exposes options/stackmap nor the stack_map* nodes,
  and writing 'stackmap' to its aggregate trace_options file is
  rejected rather than accepted as a no-op.

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export.
  Features:
  - Automatic endianness detection via magic number
  - Batched addr2line via stdin (avoids ARG_MAX with large stacks)
  - JSON output mode (ips are always hex addresses; the ftrace
    trampoline marker is shown only in the resolved symbols)
  - Top-N filtering by ref_count

Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x46534D42 = 'FSMB').
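
The layout described above can be exercised without a running kernel. The
following is a minimal sketch, not the shipped tool: build_blob() and
parse_blob() are hypothetical helpers, and the sample addresses are made
up. It assumes only the header and per-stack layout stated in this patch
(16-byte header, 16-byte entry header, then nr u64 addresses).

```python
import struct

MAGIC = 0x46534D42  # 'FSMB'
HEADER_SIZE = 16    # magic, version, nr_stacks, reserved (4 x u32)
ENTRY_SIZE = 16     # stack_id, nr, ref_count, reserved (4 x u32)


def build_blob(stacks, endian='<'):
    """Serialize [(stack_id, ref_count, [ips])] in the documented layout."""
    blob = struct.pack(f'{endian}IIII', MAGIC, 1, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        blob += struct.pack(f'{endian}IIII', stack_id, len(ips), ref_count, 0)
        blob += struct.pack(f'{endian}{len(ips)}Q', *ips)
    return blob


def parse_blob(data):
    """Detect byte order from the magic, then decode every stack."""
    endian = '<' if struct.unpack_from('<I', data, 0)[0] == MAGIC else '>'
    magic, version, nr_stacks, _reserved = struct.unpack_from(
        f'{endian}IIII', data, 0)
    if magic != MAGIC or version != 1:
        raise ValueError("unexpected magic or version")
    offset = HEADER_SIZE
    stacks = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _reserved = struct.unpack_from(
            f'{endian}IIII', data, offset)
        offset += ENTRY_SIZE
        ips = list(struct.unpack_from(f'{endian}{nr}Q', data, offset))
        offset += 8 * nr
        stacks.append((stack_id, ref_count, ips))
    return stacks


# Made-up sample data; 0x7fffffff is the trampoline marker from the patch.
sample = [
    (42, 1337, [0xffffffc008123456, 0x7fffffff]),
    (7, 3, [0xffffffc008000000]),
]
little = build_blob(sample, '<')
big = build_blob(sample, '>')
print(len(little), parse_blob(little) == parse_blob(big) == sample)
```

Building the same data in both byte orders and parsing it back is a quick
way to confirm that magic-based detection is sufficient for this format.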

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 Documentation/trace/ftrace-stackmap.rst       | 177 ++++++++++++++++++
 Documentation/trace/index.rst                 |   1 +
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 111 +++++++++++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  54 ++++++
 .../ftrace/test.d/ftrace/stackmap-reset.tc    |  76 ++++++++
 tools/tracing/stackmap_dump.py                | 164 ++++++++++++++++
 6 files changed, 583 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
 create mode 100755 tools/tracing/stackmap_dump.py

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..8d0b5c389862
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+  (default: 14 → 16384 stacks; valid range: 10-18).
+
+  At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+  for the element pool. Each ``open()`` of ``stack_map_bin`` may
+  briefly allocate a similar amount for a snapshot. The cap is set
+  intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+    echo 1 > /sys/kernel/debug/tracing/options/stackmap
+    echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+    echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+    sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+    cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+    stack_id 42 [ref 1337, depth 8]
+      [0] schedule+0x48/0xc0
+      [1] schedule_timeout+0x1c/0x30
+      ...
+
+To view statistics::
+
+    cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+    entries:      2500 / 16384
+    table_size:   32768
+    successes:    148923
+    drops:        0
+    success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+    echo 0 > /sys/kernel/debug/tracing/tracing_on
+    echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Reset is destructive to the trace buffer: because the ring buffer may
+still hold ``<stack_id N>`` events that reference soon-to-be-reused
+slots, resetting the map also resets the owning trace buffer (and its
+snapshot, if allocated). This keeps ring-buffer stack_ids and the map
+coherent. Read out any trace data you need before resetting.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+    trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+    Text export of all deduplicated stacks with symbol resolution.
+    Writing ``0`` or ``reset`` clears all entries (only when tracing
+    is stopped).
+
+``stack_map_stat``
+    Statistics: entries (allocated unique stacks), table_size,
+    successes (events served), drops (events that fell back to
+    full-stack recording), and success_rate. Drops accumulate when
+    the element pool is exhausted; once that happens, slots that
+    won the cmpxchg but failed to allocate an element remain
+    "claimed but empty" and increase probe pressure for any future
+    insert hashing to the same bucket. Reset (when tracing is
+    stopped) clears these gravestones.
+
+``stack_map_bin``
+    Binary export for efficient userspace consumption. Format:
+
+    - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+    - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+    All fields are written in the kernel's native byte order.
+    Userspace tools detect endianness by reading the magic value.
+    Magic: ``0x46534D42`` ('FSMB'), Version: 1.
+
+    Trampoline frames are exported as the sentinel value
+    ``0x7fffffff`` (FTRACE_TRAMPOLINE_MARKER); all other addresses are
+    passed through ``trace_adjust_address()`` so they match the
+    ``stack_map`` text output's address-adjustment rules. Note this is
+    the same adjustment ftrace applies to its own trace output (mainly
+    relevant for persistent / last-boot buffers), not a general KASLR
+    un-offset: resolving these addresses offline still requires the
+    matching kernel's symbol information.
+
+    The export is a best-effort snapshot allocated at ``open()``; it
+    may be truncated if stacks are inserted while the snapshot is
+    taken. A bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+  (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+  length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+  confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
    ftrace
    ftrace-design
    ftrace-uses
+   ftrace-stackmap
    kprobes
    kprobetrace
    fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..64dfe7cc66bd
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,111 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains <stack_id> events (read BEFORE switching
+#    tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and at least some successes (drops is
+#    a legitimate by-design fallback counter and is allowed to be nonzero;
+#    only zero successes alongside nonzero drops indicates breakage)
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+    disable_tracing 2>/dev/null
+    echo nop > current_tracer 2>/dev/null
+    echo 0 > options/stackmap 2>/dev/null
+    echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map      || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin  || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+    fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+    fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+    fail "stackmap has zero successes"
+fi
+
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+# drops is a legitimate by-design fallback counter: when the map is full
+# or under heavy probe pressure, stackmap falls back to recording a full
+# stack instead of a stack_id. A nonzero drops count is therefore not a
+# failure. Only treat it as fatal if dedup never worked at all (no
+# successes), which would indicate the feature is genuinely broken rather
+# than merely under pressure.
+if [ "$successes" -eq 0 ] && [ "$drops" -ne 0 ]; then
+    fail "stackmap had $drops drops and zero successes (feature broken?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+    fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+    disable_tracing
+    fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+    fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..28810ba20432
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,54 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+#    options/stackmap or stack_map* nodes.
+
+fail() {
+    echo "FAIL: $1"
+    rmdir instances/test_stackmap_gate 2>/dev/null
+    exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map        || fail "stack_map missing on global instance"
+test -e stack_map_stat   || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin    || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+#    or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+    fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+    if [ -e instances/test_stackmap_gate/$f ]; then
+        fail "secondary instance unexpectedly has $f"
+    fi
+done
+
+# 3. The aggregate trace_options file still reaches set_tracer_flag(),
+#    so writing 'stackmap' there must be rejected on a secondary
+#    instance. Otherwise the bit could appear set in trace_options
+#    while the hot path silently falls back to a full stack trace
+#    (tr->stackmap == NULL).
+if echo stackmap > instances/test_stackmap_gate/trace_options 2>/dev/null; then
+    fail "secondary instance accepted 'echo stackmap > trace_options'"
+fi
+if grep -qw stackmap instances/test_stackmap_gate/trace_options; then
+    fail "secondary instance trace_options reports stackmap as set"
+fi
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
new file mode 100644
index 000000000000..803cc282f9ab
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
@@ -0,0 +1,76 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap reset clears the trace buffer; check ABI header
+# requires: stack_map options/stackmap function:tracer
+
+# Lock in the two things most likely to regress in the stackmap ABI /
+# lifetime:
+#   1. Resetting the stackmap (echo 0 > stack_map, tracing stopped) also
+#      clears the trace buffer, so no stale <stack_id N> can be left
+#      dangling against an emptied map.
+#   2. The stack_map_bin header carries the expected magic ('FSMB' =
+#      0x46534D42) and version (1).
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+cleanup() {
+    disable_tracing 2>/dev/null
+    echo nop > current_tracer 2>/dev/null
+    echo 0 > options/stackmap 2>/dev/null
+    echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Sanity: the buffer must contain stack_id events before reset, otherwise
+# the post-reset emptiness check below would be meaningless.
+before=$(grep -c "<stack_id" trace || true)
+: "${before:=0}"
+if [ "$before" -eq 0 ]; then
+    fail "no <stack_id> events captured before reset"
+fi
+
+# Reset while tracing is stopped. This must succeed AND clear the trace
+# buffer (destructive reset semantics).
+echo 0 > stack_map || fail "reset rejected while tracing stopped"
+
+after=$(grep -c "<stack_id" trace || true)
+: "${after:=0}"
+if [ "$after" -ne 0 ]; then
+    fail "trace still has $after stale <stack_id> events after reset"
+fi
+
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=-1}"
+if [ "$entries" -ne 0 ]; then
+    fail "stackmap still has $entries entries after reset"
+fi
+
+# Binary export header: magic 'FSMB' (0x46534D42) + version 1.
+# od -tx4 renders the 32-bit words in the target's native byte order,
+# which matches what the kernel wrote, so the comparison is endian-safe.
+if command -v od >/dev/null 2>&1; then
+    magic=$(od -An -tx4 -N4 stack_map_bin | tr -d ' \n')
+    if [ "$magic" != "46534d42" ]; then
+        fail "stack_map_bin bad magic: 0x$magic (expected 46534d42)"
+    fi
+    ver=$(od -An -tx4 -j4 -N4 stack_map_bin | tr -d ' \n')
+    if [ "$ver" != "00000001" ]; then
+        fail "stack_map_bin bad version: 0x$ver (expected 00000001)"
+    fi
+fi
+
+echo "stackmap reset test passed: cleared $before stack_id events, ABI header ok"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..2d9c49b776e6
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x46534D42  # 'FSMB'
+HEADER_SIZE = 16  # 4 x u32
+ENTRY_SIZE = 16   # 4 x u32
+
+# __ftrace_trace_stack() replaces trampoline addresses with this marker
+# (FTRACE_TRAMPOLINE_MARKER == (unsigned long)INT_MAX) before the stack
+# is stored, so the binary export carries it verbatim.
+FTRACE_TRAMPOLINE_MARKER = 0x7fffffff
+TRAMPOLINE_LABEL = '[FTRACE TRAMPOLINE]'
+
+
+def detect_endianness(data):
+    """Detect byte order from magic number in header."""
+    if len(data) < 4:
+        raise ValueError("File too small")
+    magic_le = struct.unpack_from('<I', data, 0)[0]
+    if magic_le == MAGIC:
+        return '<'
+    magic_be = struct.unpack_from('>I', data, 0)[0]
+    if magic_be == MAGIC:
+        return '>'
+    raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+    """Resolve multiple addresses in one addr2line invocation."""
+    if not addrs:
+        return {}
+    try:
+        # Feed addresses on stdin to avoid ARG_MAX limits with large
+        # numbers of addresses (one stack can have 30+ frames; a
+        # snapshot can have thousands of unique stacks).
+        stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux],
+            input=stdin, capture_output=True, text=True, timeout=60
+        )
+        lines = result.stdout.split('\n')
+        # addr2line outputs 2 lines per address: function name + source location
+        symbols = {}
+        for i, addr in enumerate(addrs):
+            idx = i * 2
+            if idx < len(lines) and lines[idx] and lines[idx] != '??':
+                symbols[addr] = lines[idx]
+        return symbols
+    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+        print(f"warning: addr2line failed: {e}", file=sys.stderr)
+        return {}
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    endian = detect_endianness(data)
+    header_fmt = f'{endian}IIII'
+    entry_fmt = f'{endian}IIII'
+
+    magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+    if version != 1:
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    # Batch symbol resolution
+    symbols = {}
+    if args.vmlinux:
+        all_addrs = set()
+        for _, _, ips in stacks:
+            all_addrs.update(ip for ip in ips
+                             if ip != FTRACE_TRAMPOLINE_MARKER)
+        symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+    def render(ip):
+        if ip == FTRACE_TRAMPOLINE_MARKER:
+            return TRAMPOLINE_LABEL
+        return symbols.get(ip, f'0x{ip:x}')
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [render(ip) for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                if ip == FTRACE_TRAMPOLINE_MARKER:
+                    print(f"  [{i}] {TRAMPOLINE_LABEL}")
+                    continue
+                sym = symbols.get(ip, '')
+                if sym:
+                    sym = f' {sym}'
+                print(f"  [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
-- 
2.34.1
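
The Design section of ftrace-stackmap.rst in the patch above describes a
pre-allocated pool, a 2x over-provisioned table, and bounded linear
probing. The following is a single-threaded toy model of that idea, not
the kernel implementation. Everything here is an assumption made for
illustration: the ToyStackMap class, the max_probe value, and the use of
Python's built-in hash() as a stand-in for the seeded jhash. It does not
model cmpxchg races or the "claimed but empty" slots.

```python
class ToyStackMap:
    """Toy dedup table: bounded linear probing over a 2x-sized table."""

    def __init__(self, bits=4, max_probe=8):
        self.capacity = 1 << bits                   # element pool size
        self.table = [None] * (2 * self.capacity)   # slot -> stack_id
        self.max_probe = max_probe                  # hypothetical bound
        self.entries = []                           # index is the stack_id
        self.successes = 0
        self.drops = 0

    def get_id(self, stack):
        """Return a stack_id, or None when the caller must fall back."""
        key = tuple(stack)
        slot = hash(key) % len(self.table)          # stand-in for jhash
        for _ in range(self.max_probe):
            stack_id = self.table[slot]
            if stack_id is None:
                if len(self.entries) >= self.capacity:
                    break                           # pool exhausted
                self.entries.append([key, 0])       # [stack, ref_count]
                stack_id = len(self.entries) - 1
                self.table[slot] = stack_id
            if self.entries[stack_id][0] == key:    # full compare confirms
                self.entries[stack_id][1] += 1
                self.successes += 1
                return stack_id
            slot = (slot + 1) % len(self.table)     # linear probing
        self.drops += 1
        return None


smap = ToyStackMap(bits=2)
first = smap.get_id([1, 2, 3])
again = smap.get_id([1, 2, 3])
other = smap.get_id([4, 5])
for i in range(10):
    smap.get_id([100 + i])                          # overflow the pool
print(first == again, first != other, smap.successes, smap.drops)
```

In this model a None return plays the role of a drop: the caller would
record the full stack instead of a stack_id.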



* [RFC PATCH v4 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-06-16  6:41 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	lipengfei28, zhangbo56
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:

- New TRACE_STACK_ID in trace_type enum and stack_id_entry in
  trace_entries.h.

- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
  is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
  TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
  used by TRACE_ITER_PROF_TEXT_OFFSET).

- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
  so it is only exposed under the top-level trace instance, matching
  the convention already used for global-only options such as 'printk'
  and 'record-cmd'. Secondary instances under tracing/instances/*/
  do not see the option in their options/ directory.

- set_tracer_flag() additionally rejects enabling STACKMAP on a
  secondary instance. The per-option file is hidden on secondary
  instances, but a write to the aggregate trace_options file still
  reaches set_tracer_flag(); without this check the bit could be
  accepted and then become a silent no-op in the hot path (where
  tr->stackmap is NULL). This closes the global-instance-only gate
  at the write path, not just in the tracefs layout.

- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer slot
  BEFORE calling ftrace_stackmap_get_id(), so the map (and its
  ref_count / success counters) is only mutated when a ring-buffer
  event will actually reference the entry. If the reservation fails
  it falls back to a full stack; if get_id() fails it discards the
  reserved slot and falls back. A stack deeper than
  FTRACE_STACKMAP_MAX_DEPTH skips the map entirely (get_id() would
  return -E2BIG) and records a full stack, so deep traces are never
  truncated or merged.

- The stackmap pointer is read with smp_load_acquire() and published
  with smp_store_release() to ensure proper initialization ordering. The
  hot path falls back to a full stack whenever tr->stackmap is NULL.

- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later clear that trace_array's buffers during reset.

- Added stack_id print handler in trace_output.c and TRACE_STACK_ID
  to trace_valid_entry() in trace_selftest.c so ftrace startup
  selftests accept the new entry type when the stackmap option is
  enabled.

Failure-atomic init and boot-time activation:

- The global stackmap and its tracefs files are created during
  tracer_init_tracefs(). stack_map is the single required file (it is
  both the resolver and the reset interface); it is created BEFORE the
  map pointer is published with smp_store_release(), so an observed
  non-NULL tr->stackmap implies the resolver/reset file exists. If
  stack_map cannot be created the map is destroyed and never published.

- A small init-state (PENDING / DONE / FAILED) lets set_tracer_flag()
  distinguish "not initialized yet" from "init failed". Boot-time
  options (trace_options=stackmap,stacktrace) are applied before the
  tracefs init work runs; the flag is allowed to be set while init is
  PENDING (the hot path falls back until the map is published, then the
  boot-set option takes effect), and is only rejected once init has
  permanently FAILED. On failure the STACKMAP flag is also cleared from
  the global instance so options/stackmap never reports an enabled
  no-op.
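
The acceptance rule for enabling the option can be summarized with a
small sketch. The names below (demo_init_state, demo_enable_stackmap())
are hypothetical and only model the check made in set_tracer_flag();
they are not part of the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the three init states described above. */
enum demo_init_state {
	DEMO_INIT_PENDING,	/* tracefs init has not run yet */
	DEMO_INIT_DONE,		/* map published */
	DEMO_INIT_FAILED,	/* permanent failure */
};

/*
 * Returns 0 if an enable request would be accepted, -EINVAL otherwise.
 * Secondary instances are always rejected; the global instance is only
 * rejected once init has permanently failed.
 */
static int demo_enable_stackmap(bool is_global, enum demo_init_state state)
{
	if (!is_global)
		return -EINVAL;
	if (state == DEMO_INIT_FAILED)
		return -EINVAL;
	return 0;	/* PENDING and DONE are both accepted */
}
```

Accepting the request while init is PENDING is what lets a boot-time
trace_options=stackmap take effect once the map is published.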

Fallback behavior: if the stackmap cannot supply an id (pool exhausted,
reset in progress, NULL map pointer, or a too-deep stack), the full
stack trace is recorded as before, so no new failure modes are
introduced.

Per-instance stackmap support is left as a follow-up; gating the
option to the global instance (both in the tracefs layout and at the
set_tracer_flag() write path) makes the global-only scope explicit.

Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/trace.c          | 216 +++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h          |  17 +++
 kernel/trace/trace_entries.h  |  15 +++
 kernel/trace/trace_output.c   |  23 ++++
 kernel/trace/trace_selftest.c |   1 +
 5 files changed, 269 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..e00bee5d0e01 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
 /* trace_options that are only supported by global_trace */
 #define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) |			\
 	       TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) |	\
-	       TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+	       TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) |	\
+	       FPROFILE_DEFAULT_FLAGS)
 
 /* trace_flags that are default zero for instances */
 #define ZEROED_TRACE_FLAGS \
 	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
-	 TRACE_ITER(COPY_MARKER))
+	 TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
 
 /*
  * The global_trace is the descriptor that holds the top-level tracing
@@ -1562,7 +1564,7 @@ void tracing_reset_online_cpus(struct array_buffer *buf)
 	ring_buffer_record_enable(buffer);
 }
 
-static void tracing_reset_all_cpus(struct array_buffer *buf)
+void tracing_reset_all_cpus(struct array_buffer *buf)
 {
 	struct trace_buffer *buffer = buf->buffer;
 
@@ -2184,6 +2186,75 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 *
+	 * Reserve the TRACE_STACK_ID ring-buffer slot BEFORE inserting
+	 * into the stackmap. This guarantees the map is only mutated
+	 * (and its ref_count / success counters bumped) when a
+	 * ring-buffer event will actually reference the entry:
+	 *   - reservation fails  -> fall back to full stack, map untouched
+	 *   - get_id() fails      -> discard the reserved slot, fall back
+	 * so stack_map_stat counters stay consistent with what the ring
+	 * buffer holds, and a failed reservation never consumes a map
+	 * slot for an event that records a full stack anyway.
+	 */
+	if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+		struct ftrace_stackmap *smap;
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		/*
+		 * Pairs with the smp_store_release() that publishes the
+		 * fully initialized global stackmap at tracefs init.
+		 */
+		smap = smp_load_acquire(&tr->stackmap);
+		if (!smap)
+			goto full_stack;
+
+		/*
+		 * The stackmap stores at most FTRACE_STACKMAP_MAX_DEPTH
+		 * frames per entry. A deeper trace would be truncated, and
+		 * two distinct stacks that share the first MAX_DEPTH frames
+		 * would hash and compare equal, silently merging into one
+		 * stack_id. Keep the conservative full-stack path for deep
+		 * traces so no information is lost or misattributed.
+		 */
+		if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+			goto full_stack;
+
+		event = __trace_buffer_lock_reserve(buffer, TRACE_STACK_ID,
+						    sizeof(*sid_entry), trace_ctx);
+		if (!event)
+			goto full_stack;
+
+		sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+		if (sid < 0) {
+			/*
+			 * Pool exhausted or a reset is in progress. Discard
+			 * the reserved stack_id slot and record the full
+			 * stack instead, so the event still gets a trace.
+			 */
+			__trace_event_discard_commit(buffer, event);
+			goto full_stack;
+		}
+
+		sid_entry = ring_buffer_event_data(event);
+		sid_entry->stack_id = sid;
+		/*
+		 * stack_id is a synthetic side-event attached to a
+		 * primary trace event that was already subject to
+		 * filtering. No per-event filter is defined for
+		 * TRACE_STACK_ID, so commit unconditionally.
+		 */
+		__buffer_unlock_commit(buffer, event);
+		goto out;
+	}
+full_stack:
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -3979,6 +4050,33 @@ int trace_keep_overwrite(struct tracer *tracer, u64 mask, int set)
 	return 0;
 }
 
+#ifdef CONFIG_FTRACE_STACKMAP
+/*
+ * Tracks tracefs-time initialization of the global stackmap so that
+ * set_tracer_flag() can distinguish "not initialized yet" from
+ * "initialization permanently failed".
+ *
+ * Boot-time options (trace_options=stackmap,stacktrace) are applied
+ * very early, before tracer_init_tracefs() creates and publishes the
+ * map. We must allow the STACKMAP flag to be set during that window
+ * (the hot path falls back to a full stack while tr->stackmap is NULL,
+ * then starts using the map once it is published). We must, however,
+ * reject the enable once init has *failed*, so options/stackmap never
+ * reports an enabled no-op.
+ *
+ * Written once from the tracefs init work before any concurrent
+ * userspace writer to trace_options can run, then only read; a plain
+ * int is therefore sufficient.
+ */
+enum {
+	STACKMAP_INIT_PENDING,	/* tracer_init_tracefs() not run yet */
+	STACKMAP_INIT_DONE,	/* map published, stack_map file created */
+	STACKMAP_INIT_FAILED,	/* permanent failure, never available */
+};
+
+static int stackmap_init_state = STACKMAP_INIT_PENDING;
+#endif
+
 int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
 {
 	switch (mask) {
@@ -3993,6 +4091,33 @@ int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
 	if (!!(tr->trace_flags & mask) == !!enabled)
 		return 0;
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * STACKMAP is intentionally global-instance-only: the dedup map,
+	 * its tracefs files (stack_map / stack_map_stat / stack_map_bin)
+	 * and the lifetime/reset semantics are tied to the global trace
+	 * array. options/stackmap is hidden on secondary instances via
+	 * TOP_LEVEL_TRACE_FLAGS, but writes still reach set_tracer_flag()
+	 * through the aggregate trace_options file. Reject the enable on
+	 * a secondary instance so it cannot be silently accepted and then
+	 * become a no-op in the hot path (where tr->stackmap is NULL and
+	 * the code falls back to a full stack trace).
+	 *
+	 * On the global instance, allow the enable while init is still
+	 * pending (boot-time trace_options=stackmap is applied before the
+	 * tracefs init work creates the map; the hot path falls back
+	 * until the map is published). Only reject once init has
+	 * permanently failed, so options/stackmap never reports an
+	 * enabled no-op. READ_ONCE() suffices: this only inspects the
+	 * init state, it does not dereference the map (the hot path uses
+	 * smp_load_acquire(&tr->stackmap) for that).
+	 */
+	if (mask == TRACE_ITER(STACKMAP) && enabled &&
+	    (tr != &global_trace ||
+	     READ_ONCE(stackmap_init_state) == STACKMAP_INIT_FAILED))
+		return -EINVAL;
+#endif
+
 	/* Give the tracer a chance to approve the change */
 	if (tr->current_trace->flag_changed)
 		if (tr->current_trace->flag_changed(tr, mask, !!enabled))
@@ -9222,6 +9347,91 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	{
+		struct ftrace_stackmap *smap;
+		struct dentry *map_file;
+
+		smap = ftrace_stackmap_create(&global_trace);
+		if (!IS_ERR(smap)) {
+			/*
+			 * Failure-atomic init: stack_map is the single
+			 * required tracefs file (it doubles as the reset
+			 * interface and the human-readable resolver). If
+			 * we cannot create it, the hot path must not be
+			 * able to emit <stack_id N> events that no one can
+			 * resolve or clear, so refuse to publish the map
+			 * and tear it down.
+			 *
+			 * Create stack_map BEFORE smp_store_release() so an
+			 * observed non-NULL global_trace.stackmap implies
+			 * its resolver/reset file exists.
+			 */
+			map_file = trace_create_file("stack_map",
+						     TRACE_MODE_WRITE, NULL,
+						     smap,
+						     &ftrace_stackmap_fops);
+			if (!map_file) {
+				pr_warn("ftrace stackmap init: stack_map create failed, dedup disabled\n");
+				ftrace_stackmap_destroy(smap);
+				/*
+				 * Permanent failure. Record it and clear a
+				 * STACKMAP flag that a boot-time
+				 * trace_options=stackmap may have set, so
+				 * options/stackmap does not report an
+				 * enabled no-op and later userspace enables
+				 * return -EINVAL.
+				 */
+				WRITE_ONCE(stackmap_init_state,
+					   STACKMAP_INIT_FAILED);
+				global_trace.trace_flags &=
+					~TRACE_ITER(STACKMAP);
+			} else {
+				/*
+				 * smp_store_release pairs with the
+				 * smp_load_acquire() in
+				 * __ftrace_trace_stack(). Publishing only
+				 * after the required file exists keeps
+				 * "smap visible" => "resolver/reset
+				 * available".
+				 */
+				smp_store_release(&global_trace.stackmap,
+						  smap);
+				WRITE_ONCE(stackmap_init_state,
+					   STACKMAP_INIT_DONE);
+				/*
+				 * stat and bin are auxiliary observability
+				 * surfaces. If they fail to be created we
+				 * keep dedup enabled (the kernel side still
+				 * works, and stack_map alone is enough to
+				 * resolve and reset); trace_create_file()
+				 * already pr_warn()s on failure.
+				 */
+				trace_create_file("stack_map_stat",
+						  TRACE_MODE_READ, NULL,
+						  smap,
+						  &ftrace_stackmap_stat_fops);
+				trace_create_file("stack_map_bin",
+						  TRACE_MODE_READ, NULL,
+						  smap,
+						  &ftrace_stackmap_bin_fops);
+			}
+		} else {
+			pr_warn("ftrace stackmap init failed, dedup disabled\n");
+			/*
+			 * global_trace is statically defined; its stackmap
+			 * field is zero-initialized via BSS, so leaving it
+			 * NULL ensures the smp_load_acquire() in
+			 * __ftrace_trace_stack() falls back to full stack.
+			 * Mark init failed and clear any boot-time STACKMAP
+			 * flag so userspace enables are rejected rather than
+			 * becoming silent no-ops.
+			 */
+			WRITE_ONCE(stackmap_init_state, STACKMAP_INIT_FAILED);
+			global_trace.trace_flags &= ~TRACE_ITER(STACKMAP);
+		}
+	}
+#endif
 	create_trace_instances(NULL);
 
 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..95db43bfc747 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,
 
 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot	*cond_snapshot;
 #endif
 	struct trace_func_repeats	__percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap		*stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET);		\
 		IF_ASSIGN(var, ent, struct func_repeats_entry,		\
 			  TRACE_FUNC_REPEATS);				\
+		IF_ASSIGN(var, ent, struct stack_id_entry,		\
+			  TRACE_STACK_ID);				\
 		__ftrace_bad_type();					\
 	} while (0)
 
@@ -689,6 +695,7 @@ extern int tracing_disabled;
 int tracer_init(struct tracer *t, struct trace_array *tr);
 int tracing_is_enabled(void);
 void tracing_reset_online_cpus(struct array_buffer *buf);
+void tracing_reset_all_cpus(struct array_buffer *buf);
 void tracing_reset_all_online_cpus(void);
 void tracing_reset_all_online_cpus_unlocked(void);
 int tracing_open_generic(struct inode *inode, struct file *filp);
@@ -1449,7 +1456,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS				\
+			C(STACKMAP,		"stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT	-1
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS					\
 		C(PROF_TEXT_OFFSET,	"prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1522,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
+		STACKMAP_FLAGS					\
 		BRANCH_FLAGS					\
 		PROFILER_FLAGS					\
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field(	int,		stack_id	)
+	),
+
+	F_printk("<stack_id %d>", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs		= &trace_user_stack_funcs,
 };
 
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace		= trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type		= TRACE_STACK_ID,
+	.funcs		= &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t
 trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
 	case TRACE_CTX:
 	case TRACE_WAKE:
 	case TRACE_STACK:
+	case TRACE_STACK_ID:
 	case TRACE_PRINT:
 	case TRACE_BRANCH:
 	case TRACE_GRAPH_ENT:
-- 
2.34.1



* [RFC PATCH v4 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-06-16  6:41 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	lipengfei28, zhangbo56
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.

The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:

- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
  bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
  is O(1) even when the table is heavily loaded with claimed-but-
  empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
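
For readers unfamiliar with the scheme, the slot-claiming step can be
sketched in user-space C11 as follows. This is a simplified model, not
the code in trace_stackmap.c: the table size, DEMO_MAX_PROBE and
demo_claim_slot() are hypothetical, and a real map would also compare
the stored stack before treating an equal key as a hit.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SLOTS	16u	/* power of two, tiny for illustration */
#define DEMO_MAX_PROBE	4u	/* stands in for FTRACE_STACKMAP_MAX_PROBE */

static _Atomic uint32_t demo_keys[DEMO_SLOTS];	/* 0 means free */

/*
 * Claim or find the slot for a non-zero key. Returns the slot index,
 * or -1 if the bounded probe sequence found neither a free slot nor
 * the key.
 */
static int demo_claim_slot(uint32_t key)
{
	uint32_t idx = key & (DEMO_SLOTS - 1);

	for (unsigned int probe = 0; probe < DEMO_MAX_PROBE; probe++) {
		uint32_t expected = 0;

		/* Try to take a free slot; on failure 'expected' holds the owner. */
		if (atomic_compare_exchange_strong(&demo_keys[idx], &expected, key))
			return (int)idx;
		if (expected == key)
			return (int)idx;	/* key already present */
		idx = (idx + 1) & (DEMO_SLOTS - 1);	/* linear probe */
	}
	return -1;	/* probe limit reached: caller falls back */
}
```

Because the loop runs at most DEMO_MAX_PROBE times, the cost per call
stays bounded no matter how full the table is.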

The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free
hot path uses cmpxchg in a context that may be reached from NMI.

The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (magic 0x46534D42 'FSMB', version 1,
  all fields native-endian)

ftrace_stackmap_get_id() never truncates: a stack deeper than
FTRACE_STACKMAP_MAX_DEPTH (64) returns -E2BIG so the caller records a
full stack instead. This prevents two distinct traces that share their
first 64 frames from being merged into one stack_id.

Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard. All counters saturate rather than wrap on
long (from-boot, multi-hour) traces: ref_count via
atomic_add_unless(.., INT_MAX) and successes/drops via
local_add_unless(.., LONG_MAX).
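
The saturating behaviour can be illustrated with C11 atomics. This is
only a user-space analogue of atomic_add_unless(.., INT_MAX);
demo_inc_saturating() is a hypothetical helper, not a kernel API.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *counter unless it already holds INT_MAX.
 * Returns true if the increment happened, false if saturated.
 */
static bool demo_inc_saturating(atomic_int *counter)
{
	int old = atomic_load_explicit(counter, memory_order_relaxed);

	while (old != INT_MAX) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(counter, &old, old + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
	}
	return false;
}
```

A saturated counter under-reports but never wraps to a small or
negative value, which keeps the exported statistics monotonic.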

Reset semantics:
- Reset is a control-path operation only allowed when tracing is
  stopped on the owning trace_array. Online reset (with tracing
  active) is intentionally not supported.
- Reset is destructive: under the reader_sem write lock it clears the
  owning trace_array's ring buffer (and snapshot buffer) BEFORE the
  map, so an external observer never sees "trace still has
  <stack_id N> but the map is already empty". The buffers are cleared
  with tracing_reset_all_cpus() rather than _online_cpus() so a
  TRACE_STACK_ID written by a now-offline CPU cannot survive a reset.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
  verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
  ftrace callback path (which runs preempt-disabled).
- The reader_sem (rw_semaphore) serializes the clearing against
  tracefs readers (seq_file iteration and stack_map_bin snapshot),
  which run in process context and aren't covered by
  synchronize_rcu(). Readers take it shared, reset takes it
  exclusive, so a reset cannot tear an iteration in progress. The
  hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
  subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
  loads of entry->key/val are properly ordered after the check
  (control dependencies only order stores per LKMM).
- Concurrent reset, or reset while tracing is active, returns -EBUSY.
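
A reduced model of the claim/check/release sequence is sketched below.
It assumes a single-threaded caller and leaves out synchronize_rcu(),
reader_sem and the ring-buffer clear, which have no direct user-space
equivalent here. demo_map, demo_reset() and demo_get_id() are
hypothetical names, and the -EINVAL returned by demo_get_id() follows
the file header comment of trace_stackmap.c.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_map {
	atomic_int resetting;	/* 1 while a reset owns the map */
	bool tracing_on;	/* models tracer_tracing_is_on() */
	int entries;		/* models the map contents */
};

static int demo_reset(struct demo_map *m)
{
	int expected = 0;

	/* Step 1: claim reset rights; a second resetter gets -EBUSY. */
	if (!atomic_compare_exchange_strong(&m->resetting, &expected, 1))
		return -EBUSY;

	/* Step 2: refuse while tracing is active, releasing the claim. */
	if (m->tracing_on) {
		atomic_store_explicit(&m->resetting, 0, memory_order_release);
		return -EBUSY;
	}

	/* Step 3: clear (the kernel drains callers and clears buffers first). */
	m->entries = 0;

	/* Step 4: release so later get_id() calls see the cleared map. */
	atomic_store_explicit(&m->resetting, 0, memory_order_release);
	return 0;
}

static int demo_get_id(struct demo_map *m)
{
	/* Acquire pairs with the release store in demo_reset(). */
	if (atomic_load_explicit(&m->resetting, memory_order_acquire))
		return -EINVAL;
	return m->entries++;
}
```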

Concurrency notes:
- entry->val publication uses smp_store_release() paired with
  smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
  READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
  use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
  atomic_read() before the contended atomic RMW, avoiding cacheline
  contention once the pool is full. Slots that win cmpxchg but
  cannot get an elt are left 'claimed but empty'; subsequent
  lookups treat val==NULL as a miss and probe past them.
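
The entry->val publication rule can be modeled in user space as
follows. This is a sketch under the assumption that C11 release/acquire
is an adequate stand-in for smp_store_release()/smp_load_acquire();
demo_elt, demo_entry and both helpers are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_DEPTH 4

struct demo_elt {
	unsigned int nr;
	unsigned long ips[DEMO_DEPTH];
};

struct demo_entry {
	_Atomic(struct demo_elt *) val;	/* NULL until fully published */
};

/* Initialize the payload first, then publish the pointer with release. */
static void demo_publish(struct demo_entry *e, struct demo_elt *elt,
			 const unsigned long *ips, unsigned int nr)
{
	if (nr > DEMO_DEPTH)
		nr = DEMO_DEPTH;	/* keep the sketch in bounds */
	elt->nr = nr;
	for (unsigned int i = 0; i < nr; i++)
		elt->ips[i] = ips[i];
	atomic_store_explicit(&e->val, elt, memory_order_release);
}

/* Returns the stored depth, or -1 for a claimed-but-empty slot. */
static long demo_lookup_depth(struct demo_entry *e)
{
	struct demo_elt *elt;

	elt = atomic_load_explicit(&e->val, memory_order_acquire);
	if (!elt)
		return -1;	/* treated as a miss; caller probes on */
	return (long)elt->nr;
}
```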

Hash key:
- Per-instance random seed stored in the stackmap struct (no
  global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
  free-slot sentinel). Full memcmp confirms matches.
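
The zero-sentinel handling can be sketched as below. jhash is not
available in user space, so demo_hash32() substitutes a seeded FNV-1a
purely as a placeholder; all names are hypothetical and only the
"force 0 to 1, then confirm with memcmp" logic mirrors the description.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder hash (FNV-1a with a seed); the patch uses jhash instead. */
static uint32_t demo_hash32(const unsigned long *ips, size_t nr, uint32_t seed)
{
	const unsigned char *p = (const unsigned char *)ips;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < nr * sizeof(*ips); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* 0 is reserved for "free slot", so a hash of 0 is remapped to 1. */
static uint32_t demo_key_from_hash(uint32_t h)
{
	return h ? h : 1;
}

static uint32_t demo_stack_key(const unsigned long *ips, size_t nr, uint32_t seed)
{
	return demo_key_from_hash(demo_hash32(ips, nr, seed));
}

/* A key match is only a candidate; the full stack must compare equal. */
static bool demo_same_stack(const unsigned long *a, size_t na,
			    const unsigned long *b, size_t nb)
{
	return na == nb && memcmp(a, b, na * sizeof(*a)) == 0;
}
```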

Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
  element pool is ~135 MB and a stack_map_bin snapshot may briefly
  allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
  field so data[] is 8-byte aligned on both 32-bit and 64-bit
  architectures, avoiding alignment faults when writing u64 IPs
  on strict-alignment architectures.

Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
  range 10-18, default 14)
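
The sizing figures above can be checked with a short calculation. The
sketch assumes a 64-bit layout (8-byte instruction pointers, modeled
with uint64_t) and uses hypothetical names; it mirrors the clamp range
and element layout described in this message rather than the kernel
code itself.

```c
#include <stdint.h>

#define DEMO_BITS_MIN	10u
#define DEMO_BITS_MAX	18u
#define DEMO_MAX_DEPTH	64

/* Models struct stackmap_elt on a 64-bit kernel: 4 + 4 + 64 * 8 bytes. */
struct demo_elt {
	uint32_t nr;
	int32_t ref_count;		/* models atomic_t */
	uint64_t ips[DEMO_MAX_DEPTH];	/* models unsigned long */
};

/* Clamp a requested bits value into [DEMO_BITS_MIN, DEMO_BITS_MAX]. */
static unsigned int demo_clamp_bits(unsigned long requested)
{
	if (requested < DEMO_BITS_MIN)
		return DEMO_BITS_MIN;
	if (requested > DEMO_BITS_MAX)
		return DEMO_BITS_MAX;
	return (unsigned int)requested;
}

/* Element pool size in bytes for a requested bits value. */
static uint64_t demo_pool_bytes(unsigned long requested)
{
	return ((uint64_t)1 << demo_clamp_bits(requested)) *
	       sizeof(struct demo_elt);
}
```

With these assumptions the default bits=14 gives 16384 * 520 bytes
(about 8.5 MB) and bits=18 gives 262144 * 520 bytes (about 136 MB, or
130 MiB).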

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/Kconfig          |  22 +
 kernel/trace/Makefile         |   1 +
 kernel/trace/trace_stackmap.c | 889 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h |  57 +++
 4 files changed, 969 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
 
 	  Say N if unsure.
 
+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing to this file resets the stack map. Reading shows all unique
+	  stacks with their stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..9e9fdf85071d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,889 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ *   bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ *   cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ *   - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ *     and blocks new get_id() callers (they observe resetting=1 and
+ *     return -EINVAL).
+ *   - trace_types_lock serializes the tracer_tracing_is_on() check and
+ *     the destructive ring-buffer reset against tracefs writes to
+ *     tracing_on.
+ *   - synchronize_rcu() drains in-flight get_id() callers from the
+ *     ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val); they are also serialized
+ * against reset via smap->reader_sem (readers take it in shared
+ * mode, reset in exclusive mode), so a reset cannot tear an
+ * iteration in progress -- it waits for active readers to drop
+ * the rwsem before clearing the map. The hot path is coordinated
+ * with reset separately, via acquire/release on smap->resetting.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE	64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32		nr;		/* actual number of IPs */
+	atomic_t	ref_count;
+	unsigned long	ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32			key;	/* 0 = free, non-zero = jhash */
+	struct stackmap_elt	*val;	/* NULL until fully published */
+};
+
+static struct stackmap_elt *stackmap_load_elt(struct stackmap_entry *entry)
+{
+	/*
+	 * Pairs with the smp_store_release() that publishes entry->val
+	 * after fully initializing the element payload.
+	 */
+	return smp_load_acquire(&entry->val);
+}
+
+struct ftrace_stackmap {
+	struct trace_array	*tr;		/* owning trace_array */
+	unsigned int		map_bits;
+	unsigned int		map_size;	/* 1 << (map_bits + 1) */
+	unsigned int		max_elts;	/* 1 << map_bits */
+	u32			hash_seed;	/* per-instance jhash seed */
+	atomic_t		next_elt;	/* index into elts pool */
+	struct stackmap_entry	*entries;	/* hash table */
+	struct stackmap_elt	*elts;		/* flat element pool */
+	atomic_t		resetting;
+	/*
+	 * Reader/reset serialization. Held in shared mode (read lock)
+	 * across seq_file iteration and binary snapshot construction;
+	 * held in exclusive mode (write lock) by reset's clearing
+	 * phase. The hot path (get_id) does not take this lock — it
+	 * uses smp_load_acquire/smp_store_release on entry->val and
+	 * the resetting flag for the lock-free protocol.
+	 */
+	struct rw_semaphore	reader_sem;
+	/*
+	 * Per-CPU counters using local_t. local_t increments are NMI-
+	 * safe on all architectures (single-instruction or interrupt-
+	 * masked) and avoid the raw_spinlock_t fallback that
+	 * atomic64_t uses on 32-bit GENERIC_ATOMIC64 — which would
+	 * deadlock if an NMI hit while the spinlock was held.
+	 */
+	local_t __percpu	*successes;	/* events served (hits + new inserts) */
+	local_t __percpu	*drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ *   bits=18 → 256K elts, 512K slots, ~135 MB elt pool, ~135 MB bin
+ *             export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN	10
+#define FTRACE_STACKMAP_BITS_MAX	18
+#define FTRACE_STACKMAP_BITS_DEFAULT	14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	/*
+	 * Fast-path early-out once the pool is fully consumed. Avoids
+	 * the contended atomic RMW on next_elt for every traced event
+	 * after the pool is exhausted.
+	 */
+	if (atomic_read(&smap->next_elt) >= smap->max_elts)
+		return NULL;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return &smap->elts[idx];
+	return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+	struct ftrace_stackmap *smap;
+	unsigned int bits;
+
+	smap = kzalloc_obj(*smap, GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	/* Defensive clamp: bound bits even if the early_param path is bypassed. */
+	bits = clamp_val(stackmap_map_bits,
+			 FTRACE_STACKMAP_BITS_MIN,
+			 FTRACE_STACKMAP_BITS_MAX);
+
+	smap->tr = tr;
+	smap->map_bits = bits;
+	smap->max_elts = 1U << bits;
+	smap->map_size = 1U << (bits + 1);	/* 2x over-provision */
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Single large vmalloc of the element pool, indexed flat.
+	 * At bits=18 this is 256K * sizeof(struct stackmap_elt). On 64-bit
+	 * the struct is 520 B (4 + 4 + 64*8), so the total is ~135 MB.
+	 */
+	smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+	if (!smap->elts) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->successes = alloc_percpu(local_t);
+	if (!smap->successes) {
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+	smap->drops = alloc_percpu(local_t);
+	if (!smap->drops) {
+		free_percpu(smap->successes);
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->hash_seed = get_random_u32();
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	init_rwsem(&smap->reader_sem);
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	free_percpu(smap->drops);
+	free_percpu(smap->successes);
+	vfree(smap->elts);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller must be in process context (typically the tracefs write handler).
+ *
+ * Protocol:
+ *   1. Atomically claim reset rights via cmpxchg on @resetting.
+ *   2. Take trace_types_lock to serialize against tracefs writes to
+ *      tracing_on.
+ *   3. Verify tracing is stopped on @smap->tr; if not, release the
+ *      claim and return -EBUSY. The resetting flag itself blocks
+ *      any subsequent get_id() callers.
+ *   4. synchronize_rcu() drains in-flight get_id() callers from the
+ *      ftrace callback path (which runs preempt-disabled).
+ *   5. Under the reader_sem write lock, reset the ring buffer(s), then
+ *      clear entries, elts, and counters.
+ *   6. Release the resetting flag with release semantics so any new
+ *      get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	struct trace_array *tr;
+	int ret = 0;
+
+	if (!smap)
+		return 0;
+
+	if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+		return -EBUSY;
+
+	mutex_lock(&trace_types_lock);
+
+	tr = smap->tr;
+	if (tr && tracer_tracing_is_on(tr)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+	 * is needed before it. It drains in-flight ftrace callbacks that
+	 * may have already passed the resetting check with the old value.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * Take the reader_sem in exclusive mode. This serializes the
+	 * memset against any tracefs reader (seq_file iteration or
+	 * stack_map_bin snapshot) that may currently hold the rwsem
+	 * for read. synchronize_rcu() already drained the hot path;
+	 * this rwsem covers process-context readers that aren't
+	 * preempt-disabled.
+	 */
+	down_write(&smap->reader_sem);
+
+	/*
+	 * Clear the ring buffer(s) BEFORE the map, both under the write
+	 * lock. The ring buffer may still hold TRACE_STACK_ID events
+	 * whose stack_id points at slots we are about to free/reuse.
+	 * Resetting the buffer first guarantees an external observer
+	 * never sees the inconsistent "trace still has <stack_id N> but
+	 * the map is already empty" window: it sees either (old buffer,
+	 * old map) or (cleared buffer, old map) or (cleared buffer,
+	 * cleared map) -- never (old buffer, cleared map).
+	 *
+	 * Use tracing_reset_all_cpus() (not _online_cpus) so per-CPU
+	 * buffers belonging to currently offline CPUs are also cleared.
+	 * The ring buffer is allocated per-possible-CPU; an offline CPU's
+	 * buffer can still hold a TRACE_STACK_ID event written before
+	 * the CPU went offline. tracing_reset_online_cpus() iterates
+	 * for_each_online_buffer_cpu() and would leave that data behind
+	 * to be observed once the CPU comes back online (or by the
+	 * trace reader, which iterates all allocated CPU buffers),
+	 * recreating the stale-stack_id window we are trying to close.
+	 *
+	 * Since reset requires tracing to be stopped, this makes "reset"
+	 * an explicitly destructive operation on the owning trace_array,
+	 * keeping ring-buffer stack_ids and the map coherent.
+	 */
+	if (tr) {
+		tracing_reset_all_cpus(&tr->array_buffer);
+#ifdef CONFIG_TRACER_SNAPSHOT
+		if (tr->allocated_snapshot)
+			tracing_reset_all_cpus(&tr->snapshot_buffer);
+#endif
+	}
+
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+	memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+	atomic_set(&smap->next_elt, 0);
+	{
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			local_set(per_cpu_ptr(smap->successes, cpu), 0);
+			local_set(per_cpu_ptr(smap->drops, cpu), 0);
+		}
+	}
+
+	up_write(&smap->reader_sem);
+
+out_unlock:
+	mutex_unlock(&trace_types_lock);
+
+	/* Release resetting=0 so new get_id() observes a cleared map. */
+	atomic_set_release(&smap->resetting, 0);
+	return ret;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int probes = 0;
+
+	/*
+	 * atomic_read_acquire() pairs with atomic_set_release() in the
+	 * reset path. This ensures that subsequent reads of entry->key
+	 * and entry->val are ordered after this check; without acquire,
+	 * the CPU would only have a control dependency, which orders
+	 * subsequent stores but not loads (per LKMM).
+	 */
+	if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+		return -EINVAL;
+	/*
+	 * Never truncate: a stack deeper than the map can hold must not be
+	 * silently shortened, or two distinct traces sharing their first
+	 * FTRACE_STACKMAP_MAX_DEPTH frames would be merged into one
+	 * stack_id. The caller is expected to fall back to a full stack
+	 * trace for such events. Reject defensively in case of a future
+	 * caller that forgets this contract.
+	 */
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		return -E2BIG;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  smap->hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		/*
+		 * READ_ONCE() to avoid LKMM data race with concurrent
+		 * cmpxchg(&entry->key, 0, key_hash) on this slot.
+		 */
+		test_key = READ_ONCE(entry->key);
+
+		if (test_key == key_hash) {
+			val = stackmap_load_elt(entry);
+			/*
+			 * READ_ONCE(val->nr) keeps style consistent with
+			 * the seq_show / bin_open readers. nr is write-once
+			 * (set before publish, never modified afterwards),
+			 * so the load is data-race-free, but READ_ONCE
+			 * silences any analysis tool that flags a plain
+			 * read of a field that is also read under acquire
+			 * elsewhere.
+			 */
+			if (val && READ_ONCE(val->nr) == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				/*
+				 * ref_count is a best-effort popularity
+				 * counter. On a long (from-boot, multi-hour)
+				 * trace a hot stack can be hit billions of
+				 * times. atomic_add_unless() gives true
+				 * saturation at INT_MAX even under concurrent
+				 * hits on multiple CPUs (a plain
+				 * check-then-inc could let several CPUs past
+				 * the check near the cap and still wrap).
+				 */
+				atomic_add_unless(&val->ref_count, 1, INT_MAX);
+				/*
+				 * successes/drops are best-effort throughput
+				 * counters. Saturate at LONG_MAX so they do
+				 * not wrap on long runs (notably where local_t
+				 * is 32-bit), matching ref_count's behaviour.
+				 */
+				local_add_unless(this_cpu_ptr(smap->successes),
+						 1, LONG_MAX);
+				return (int)idx;
+			}
+			/*
+			 * val == NULL: another CPU is mid-insert, or this
+			 * slot is "claimed but empty" (pool exhausted).
+			 * val != NULL but mismatch: 32-bit hash collision
+			 * with a different stack. In both cases, advance.
+			 */
+		} else if (!test_key) {
+			/*
+			 * Free slot: try to claim it.
+			 *
+			 * If two CPUs race here with the same key_hash
+			 * (same stack), one loses the cmpxchg, advances,
+			 * and may insert the same stack at a later slot.
+			 * This can produce a small number of duplicate
+			 * entries under heavy contention. The trade-off
+			 * is accepted to keep the hot path lock-free;
+			 * ref_count is split across the duplicates and
+			 * total memory cost is bounded by the element
+			 * pool size.
+			 */
+			if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this
+					 * slot with cmpxchg but cannot fill
+					 * it. Leave key set so the slot
+					 * stays "claimed but empty" — future
+					 * lookups treat val==NULL as a miss
+					 * and probe past it. Cannot revert
+					 * key=0 without racing other CPUs.
+					 */
+					local_add_unless(this_cpu_ptr(smap->drops),
+							 1, LONG_MAX);
+					return -ENOSPC;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/*
+				 * Publish elt with release semantics so the
+				 * reader's smp_load_acquire can safely
+				 * dereference val->nr / val->ips.
+				 */
+				smp_store_release(&entry->val, elt);
+				local_add_unless(this_cpu_ptr(smap->successes),
+						 1, LONG_MAX);
+				return (int)idx;
+			}
+			/* cmpxchg failed; another CPU claimed this slot. */
+		}
+
+		idx++;
+		probes++;
+	}
+
+	local_add_unless(this_cpu_ptr(smap->drops), 1, LONG_MAX);
+	return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+	struct ftrace_stackmap	*smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	/*
+	 * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which holds it for write while clearing the table. Released in
+	 * stackmap_seq_stop(), which seq_file calls regardless of whether
+	 * start() returned an element or NULL (per Documentation/filesystems
+	 * /seq_file.rst: "the iterator value returned by start() or next()
+	 * is guaranteed to be passed to a subsequent next() or stop()").
+	 */
+	down_read(&smap->reader_sem);
+	for (i = *pos; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    stackmap_load_elt(&smap->entries[i])) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos + 1; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    stackmap_load_elt(&smap->entries[i])) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	/*
+	 * Advance *pos past the end so that on the next read() the
+	 * subsequent stackmap_seq_start() call returns NULL and the
+	 * iteration terminates. Without this, seq_read() would loop
+	 * on the last element.
+	 */
+	*pos = smap->map_size;
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+
+	/*
+	 * seq_file invokes stop() unconditionally after each iteration
+	 * pass (see seq_read_iter / traverse), even when start() returned
+	 * NULL. Always release here, balanced against the down_read in
+	 * stackmap_seq_start().
+	 */
+	if (smap)
+		up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_seq_private *priv = m->private;
+	struct stackmap_elt *elt;
+	u32 idx = entry - priv->smap->entries;
+	u32 i, nr;
+
+	elt = stackmap_load_elt(entry);
+	if (!elt)
+		return 0;
+
+	nr = READ_ONCE(elt->nr);
+	if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+		nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), nr);
+	for (i = 0; i < nr; i++) {
+		unsigned long ip = elt->ips[i];
+
+		/*
+		 * Mirror trace_stack_print(): __ftrace_trace_stack()
+		 * may replace trampoline addresses with
+		 * FTRACE_TRAMPOLINE_MARKER before the stack reaches the
+		 * map, and normal addresses must go through
+		 * trace_adjust_address() (KASLR / module text delta)
+		 * before symbolization. Without this the export would
+		 * print a bogus symbol for the marker and unadjusted
+		 * addresses for everything else.
+		 */
+		if (ip == FTRACE_TRAMPOLINE_MARKER) {
+			seq_printf(m, "  [%u] [FTRACE TRAMPOLINE]\n", i);
+			continue;
+		}
+		seq_printf(m, "  [%u] %pS\n", i,
+			   (void *)trace_adjust_address(priv->smap->tr, ip));
+	}
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+	if (n > 0 && buf[n - 1] == '\n')
+		n--;
+	return (n == 1 && buf[0] == '0') ||
+	       (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+	int ret;
+
+	if (n == 0)
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+
+	if (!stackmap_write_is_reset(buf, n))
+		return -EINVAL;
+
+	/*
+	 * ftrace_stackmap_reset() atomically claims reset rights via
+	 * cmpxchg and returns -EBUSY if another reset is in progress
+	 * or if tracing is active.
+	 */
+	ret = ftrace_stackmap_reset(priv->smap);
+	if (ret)
+		return ret;
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open		= stackmap_open,
+	.read		= seq_read,
+	.write		= stackmap_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u64 successes = 0, drops = 0;
+	u32 entries;
+	int cpu;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	for_each_possible_cpu(cpu) {
+		successes += local_read(per_cpu_ptr(smap->successes, cpu));
+		drops += local_read(per_cpu_ptr(smap->drops, cpu));
+	}
+
+	seq_printf(m, "entries:      %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size:   %u\n", smap->map_size);
+	seq_printf(m, "successes:    %llu\n", successes);
+	seq_printf(m, "drops:        %llu\n", drops);
+	if (successes + drops > 0)
+		seq_printf(m, "success_rate: %llu%%\n",
+			   successes * 100 / (successes + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open		= stackmap_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	/*
+	 * Use u64 (not size_t) so data[] is 8-byte aligned on both
+	 * 32-bit and 64-bit architectures. The IP array within data[]
+	 * is accessed as u64*, which would alignment-fault on strict
+	 * architectures (e.g. older ARM, SPARC) if data[] started at
+	 * a 4-byte boundary.
+	 */
+	u64	size;
+	char	data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 nr_entries, i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Worst-case allocation size: every populated entry uses a
+	 * full-depth stack. The (+1) gives one slack slot in case a
+	 * concurrent insert lands between this snapshot and iteration.
+	 * The loop below performs an explicit bounds check anyway.
+	 *
+	 * At bits=18 this caps at ~135 MB. The file is mode 0440
+	 * (TRACE_MODE_READ), so only privileged users can open it.
+	 */
+	nr_entries = atomic_read(&smap->next_elt);
+	alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+		     (sizeof(struct ftrace_stackmap_bin_entry) +
+		      FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	/*
+	 * Take reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which clears the table and elt pool under the write lock.
+	 */
+	down_read(&smap->reader_sem);
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k, nr;
+
+		if (!READ_ONCE(entry->key))
+			continue;
+		elt = stackmap_load_elt(entry);
+		if (!elt)
+			continue;
+
+		nr = READ_ONCE(elt->nr);
+		if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+			nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+		/* Bounds check: stop if we would overflow the allocation. */
+		if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+			break;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < nr; k++) {
+			unsigned long ip = elt->ips[k];
+
+			/*
+			 * Emit the trampoline marker verbatim so userspace
+			 * can render it as [FTRACE TRAMPOLINE]; pass every
+			 * other address through trace_adjust_address() so the
+			 * binary export follows the same address-adjustment
+			 * rules as the text export.
+			 */
+			if (ip == FTRACE_TRAMPOLINE_MARKER)
+				ips_out[k] = (u64)FTRACE_TRAMPOLINE_MARKER;
+			else
+				ips_out[k] = (u64)trace_adjust_address(smap->tr, ip);
+		}
+		off += nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	up_read(&smap->reader_sem);
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open		= stackmap_bin_open,
+	.read		= stackmap_bin_read,
+	.llseek		= default_llseek,
+	.release	= stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..7c2e5ab9d36d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH	64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC	0x46534D42	/* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION	1
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
-- 
2.34.1



* [RFC PATCH v4 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-06-16  6:41 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	lipengfei28, zhangbo56

From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Masami, Steven, all,

This is v4 of the ftrace stackmap series. It is sent as a new thread.

v3: https://lore.kernel.org/all/cover.1779769138.git.lipengfei28@xiaomi.com/

The series adds stack trace deduplication to ftrace. When the
'stackmap' option is enabled alongside 'stacktrace', the ring buffer
stores a 4-byte stack_id instead of a full kernel stack trace, and the
full stacks are exported once via tracefs (stack_map / stack_map_bin).

Rebased onto v7.1-rc5 (e8c2f9fdadee).

Motivation
==========

The target use case is long-duration, from-boot kernel tracing where
the same stacks recur enormously often and the bottleneck is ring
buffer space, not CPU.

Concretely: tracing the slab allocator from boot for hours to study
memory aging and to catch the allocation backtraces behind a usage
peak. With a stacktrace trigger on the slab tracepoints, every event
today carries a full kernel stack (~80-160 bytes). On a fixed-size
ring buffer that bounds how far back in time the trace reaches: the
buffer wraps in seconds to minutes and the early-boot history -- the
part we care about -- is overwritten before it can be consumed.

In this workload the set of distinct stacks is small and highly
repetitive, so storing a 4-byte stack_id per event and the full stack
only once dramatically increases the time span a given buffer covers.
The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis.

This is complementary to, not a replacement for, the existing full
stack recording: deep stacks and the early pre-init window still fall
back to full stacks (see below).

Effect on retention
====================

Same fixed per-CPU buffer, slab allocation workload with a shallow
kernel stack (kmem_cache_alloc), stackmap OFF vs ON:

                  retained events   bytes/event   time span
  stackmap OFF        645,068          ~104 B        15.0 s
  stackmap ON       1,397,741           ~48 B        27.7 s
  improvement        2.17x             2.17x          1.85x

The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.
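
As a quick sanity check, the ratios in the table can be re-derived from
its raw columns. A minimal Python sketch that uses only the figures
quoted above (the variable names are illustrative and not part of the
series):

```python
# Figures copied from the retention table (stackmap OFF vs ON).
off = {"events": 645_068, "bytes_per_event": 104, "span_s": 15.0}
on = {"events": 1_397_741, "bytes_per_event": 48, "span_s": 27.7}

# More events retained in the same buffer.
events_ratio = on["events"] / off["events"]
# Each event is smaller, so this ratio is OFF over ON.
size_ratio = off["bytes_per_event"] / on["bytes_per_event"]
# The buffer reaches further back in time.
span_ratio = on["span_s"] / off["span_s"]

print(f"events {events_ratio:.2f}x, size {size_ratio:.2f}x, "
      f"span {span_ratio:.2f}x")
```

The event and size ratios agree because the buffer size is fixed: halving
the bytes per event roughly doubles the number of events that fit.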

Changes since v3
================

Correctness:
  - Deep stacks are never silently truncated or merged. A stack deeper
    than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
    trace; ftrace_stackmap_get_id() returns -E2BIG rather than
    truncating, so two distinct stacks sharing their first 64 frames
    can no longer collapse to one stack_id.
  - reset is now genuinely destructive and coherent: under the
    reader_sem write lock it clears the owning trace_array's ring
    buffer (and snapshot) BEFORE clearing the map, and uses
    tracing_reset_all_cpus() rather than _online_cpus() so a
    TRACE_STACK_ID written by a now-offline CPU cannot survive and
    dangle against a cleared map.
  - __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer
    slot before inserting into the map, so stack_map_stat counters and
    ref_count stay consistent with what the ring buffer actually
    references (failed reservation -> full stack, map untouched).
  - ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
    instead of wrapping on multi-hour, billions-of-hits traces.

Global-instance gating:
  - Enabling 'stackmap' on a secondary instance via the aggregate
    trace_options file is now rejected, not just hidden in the
    per-instance options/ directory.
  - tracefs init is failure-atomic: the required stack_map file is
    created before the map pointer is published; if it cannot be
    created the map is destroyed and never published. An init-state
    (PENDING/DONE/FAILED) lets boot-time trace_options=stackmap set
    the flag before the map exists (hot path falls back until it is
    published) while still rejecting enables after a permanent init
    failure, so options/stackmap never reports an enabled no-op.

ABI / tooling:
  - Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
    upstream ABI). Documentation, tool and selftest updated to match.
  - Text and binary exports now follow the same trampoline-marker and
    trace_adjust_address() handling as the normal stack print path.
  - stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
    trampoline marker only in the resolved 'symbols'.

Selftests:
  - New stackmap-reset.tc: verifies reset clears stale <stack_id N>
    from the trace buffer and checks the stack_map_bin magic/version.
  - stackmap-instance-gate.tc extended to verify the trace_options
    write path is rejected on a secondary instance.
  - stackmap-basic.tc no longer treats a nonzero drops count as a
    failure (drops is a by-design fallback); only zero successes with
    nonzero drops is fatal.
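
The never-truncate and saturation rules listed under "Correctness" can be
modelled in a few lines. This is only a single-threaded Python sketch of
the lookup shape (bounded linear probe over a 2x over-provisioned table),
not the kernel code: it has no cmpxchg, no hash seed and no "claimed but
empty" slots, and MAX_PROBE is an assumed stand-in for
FTRACE_STACKMAP_MAX_PROBE, whose real value is not quoted here.

```python
import errno

MAX_DEPTH = 64        # FTRACE_STACKMAP_MAX_DEPTH
MAX_PROBE = 8         # assumption: stands in for FTRACE_STACKMAP_MAX_PROBE
INT_MAX = 2**31 - 1


class StackMapModel:
    """Single-threaded model: open addressing with a bounded linear probe."""

    def __init__(self, bits: int):
        self.size = 1 << (bits + 1)      # 2x over-provisioned table
        self.max_elts = 1 << bits        # element pool capacity
        self.slots = [None] * self.size  # each slot: [ips_tuple, ref_count]
        self.used = 0

    def get_id(self, ips) -> int:
        if not ips:
            return -errno.EINVAL
        if len(ips) > MAX_DEPTH:
            return -errno.E2BIG          # caller falls back to a full stack
        key = tuple(ips)
        idx = hash(key) % self.size
        for _ in range(MAX_PROBE):
            slot = self.slots[idx]
            if slot is None:
                if self.used == self.max_elts:
                    return -errno.ENOSPC  # pool exhausted
                self.slots[idx] = [key, 1]
                self.used += 1
                return idx
            if slot[0] == key:
                # Saturate at INT_MAX instead of wrapping.
                slot[1] = min(slot[1] + 1, INT_MAX)
                return idx
            idx = (idx + 1) % self.size
        return -errno.ENOSPC              # probe budget exhausted


smap = StackMapModel(bits=4)
first = smap.get_id([0x1000, 0x2000, 0x3000])
again = smap.get_id([0x1000, 0x2000, 0x3000])
too_deep = smap.get_id(list(range(MAX_DEPTH + 1)))
```

Here `first` and `again` are the same slot index with a ref_count of 2,
while `too_deep` is -E2BIG, so a 65-frame stack is never folded into the
id of a stack that shares its first 64 frames.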

Open questions for maintainers
==============================

Two design points where I would value direction before polishing
further:

1. Eager vs lazy allocation. The element pool is allocated at
   fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
   userspace ever enables the option (~8 MB at the default bits=14,
   up to ~135 MB at bits=18). This keeps the hot path allocation-free
   with no allocation-failure path under tracing pressure. Is eager
   allocation acceptable, or would you prefer lazy allocation on the
   first 'echo 1 > options/stackmap'?

2. Binary ABI now or later. stack_map_bin is a new tracefs binary
   interface (magic 0x46534D42, version 1). Is it acceptable to
   introduce it now, or would you prefer the first version ship with
   the text stack_map interface only and add the binary export once
   trace-cmd / libtraceevent integration is designed?
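
For question 1, the eager-allocation cost can be estimated from the size
breakdown given in the comment in ftrace_stackmap_create(). A rough Python
sketch, assuming a 64-bit kernel and counting only the element pool (the
hash table itself is extra and its entry size is not quoted here):

```python
MAX_DEPTH = 64  # FTRACE_STACKMAP_MAX_DEPTH

# Per-element size as broken down in the ftrace_stackmap_create()
# comment: 8 + 4 + 4 bytes of bookkeeping plus 64 eight-byte IPs.
# Assumes sizeof(unsigned long) == 8.
ELT_SIZE = 8 + 4 + 4 + MAX_DEPTH * 8


def pool_bytes(bits: int) -> int:
    """Element pool size in bytes for a map with 2**bits elements."""
    return (1 << bits) * ELT_SIZE


for bits in (14, 18):
    print(f"bits={bits}: {pool_bytes(bits) / 2**20:.2f} MiB")
```

Under these assumptions the pool is 8.25 MiB at bits=14 and 132 MiB at
bits=18, in line with the rounded ~8 MB and ~135 MB figures above.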

Test results
============

QEMU (aarch64 virt, v7.1-rc5 + this series), boot to init smoke test:
  - stackmap functional suite: 16/16 PASS, including reset clearing the
    trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
    magic/version, global-vs-secondary instance gating, and the
    trace_options rejection on a secondary instance.
  - boot-time activation (trace_options=stackmap,stacktrace on the
    kernel cmdline): 3/3 PASS -- the option survives the
    pre-initialization window and the map is live once published.
  - ftrace startup self-tests pass with the new TRACE_STACK_ID entry.

Device retention numbers above were collected on a Xiaomi SM8850
(ARM64) running an Android workload, comparing the same buffer with
the option off and on.

Known limitations
=================

- Per-instance stackmap support is not included; the option is gated
  to the global instance (in the tracefs layout and at the
  set_tracer_flag() write path). Per-instance maps are a follow-up.
- Deduplication is best-effort, not strict: under heavy concurrent
  contention two CPUs racing with the same stack hash may each claim a
  different slot, producing a few duplicate entries; ref_count is then
  split across them. This keeps the hot path lock-free.
- The stackmap covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, serialized against reset
  but not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles (see open question 2).
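
To make the stack_map_bin layout concrete while the format is under
discussion, here is a minimal Python sketch of a reader. It follows
struct ftrace_stackmap_bin_header and struct ftrace_stackmap_bin_entry
from trace_stackmap.h; it is not the stackmap_dump.py tool from this
series, and it assumes a little-endian producer (the kernel writes the
fields in native byte order).

```python
import struct

MAGIC = 0x46534D42   # FTRACE_STACKMAP_BIN_MAGIC ('FSMB')
VERSION = 1          # FTRACE_STACKMAP_BIN_VERSION

# Assumption: little-endian image; use ">" for a big-endian producer.
HDR = struct.Struct("<4I")    # magic, version, nr_stacks, reserved
ENTRY = struct.Struct("<4I")  # stack_id, nr, ref_count, reserved


def parse_stack_map_bin(blob: bytes) -> dict:
    """Return {stack_id: (ref_count, [ip, ...])} from a binary image."""
    magic, version, nr_stacks, _ = HDR.unpack_from(blob, 0)
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a version-1 stack_map_bin image")
    off = HDR.size
    stacks = {}
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _ = ENTRY.unpack_from(blob, off)
        off += ENTRY.size
        # Each entry header is followed by nr 64-bit addresses.
        ips = struct.unpack_from(f"<{nr}Q", blob, off)
        off += 8 * nr
        stacks[stack_id] = (ref_count, list(ips))
    return stacks


# Synthetic image with one two-frame stack, for illustration only.
sample = (HDR.pack(MAGIC, VERSION, 1, 0)
          + ENTRY.pack(7, 2, 3, 0)
          + struct.pack("<2Q", 0x1000, 0x2000))
stacks = parse_stack_map_bin(sample)
```

On a real system the bytes would come from reading the stack_map_bin file
once; since the export is a snapshot taken at open time, a reader gets a
consistent image for the lifetime of that file descriptor.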

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
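
With both options on, the stack_ids in the trace are resolved against the
text stack_map file. A minimal Python sketch of parsing that text, based
on the format printed by stackmap_seq_show() ("stack_id N [ref R, depth
D]" followed by one "[i] symbol" line per frame). The sample input and
its symbol are made up for illustration:

```python
import re

HEADER_RE = re.compile(r"^stack_id (\d+) \[ref (\d+), depth (\d+)\]$")
FRAME_RE = re.compile(r"^\s+\[(\d+)\] (.+)$")


def parse_stack_map(text: str) -> dict:
    """Return {stack_id: {"ref": int, "frames": [str, ...]}}."""
    stacks = {}
    current = None
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            current = {"ref": int(m.group(2)), "frames": []}
            stacks[int(m.group(1))] = current
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            current["frames"].append(m.group(2))
    return stacks


# Made-up sample in the stack_map text format; not real trace output.
sample = (
    "stack_id 42 [ref 3, depth 2]\n"
    "  [0] kmem_cache_alloc+0x10/0x20\n"
    "  [1] [FTRACE TRAMPOLINE]\n"
    "\n"
)
resolved = parse_stack_map(sample)
```

Trampoline frames come through as the literal "[FTRACE TRAMPOLINE]"
string, matching what the text export prints for the marker address.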

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 177 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  22 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          | 216 ++++-
 kernel/trace/trace.h                          |  17 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_selftest.c                 |   1 +
 kernel/trace/trace_stackmap.c                 | 889 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 111 +++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  54 ++
 .../ftrace/test.d/ftrace/stackmap-reset.tc    |  76 ++
 tools/tracing/stackmap_dump.py                | 164 ++++
 15 files changed, 1821 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1


* [PATCH 5/7] doc: adm1275: Add ROHM BD12790
From: Matti Vaittinen @ 2026-06-16  6:38 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

From: Matti Vaittinen <mazziesaccount@gmail.com>

Add the ROHM BD12790 to the list of the ICs supported by the adm1275
driver.

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
---
I have not found a public datasheet yet. I will add a link when one is
available.

 Documentation/hwmon/adm1275.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/Documentation/hwmon/adm1275.rst b/Documentation/hwmon/adm1275.rst
index 8a793dd2b412..d8495be313b8 100644
--- a/Documentation/hwmon/adm1275.rst
+++ b/Documentation/hwmon/adm1275.rst
@@ -83,6 +83,14 @@ Supported chips:
 
     Datasheet: https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780amuv-lb-e.pdf
 
+  * ROHM Semiconductor BD12790
+
+    Prefix: 'bd12790'
+
+    Addresses scanned: -
+
+    Datasheet: -
+
   * Silergy SQ24905C
 
     Prefix: 'mc09c'
-- 
2.54.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* [PATCH 4/7] dt-bindings: adm1275: ROHM BD12790 hot-swap controller
From: Matti Vaittinen @ 2026-06-16  6:37 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1456 bytes --]

From: Matti Vaittinen <mazziesaccount@gmail.com>

Support the ROHM BD12790 hot-swap controller, which is largely compatible
with the Analog Devices ADM1272.

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
---
 Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
index bc67510ef3ab..e231964a6706 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
@@ -33,6 +33,9 @@ description: |
     https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780muv-lb-e.pdf
     https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780amuv-lb-e.pdf
 
+  The BD12790 is a ROHM hot-swap controller, functionally similar to the
+  ADM1272.
+
 properties:
   compatible:
     oneOf:
@@ -48,6 +51,7 @@ properties:
             - adi,adm1293
             - adi,adm1294
             - rohm,bd12780
+            - rohm,bd12790
             - silergy,mc09c
 
 # Require BD12780 as a fall-back for BD12780A.
@@ -104,6 +108,7 @@ allOf:
             enum:
               - adi,adm1272
               - adi,adm1273
+              - rohm,bd12790
     then:
       properties:
         adi,volt-curr-sample-average:
-- 
2.54.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* [PATCH 3/7] hwmon: adm1275: Support ROHM BD12780
From: Matti Vaittinen @ 2026-06-16  6:36 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6130 bytes --]

From: Matti Vaittinen <mazziesaccount@gmail.com>

The ROHM BD12780 and BD12780A are hot-swap controllers largely similar
to the Analog Devices ADM1278. Besides differing ID registers and some
added functionality, the BD12780 and BD12780A mark PMON_CONFIG bits
[15:14] as reserved. Hence the TSFILT setting must be omitted on these ICs.

The BD12780 has 3 pins usable for configuring the I2C address. The
BD12780A lists the ADDR3-pin as "not connect".

Support ROHM BD12780 and BD12780A controllers.

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
---
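As an illustration only, the per-chip PMON_CONFIG default selection
described above can be sketched in user-space C as below. ADM1278_VOUT_EN
is BIT(1) as in the driver context shown in this patch. The bit positions
used for ADM1278_TEMP1_EN and ADM1278_TSFILT, and the helper names
pmon_defconfig() and pmon_apply_defconfig(), are assumptions made for this
sketch and are not taken from the driver.

```c
#include <stdint.h>

#define BIT(n)			(1U << (n))

/* VOUT_EN matches the patch context; the other two positions are assumed. */
#define ADM1278_VOUT_EN		BIT(1)
#define ADM1278_TEMP1_EN	BIT(3)	/* assumption for this sketch */
#define ADM1278_TSFILT		BIT(15)	/* assumption: inside reserved [15:14] */

#define ADM1278_PMON_DEFCONFIG	(ADM1278_VOUT_EN | ADM1278_TEMP1_EN | ADM1278_TSFILT)
/* The BD12780 must not touch the reserved TSFILT bit. */
#define BD12780_PMON_DEFCONFIG	(ADM1278_VOUT_EN | ADM1278_TEMP1_EN)

enum chips { adm1278, bd12780 };

/* Pick the default PMON_CONFIG bits for the given chip. */
static uint16_t pmon_defconfig(enum chips id)
{
	return id == bd12780 ? BD12780_PMON_DEFCONFIG : ADM1278_PMON_DEFCONFIG;
}

/*
 * Return the config value with all default bits set. *needs_write is set
 * to 1 only when at least one default bit was missing, so an unnecessary
 * register write can be skipped.
 */
static uint16_t pmon_apply_defconfig(uint16_t config, uint16_t defconfig,
				     int *needs_write)
{
	*needs_write = (config & defconfig) != defconfig;
	return config | defconfig;
}
```

With these assumed bit positions, a BD12780 whose PMON_CONFIG already has
VOUT and TEMP1 enabled needs no write, while the same register value on
an ADM1278 would still get TSFILT set.
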
 drivers/hwmon/pmbus/Kconfig   |  2 +-
 drivers/hwmon/pmbus/adm1275.c | 46 +++++++++++++++++++++++++++++------
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/drivers/hwmon/pmbus/Kconfig b/drivers/hwmon/pmbus/Kconfig
index 8f4bff375ecb..b3c27f3b2712 100644
--- a/drivers/hwmon/pmbus/Kconfig
+++ b/drivers/hwmon/pmbus/Kconfig
@@ -52,7 +52,7 @@ config SENSORS_ADM1275
 	help
 	  If you say yes here you get hardware monitoring support for Analog
 	  Devices ADM1075, ADM1272, ADM1273, ADM1275, ADM1276, ADM1278, ADM1281,
-	  ADM1293, ADM1294 and SQ24905C Hot-Swap Controller and
+	  ADM1293, ADM1294, ROHM BD12780, and SQ24905C Hot-Swap Controller and
 	  Digital Power Monitors.
 
 	  This driver can also be built as a module. If so, the module will
diff --git a/drivers/hwmon/pmbus/adm1275.c b/drivers/hwmon/pmbus/adm1275.c
index bc2a6a07dc3e..838b8827eb76 100644
--- a/drivers/hwmon/pmbus/adm1275.c
+++ b/drivers/hwmon/pmbus/adm1275.c
@@ -19,7 +19,7 @@
 #include "pmbus.h"
 
 enum chips { adm1075, adm1272, adm1273, adm1275, adm1276, adm1278, adm1281,
-	 adm1293, adm1294, sq24905c };
+	 adm1293, adm1294, bd12780, sq24905c };
 
 #define ADM1275_MFR_STATUS_IOUT_WARN2	BIT(0)
 #define ADM1293_MFR_STATUS_VAUX_UV_WARN	BIT(5)
@@ -47,6 +47,8 @@ enum chips { adm1075, adm1272, adm1273, adm1275, adm1276, adm1278, adm1281,
 #define ADM1278_VOUT_EN			BIT(1)
 
 #define ADM1278_PMON_DEFCONFIG		(ADM1278_VOUT_EN | ADM1278_TEMP1_EN | ADM1278_TSFILT)
+/* The BD12780 data sheets mark TSFILT bit as reserved. */
+#define BD12780_PMON_DEFCONFIG		(ADM1278_VOUT_EN | ADM1278_TEMP1_EN)
 
 #define ADM1293_IRANGE_25		0
 #define ADM1293_IRANGE_50		BIT(6)
@@ -487,6 +489,21 @@ static const struct i2c_device_id adm1275_id[] = {
 	{ "adm1281", adm1281 },
 	{ "adm1293", adm1293 },
 	{ "adm1294", adm1294 },
+	/*
+	 * The BD12780A is functionally identical to the BD12780(*). Even the
+	 * PMBus ID register contents are the same. When instantiated from the
+	 * DT, the BD12780A must list the bd12780 as a fall-back. The bd12780a
+	 * ID is still needed here, because the i2c_device_id is derived from
+	 * the first compatible, not from the fall-back entry.
+	 * (*) Until proven to differ, which is why the variants have their own
+	 * compatibles. Note that even though the probe is called based on the
+	 * 'bd12780a' entry, the ID is picked at probe time from the PMBus
+	 * register contents and not from the DT entry. Thus, if the bd12780
+	 * and bd12780a are found to require different handling, this needs
+	 * to be changed; until then the bd12780a is handled as a bd12780.
+	 */
+	{ "bd12780", bd12780 },
+	{ "bd12780a", /* driver data unused, see --^ */ },
 	{ "mc09c", sq24905c },
 	{ }
 };
@@ -494,12 +511,13 @@ MODULE_DEVICE_TABLE(i2c, adm1275_id);
 
 /* Enable VOUT & TEMP1 if not enabled (disabled by default) */
 static int adm1275_enable_vout_temp(struct adm1275_data *data,
-				    struct i2c_client *client, int config)
+				    struct i2c_client *client, int config,
+				    u16 defconfig)
 {
 	int ret;
 
-	if ((config & ADM1278_PMON_DEFCONFIG) != ADM1278_PMON_DEFCONFIG) {
-		config |= ADM1278_PMON_DEFCONFIG;
+	if ((config & defconfig) != defconfig) {
+		config |= defconfig;
 		ret = adm1275_write_pmon_config(data, client, config);
 		if (ret < 0) {
 			dev_err(&client->dev, "Failed to enable VOUT/TEMP1 monitoring\n");
@@ -535,7 +553,8 @@ static int adm1275_probe(struct i2c_client *client)
 		return ret;
 	}
 	if ((ret != 3 || strncmp(block_buffer, "ADI", 3)) &&
-	    (ret != 2 || strncmp(block_buffer, "SY", 2))) {
+	    (ret != 2 || strncmp(block_buffer, "SY", 2)) &&
+	    (ret != 4 || strncmp(block_buffer, "ROHM", 4))) {
 		dev_err(&client->dev, "Unsupported Manufacturer ID\n");
 		return -ENODEV;
 	}
@@ -562,7 +581,7 @@ static int adm1275_probe(struct i2c_client *client)
 	if (mid->driver_data == adm1272 || mid->driver_data == adm1273 ||
 	    mid->driver_data == adm1278 || mid->driver_data == adm1281 ||
 	    mid->driver_data == adm1293 || mid->driver_data == adm1294 ||
-	    mid->driver_data == sq24905c)
+	    mid->driver_data == bd12780 || mid->driver_data == sq24905c)
 		config_read_fn = i2c_smbus_read_word_data;
 	else
 		config_read_fn = i2c_smbus_read_byte_data;
@@ -666,7 +685,8 @@ static int adm1275_probe(struct i2c_client *client)
 			PMBUS_HAVE_VOUT | PMBUS_HAVE_STATUS_VOUT |
 			PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
 
-		ret = adm1275_enable_vout_temp(data, client, config);
+		ret = adm1275_enable_vout_temp(data, client, config,
+					       ADM1278_PMON_DEFCONFIG);
 		if (ret)
 			return ret;
 
@@ -712,7 +732,16 @@ static int adm1275_probe(struct i2c_client *client)
 		break;
 	case adm1278:
 	case adm1281:
+	case bd12780:
 	case sq24905c:
+	{
+		u16 defconfig;
+
+		if (data->id == bd12780)
+			defconfig = BD12780_PMON_DEFCONFIG;
+		else
+			defconfig = ADM1278_PMON_DEFCONFIG;
+
 		data->have_vout = true;
 		data->have_pin_max = true;
 		data->have_temp_max = true;
@@ -728,13 +757,14 @@ static int adm1275_probe(struct i2c_client *client)
 			PMBUS_HAVE_VOUT | PMBUS_HAVE_STATUS_VOUT |
 			PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
 
-		ret = adm1275_enable_vout_temp(data, client, config);
+		ret = adm1275_enable_vout_temp(data, client, config, defconfig);
 		if (ret)
 			return ret;
 
 		if (config & ADM1278_VIN_EN)
 			info->func[0] |= PMBUS_HAVE_VIN;
 		break;
+	}
 	case adm1293:
 	case adm1294:
 		data->have_iout_min = true;
-- 
2.54.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* [PATCH 2/7] doc: Add ROHM BD12780 and BD12780A
From: Matti Vaittinen @ 2026-06-16  6:36 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

From: Matti Vaittinen <mazziesaccount@gmail.com>

Add the ROHM BD12780 and the BD12780A to the list of the ICs supported by
the adm1275 driver.

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
---
 Documentation/hwmon/adm1275.rst | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/Documentation/hwmon/adm1275.rst b/Documentation/hwmon/adm1275.rst
index cf923f20fa52..8a793dd2b412 100644
--- a/Documentation/hwmon/adm1275.rst
+++ b/Documentation/hwmon/adm1275.rst
@@ -67,6 +67,22 @@ Supported chips:
 
     Datasheet: https://www.analog.com/media/en/technical-documentation/data-sheets/ADM1293_1294.pdf
 
+  * ROHM Semiconductor BD12780
+
+    Prefix: 'bd12780'
+
+    Addresses scanned: -
+
+    Datasheet: https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780muv-lb-e.pdf
+
+  * ROHM Semiconductor BD12780A
+
+    Prefix: 'bd12780'
+
+    Addresses scanned: -
+
+    Datasheet: https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780amuv-lb-e.pdf
+
   * Silergy SQ24905C
 
     Prefix: 'mc09c'
-- 
2.54.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* [PATCH 1/7] dt-bindings: adm1275: ROHM BD12780 hot-swap controller
From: Matti Vaittinen @ 2026-06-16  6:35 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc
In-Reply-To: <cover.1781591132.git.mazziesaccount@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2524 bytes --]

From: Matti Vaittinen <mazziesaccount@gmail.com>

Support the ROHM BD12780 and BD12780A hot-swap controllers, which are
largely compatible with the Analog Devices ADM1278. The main difference
between the two is that the BD12780 has one more I2C address
configuration pin (ADDR3) than the BD12780A.

Introduce separate compatibles for both variants but require the BD12780A
to always have the BD12780 as a fall-back.

Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
---
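As a rough model of what the compatible schema is meant to accept, the
fall-back rule can be sketched in plain C as below. This assumes the first
oneOf branch is intended to take exactly one compatible string from the
list. compatible_list_ok() is a hypothetical helper written for this
illustration and is not part of dt-schema or the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compatibles that may appear on their own (first oneOf branch). */
static const char * const single_compatibles[] = {
	"adi,adm1075", "adi,adm1272", "adi,adm1273", "adi,adm1275",
	"adi,adm1276", "adi,adm1278", "adi,adm1281", "adi,adm1293",
	"adi,adm1294", "rohm,bd12780", "silergy,mc09c",
};

/* Check a node's compatible list against the two accepted shapes. */
static bool compatible_list_ok(const char * const *compat, size_t n)
{
	size_t i;

	if (n == 1) {
		for (i = 0; i < sizeof(single_compatibles) /
				sizeof(single_compatibles[0]); i++)
			if (!strcmp(compat[0], single_compatibles[i]))
				return true;
		return false;
	}

	/* The BD12780A is only valid when followed by its fall-back. */
	if (n == 2)
		return !strcmp(compat[0], "rohm,bd12780a") &&
		       !strcmp(compat[1], "rohm,bd12780");

	return false;
}
```

Under this model "rohm,bd12780a" alone is rejected, as is the pair in
reversed order.
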
 .../bindings/hwmon/adi,adm1275.yaml           | 39 +++++++++++++------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
index d6a7517f2a50..bc67510ef3ab 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
@@ -25,19 +25,35 @@ description: |
     https://www.silergy.com/
     download/downloadFile?id=5669&type=product&ftype=note
 
+  The BD12780 and BD12780A are hot-swap controllers from ROHM. They are
+  functionally compatible with the ADM1278. The main difference between
+  the BD12780A and the BD12780 is the number of configurable I2C addresses.
+
+  Datasheets:
+    https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780muv-lb-e.pdf
+    https://fscdn.rohm.com/en/products/databook/datasheet/ic/power/power_switch/bd12780amuv-lb-e.pdf
+
 properties:
   compatible:
-    enum:
-      - adi,adm1075
-      - adi,adm1272
-      - adi,adm1273
-      - adi,adm1275
-      - adi,adm1276
-      - adi,adm1278
-      - adi,adm1281
-      - adi,adm1293
-      - adi,adm1294
-      - silergy,mc09c
+    oneOf:
+      - items:
+          enum:
+            - adi,adm1075
+            - adi,adm1272
+            - adi,adm1273
+            - adi,adm1275
+            - adi,adm1276
+            - adi,adm1278
+            - adi,adm1281
+            - adi,adm1293
+            - adi,adm1294
+            - rohm,bd12780
+            - silergy,mc09c
+
+# Require BD12780 as a fall-back for BD12780A.
+      - items:
+          - const: rohm,bd12780a
+          - const: rohm,bd12780
 
   reg:
     maxItems: 1
@@ -104,6 +120,7 @@ allOf:
               - adi,adm1281
               - adi,adm1293
               - adi,adm1294
+              - rohm,bd12780
               - silergy,mc09c
     then:
       properties:
-- 
2.54.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* [PATCH 0/7] Support ROHM BD127x0 hot-swap controllers
From: Matti Vaittinen @ 2026-06-16  6:33 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, Wensheng Wang, Ashish Yadav,
	Matti Vaittinen, Kim Seer Paller, Cedric Encarnacion,
	Chris Packham, Yuxi Wang, Charles Hsu, ChiShih Tsai, linux-hwmon,
	devicetree, linux-kernel, linux-doc

[-- Attachment #1: Type: text/plain, Size: 1427 bytes --]

Support ROHM BD12780(A) and BD12790

The BD12780 and BD12780A hot-swap controllers are very similar to Analog
Devices ADM1278. There are only some minor differences in the registers.

The BD12790 is largely similar to the ADM1272, with slightly different
coefficients and minor register changes.

This series adds basic support for these ROHM ICs.

The last patch adds an of_device_id table with entries for the newly added
controllers. This fixes module auto-loading on the test board with an old
Debian user-space.

I have no idea whether adding of_device_id entries for the other ICs could
cause problems on some existing systems. Hence only the new ICs were added
to the of_device_id table.

---

Matti Vaittinen (7):
  dt-bindings: adm1275: ROHM BD12780 hot-swap controller
  doc: Add ROHM BD12780 and BD12780A
  hwmon: adm1275: Support ROHM BD12780
  dt-bindings: adm1275: ROHM BD12790 hot-swap controller
  doc: adm1275: Add ROHM BD12790
  hwmon: adm1275: Support ROHM BD12790
  hwmon: adm1275: Support module auto-loading

 .../bindings/hwmon/adi,adm1275.yaml           | 44 ++++++---
 Documentation/hwmon/adm1275.rst               | 24 +++++
 drivers/hwmon/pmbus/Kconfig                   |  4 +-
 drivers/hwmon/pmbus/adm1275.c                 | 91 +++++++++++++++++--
 4 files changed, 142 insertions(+), 21 deletions(-)

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.54.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v5 3/4] PCI: endpoint: Add support for DOE initialization and setup in EPC core
From: Manivannan Sadhasivam @ 2026-06-16  6:22 UTC (permalink / raw)
  To: Aksh Garg
  Cc: Bjorn Helgaas, linux-pci, linux-doc, kwilczynski, bhelgaas,
	corbet, kishon, skhan, lukas, cassel, alistair, linux-arm-kernel,
	linux-kernel, s-vadapalli, danishanwar, srk
In-Reply-To: <0216a528-3737-4714-b9d1-5d28008e0ec8@ti.com>

On Fri, Jun 12, 2026 at 01:54:13PM +0530, Aksh Garg wrote:
> 
> 
> On 12/06/26 00:42, Bjorn Helgaas wrote:
> > On Wed, Jun 10, 2026 at 03:32:55PM +0530, Aksh Garg wrote:
> > > Add pci_epc_init_capabilities() in EPC core driver to initialize and
> > > setup the capabilities supported by the EPC driver. This calls
> > > pci_epc_doe_setup() to setup the DOE framework for an endpoint controller,
> > > which discovers the DOE capabilities (extended capability ID 0x2E), and
> > > registers each discovered DOE mailbox for all the functions in the
> > > endpoint controller.
> > > 
> > > Add pci_epc_deinit_capabilities() in EPC core driver for cleanup of the
> > > resources used by the capabilities of the EPC driver. This calls
> > > pci_ep_doe_destroy() to destroy all DOE mailboxes and free associated
> > > resources.
> > > 
> > > Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> > > Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
> > > Signed-off-by: Aksh Garg <a-garg7@ti.com>
> > > ---
> > > +/**
> > > + * pci_epc_doe_setup() - Discover and setup DOE mailboxes for all functions
> > > + * @epc: the EPC device on which DOE mailboxes has to be setup
> > > + *
> > > + * Discover DOE (Data Object Exchange) capabilities for all physical functions
> > > + * in the endpoint controller and register DOE mailboxes.
> > > + *
> > > + * Returns: 0 on success, -errno on failure
> > > + */
> > > +static int pci_epc_doe_setup(struct pci_epc *epc)
> > > +{
> > > +	u8 func_no, vfunc_no = 0;
> > > +	u16 cap_offset;
> > > +	int ret;
> > > +
> > > +	if (!epc->ops || !epc->ops->find_ext_capability)
> > > +		return -EINVAL;
> > 
> 
> Hi Bjorn,
> 
> Thank you for your feedback comments. I will work on them and post v6
> series incorporating the changes.
> 
> > I don't see anything that sets pci_epc_ops.find_ext_capability in this
> > series, so this looks currently unused and untestable, so likely not
> > mergeable as-is.  What's the plan for users of this?
> > 
> 
> Currently there is no EPC driver upstream which supports DOE yet. However, I
> am working on a platform which supports DOE (support for
> which would be added soon). Mani pointed out that if EPC driver support
> for the same is guaranteed to be added soon, the APIs can be merged
> first.
> 
> For the demonstration purpose, he asked to show how an EPC driver is
> expected to use the API as a snippet in the cover letter itself.
> 

I retract my previous comment here. Let's not introduce dead code in the kernel.
We can review the series now, but cannot merge it until the EPC driver gets
submitted.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply

* Re: [RFC V2 3/3] mm: Replace pgtable entry prints with new format
From: Hugh Dickins @ 2026-06-16  6:19 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Hugh Dickins, Anshuman Khandual, linux-mm, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Petr Mladek, Steven Rostedt,
	Jonathan Corbet, Andrew Morton, linux-kernel, linux-doc,
	Lorenzo Stoakes
In-Reply-To: <dabfd73b-d872-4267-9a40-45463fe146ac@kernel.org>

On Mon, 15 Jun 2026, David Hildenbrand (Arm) wrote:
> On 6/12/26 23:26, Hugh Dickins wrote:
> > On Fri, 12 Jun 2026, David Hildenbrand (Arm) wrote:
> > ...
> >>
> >> After some off-list discussion, I wonder if we can make our life easier.
> >>
> >> I think, even with your patch, there is still the case:
> >>
> >> pr_alert("BUG: Bad page map in process %s  %s:%08llx", current->comm,
> >> 	 pgtable_level_to_str(level), entry);
> >>
> >> Where we cast all entries to an "unsigned long" in the callers. We'd have to rework all
> >> that for 128bit entries either way (passing them in some struct instead).
> >>
> >> I really just extended what we used to do here in print_bad_pte() before commit ec63a44011d.
> >>
> >> Maybe we should just drop the "print the involved page table entries" thing?
> >>
> >> I mean, we do have the actual page, and we do have the address in the address space, which
> >> we all print.
> >>
> >> Not sure if the actual page table entries are that relevant?
> > 
> > The page table entry is BUGgily Bad: we want to see what it looks like
> > (sometimes, a sequence of bad page map entries may even show up as ASCII).
> 
> But is printing raw page table entries really what we want? I guess to detect
> "random corruption" it might help sometimes.

Yes, that's what it's for. What we really want is to understand what went
wrong: that's too much to ask of a printk, but it can give us a good clue.

> 
> And do we really need information about the full page table walk, or is the last
> level good enough?

Page table entry and pmd entry are good enough: higher levels got
added at some stage, but they are unlikely to be useful here.

Hugh

^ permalink raw reply

* Re: [PATCH v5 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
From: Hao Ge @ 2026-06-16  6:18 UTC (permalink / raw)
  To: Abhishek Bapat
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Suren Baghdasaryan, Andrew Morton, Kent Overstreet
In-Reply-To: <1a68864b1493528abed0e9e3489688fc6287c37e.1781564384.git.abhishekbapat@google.com>

Hi Abhishek


On 2026/6/16 07:04, Abhishek Bapat wrote:
> Add the following 2 scenarios to the allocinfo ioctl kselftest:
> 1. Validate size based filtering
> 2. Validate lineno based filtering
>
> The first test uses "do_init_module" as the candidate function for the
> test. This is because the associated site will only allocate memory when
> a kernel module is loaded. The return value of get_content_id() changes
> every time modules are loaded or unloaded. Hence, as long as
> get_content_id() values at the start and the end of the test are the
> same, the memory allocated by the do_init_module call site should also
> remain the same. Consequently, the test can assume consistency between
> the value returned by the ioctl and the procfs resulting in less
> flakiness.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> ---
>   .../alloc_tag/allocinfo_ioctl_test.c          | 197 +++++++++++++++++-
>   1 file changed, 196 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> index 62d5a488a04d..041fee1a3d74 100644
> --- a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> +++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> @@ -309,11 +309,194 @@ static int test_function_filter(void)
>   	return run_filter_test(&filter);
>   }
>   
> +static int test_size_filter(void)
> +{
> +	int fd;
> +	struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> +	struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> +	struct allocinfo_filter filter;
> +	int ret = KSFT_PASS;
> +	__u64 target_size, i, pos;
> +	bool found;
> +	const char *target_function = "do_init_module";
> +	struct allocinfo_content_id start_cont_id, end_cont_id;
> +	int retry = 0;
> +	const int max_retries = 10;
> +
> +	if (!tags || !procfs_entries) {
> +		ksft_print_msg("Memory allocation failed.\n");
> +		ret = KSFT_FAIL;
> +		goto freemem;
> +	}
> +
> +	fd = open(ALLOCINFO_PROC, O_RDONLY);
> +	if (fd < 0) {
> +		ksft_print_msg("Failed to open " ALLOCINFO_PROC ": %s\n", strerror(errno));


I see. The #include <errno.h> you added in patch 5 is meant for this
spot, right?

If that's the case, I'd prefer moving the #include <errno.h> addition
into this patch, though this is a trivial detail either way.


Thanks

Best Regards

Hao


> +		ret = KSFT_FAIL;
> +		goto freemem;
> +	}
> +
> +	do {
> +		found = false;
> +		pos = 0;
> +
> +		if (__allocinfo_get_content_id(fd, &start_cont_id)) {
> +			ksft_print_msg("allocinfo_get_content_id failed\n");
> +			ret = KSFT_FAIL;
> +			goto exit;
> +		}
> +
> +		memset(&filter, 0, sizeof(filter));
> +		filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
> +		strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
> +
> +		if (get_filtered_procfs_entries(procfs_entries, &filter)) {
> +			ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> +			ret = KSFT_FAIL;
> +			goto exit;
> +		}
> +
> +		if (procfs_entries->count == 0) {
> +			ksft_print_msg("Function %s not found in procfs\n", target_function);
> +			ret = KSFT_SKIP;
> +			goto exit;
> +		}
> +
> +		target_size = procfs_entries->tag[0].counter.bytes;
> +
> +		memset(&filter, 0, sizeof(filter));
> +		filter.mask |= ALLOCINFO_FILTER_MASK_MIN_SIZE | ALLOCINFO_FILTER_MASK_MAX_SIZE;
> +		filter.min_size = target_size;
> +		filter.max_size = target_size;
> +
> +		while (1) {
> +			struct allocinfo_get_at get_at_params;
> +
> +			memset(&get_at_params, 0, sizeof(get_at_params));
> +			memcpy(&get_at_params.filter, &filter, sizeof(filter));
> +			get_at_params.pos = pos;
> +
> +			if (__allocinfo_get_at(fd, &get_at_params))
> +				break;
> +
> +			tags->count = 0;
> +			memcpy(&tags->tag[tags->count++], &get_at_params.data,
> +			       sizeof(get_at_params.data));
> +
> +			while (tags->count < VEC_MAX_ENTRIES &&
> +			       __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
> +				tags->count++;
> +
> +			for (i = 0; i < tags->count; i++) {
> +				if (strcmp(tags->tag[i].tag.function, target_function) == 0) {
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (found || tags->count < VEC_MAX_ENTRIES)
> +				break;
> +
> +			pos += tags->count;
> +		}
> +
> +		if (__allocinfo_get_content_id(fd, &end_cont_id)) {
> +			ksft_print_msg("allocinfo_get_content_id failed\n");
> +			ret = KSFT_FAIL;
> +			goto exit;
> +		}
> +
> +		if (start_cont_id.id == end_cont_id.id)
> +			break;
> +
> +		ksft_print_msg("Module load detected during size verification, retrying...\n");
> +	} while (retry++ < max_retries);
> +
> +	if (start_cont_id.id == end_cont_id.id && !found) {
> +		ksft_print_msg("Entry with function %s not found in IOCTL results\n",
> +			       target_function);
> +		ret = KSFT_FAIL;
> +	} else if (start_cont_id.id != end_cont_id.id) {
> +		ksft_print_msg("Failed to match content_ids for procfs and IOCTL, skipping...\n");
> +		ret = KSFT_SKIP;
> +	}
> +
> +exit:
> +	close(fd);
> +freemem:
> +	free(tags);
> +	free(procfs_entries);
> +	return ret;
> +}
> +
> +static int test_lineno_filter(void)
> +{
> +	struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> +	struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> +	struct allocinfo_filter filter;
> +	enum ioctl_ret ioctl_status;
> +	int ret = KSFT_PASS;
> +	__u64 target_lineno, i;
> +
> +	if (!tags || !procfs_entries) {
> +		ksft_print_msg("Memory allocation failed.\n");
> +		ret = KSFT_FAIL;
> +		goto exit;
> +	}
> +
> +	memset(&filter, 0, sizeof(filter));
> +
> +	if (get_filtered_procfs_entries(procfs_entries, &filter)) {
> +		ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> +		ret = KSFT_FAIL;
> +		goto exit;
> +	}
> +	if (procfs_entries->count == 0) {
> +		ksft_print_msg("Could not retrieve procfs entries\n");
> +		ret = KSFT_SKIP;
> +		goto exit;
> +	}
> +	/*
> +	 * We depend on the result of procfs entries to create the ioctl_filter. Hence we
> +	 * cannot recycle the run_filter_test function here.
> +	 */
> +	target_lineno = procfs_entries->tag[0].tag.lineno;
> +
> +	filter.mask |= ALLOCINFO_FILTER_MASK_LINENO;
> +	filter.fields.lineno = target_lineno;
> +
> +	ioctl_status = get_filtered_ioctl_entries(tags, &filter, 0);
> +	if (ioctl_status == IOCTL_INVALID_DATA) {
> +		ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
> +		ret = KSFT_SKIP;
> +		goto exit;
> +	}
> +	if (ioctl_status == IOCTL_FAILURE) {
> +		ksft_print_msg("Error retrieving IOCTL entries.\n");
> +		ret = KSFT_FAIL;
> +		goto exit;
> +	}
> +
> +	for (i = 0; i < tags->count; i++) {
> +		if (tags->tag[i].tag.lineno != target_lineno) {
> +			ksft_print_msg("IOCTL entry %llu has incorrect lineno %llu.\n",
> +				       i, tags->tag[i].tag.lineno);
> +			ret = KSFT_FAIL;
> +			goto exit;
> +		}
> +	}
> +
> +exit:
> +	free(tags);
> +	free(procfs_entries);
> +	return ret;
> +}
> +
>   int main(int argc, char *argv[])
>   {
>   	int ret;
>   
> -	ksft_set_plan(2);
> +	ksft_set_plan(4);
>   
>   	ret = test_filename_filter();
>   	if (ret == KSFT_SKIP)
> @@ -327,5 +510,17 @@ int main(int argc, char *argv[])
>   	else
>   		ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
>   
> +	ret = test_size_filter();
> +	if (ret == KSFT_SKIP)
> +		ksft_test_result_skip("Skipping test_size_filter\n");
> +	else
> +		ksft_test_result(ret == KSFT_PASS, "test_size_filter\n");
> +
> +	ret = test_lineno_filter();
> +	if (ret == KSFT_SKIP)
> +		ksft_test_result_skip("Skipping test_lineno_filter\n");
> +	else
> +		ksft_test_result(ret == KSFT_PASS, "test_lineno_filter\n");
> +
>   	ksft_finished();
>   }
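
For reference, the content-id consistency scheme described in the commit
message can be reduced to the generic retry pattern sketched below. This
is only a model: read_stable(), the callback types, and the fake_src
stand-ins are hypothetical names for this sketch and do not exist in the
selftest or the allocinfo UAPI.

```c
#include <stdint.h>

enum { SNAPSHOT_OK, SNAPSHOT_UNSTABLE, SNAPSHOT_ERROR };

typedef int (*get_id_fn)(void *ctx, uint64_t *id);
typedef int (*read_fn)(void *ctx);

/*
 * Sample the content id, read the data, then sample the id again. The
 * data is trusted only if both ids match. Otherwise retry, for at most
 * max_retries + 1 attempts in total (same loop shape as in the patch).
 */
static int read_stable(void *ctx, get_id_fn get_id, read_fn do_read,
		       int max_retries)
{
	uint64_t start_id, end_id;
	int retry = 0;

	do {
		if (get_id(ctx, &start_id) || do_read(ctx) ||
		    get_id(ctx, &end_id))
			return SNAPSHOT_ERROR;
		if (start_id == end_id)
			return SNAPSHOT_OK;
	} while (retry++ < max_retries);

	return SNAPSHOT_UNSTABLE;
}

/* Hypothetical stand-in: the id changes during the first few reads. */
struct fake_src {
	uint64_t id;
	int changes_left;
};

static int fake_get_id(void *ctx, uint64_t *id)
{
	*id = ((struct fake_src *)ctx)->id;
	return 0;
}

static int fake_read(void *ctx)
{
	struct fake_src *s = ctx;

	/* Simulate a module load racing with the read. */
	if (s->changes_left > 0) {
		s->changes_left--;
		s->id++;
	}
	return 0;
}
```

In this model SNAPSHOT_UNSTABLE corresponds to the KSFT_SKIP path, since
the comparison between procfs and the ioctl never got a stable window.
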

^ permalink raw reply

* [linuxsecuritymodule:main 1/1] htmldocs: Warning: README.md references a file that doesn't exist: Documentation/API/obsolete
From: kernel test robot @ 2026-06-16  6:06 UTC (permalink / raw)
  To: Paul Moore; +Cc: oe-kbuild-all, linux-doc

tree:   https://github.com/LinuxSecurityModule/kernel main
head:   fac1e95155f2b940a514438e0b25c0fd543207d7
commit: fac1e95155f2b940a514438e0b25c0fd543207d7 [1/1] lsm: add a LSM specific README.md and SECURITY.md
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project f43d6834093b19baf79beda8c0337ab020ac5f17)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260616/202606160846.O1FzSF0E-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606160846.O1FzSF0E-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Warning: Documentation/translations/zh_CN/how-to.rst references a file that doesn't exist: Documentation/xxx/xxx.rst
   Warning: Documentation/translations/zh_CN/networking/xfrm_proc.rst references a file that doesn't exist: Documentation/networking/xfrm_proc.rst
   Warning: Documentation/translations/zh_CN/scsi/scsi_mid_low_api.rst references a file that doesn't exist: Documentation/Configure.help
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/ABI/testing/sysfs-platform-ayaneo
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/display/bridge/megachips-stdpxxxx-ge-b850v3-fw.txt
>> Warning: README.md references a file that doesn't exist: Documentation/API/obsolete
>> Warning: README.md references a file that doesn't exist: Documentation/API/obsolete
>> Warning: README.md references a file that doesn't exist: Documentation/API/obsolete
   Warning: arch/powerpc/sysdev/mpic.c references a file that doesn't exist: Documentation/devicetree/bindings/powerpc/fsl/mpic.txt
   Warning: drivers/net/ethernet/smsc/Kconfig references a file that doesn't exist: file:Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
   Warning: rust/kernel/sync/atomic/ordering.rs references a file that doesn't exist: srctree/tools/memory-model/Documentation/explanation.txt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: Documentation/virtual/lguest/lguest.c
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,\b(\S*)(Documentation/[A-Za-z0-9

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v5 5/6] kselftest: alloc_tag: add kselftest for ioctl interface
From: Hao Ge @ 2026-06-16  6:01 UTC (permalink / raw)
  To: Abhishek Bapat
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Suren Baghdasaryan, Andrew Morton, Kent Overstreet
In-Reply-To: <4d9c8dc420e2019e1fce62ecf31ec40933fce3a8.1781564384.git.abhishekbapat@google.com>

Hi Abhishek


On 2026/6/16 07:04, Abhishek Bapat wrote:
> Introduce a kselftest to verify the new IOCTL-based interface for
> /proc/allocinfo. The test covers:
>
> 1. Validation of the filename filter.
> 2. Validation of the function filter.
>
> The first test validates the functionality of the filename filter. Using
> "mm/memory.c" as the candidate filename filter, it retrieves filtered
> entries from both procfs and ioctl and matches the first VEC_MAX_ENTRIES
> entries.
>
> The second test validates the functionality of the function filter.
> It uses "dup_mm" as the candidate function as we do not expect this
> function name to change frequently and hence won't be needing to modify
> this test often.
>
> Note that both the tests match line no, function name and file name
> fields. Bytes allocated and calls are not matched as those values may
> change in the time when the data is being read from procfs and ioctl and
> hence can lead to false negatives.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> ---
>   MAINTAINERS                                   |   1 +
>   tools/testing/selftests/alloc_tag/Makefile    |   9 +
>   .../alloc_tag/allocinfo_ioctl_test.c          | 331 ++++++++++++++++++
>   3 files changed, 341 insertions(+)
>   create mode 100644 tools/testing/selftests/alloc_tag/Makefile
>   create mode 100644 tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 019cc4c285a3..6610dd42e484 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16715,6 +16715,7 @@ F:	include/linux/alloc_tag.h
>   F:	include/linux/pgalloc_tag.h
>   F:	include/uapi/linux/alloc_tag.h
>   F:	lib/alloc_tag.c
> +F:	tools/testing/selftests/alloc_tag/
>   
>   MEMORY CONTROLLER DRIVERS
>   M:	Krzysztof Kozlowski <krzk@kernel.org>
> diff --git a/tools/testing/selftests/alloc_tag/Makefile b/tools/testing/selftests/alloc_tag/Makefile
> new file mode 100644
> index 000000000000..f2b8fc022c3b
> --- /dev/null
> +++ b/tools/testing/selftests/alloc_tag/Makefile
> @@ -0,0 +1,9 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +TEST_GEN_PROGS := allocinfo_ioctl_test
> +
> +CFLAGS += -Wall
> +CFLAGS += -I../../../../usr/include


I think we should replace -I../../../../usr/include with $(KHDR_INCLUDES).


> +
> +include ../lib.mk
> +


We should also add an alloc_tag entry to TARGETS in the parent directory's
Makefile, so that the alloc_tag test is built when running make within
/home/linux/tools/testing/selftests, like this:

--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -1,5 +1,6 @@
  # SPDX-License-Identifier: GPL-2.0
  TARGETS += acct
+TARGETS += alloc_tag
  TARGETS += alsa
  TARGETS += amd-pstate
  TARGETS += arm64

Below is the relevant build log:

[root@localhost selftests]# make -j8
   CC       acct_syscall
   CC       allocinfo_ioctl_test

Below is the log from running make clean:

[root@localhost selftests]# make clean
rm -f -r /home/linux/tools/testing/selftests/acct/acct_syscall
rm -f -r /home/linux/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test


(Sorry about the inconvenience. I had limited experience with the
kselftest framework, so I reached out to my teammates to sort out the
relevant build logic. It's my fault for not catching this issue in prior
reviews and forcing you to send a revised patch.)


> diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> new file mode 100644
> index 000000000000..62d5a488a04d
> --- /dev/null
> +++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> @@ -0,0 +1,331 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/* kselftest for allocinfo ioctl
> + * allocinfo ioctl retrieves allocinfo data through ioctl
> + * Copyright (C) 2026 Google, Inc.
> + */
> +
> +#include <errno.h>


errno.h is unused and may be dropped.


> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <stdbool.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <linux/types.h>
> +#include <linux/alloc_tag.h>
> +#include "../kselftest.h"
> +
> +#define MAX_LINE_LEN		512
> +#define ALLOCINFO_PROC		"/proc/allocinfo"
> +
> +enum ioctl_ret {
> +	IOCTL_SUCCESS = 0,
> +	IOCTL_FAILURE = 1,
> +	IOCTL_INVALID_DATA = 2,
> +};
> +
> +#define VEC_MAX_ENTRIES 32
> +
> +struct allocinfo_tag_data_vec {
> +	struct allocinfo_tag_data tag[VEC_MAX_ENTRIES];
> +	__u64 count;
> +};
> +
> +static inline int __allocinfo_get_content_id(int dev_fd, struct allocinfo_content_id *params)
> +{
> +	return ioctl(dev_fd, ALLOCINFO_IOC_CONTENT_ID, params);
> +}
> +
> +static inline int __allocinfo_get_at(int dev_fd, struct allocinfo_get_at *params)
> +{
> +	return ioctl(dev_fd, ALLOCINFO_IOC_GET_AT, params);
> +}
> +
> +static inline int __allocinfo_get_next(int dev_fd, struct allocinfo_tag_data *params)
> +{
> +	return ioctl(dev_fd, ALLOCINFO_IOC_GET_NEXT, params);
> +}
> +
> +static bool match_entry(const struct allocinfo_tag_data *procfs_entry,
> +			const struct allocinfo_tag_data *tag_data,
> +			bool match_bytes, bool match_calls, bool match_lineno,
> +			bool match_function, bool match_filename)
> +{
> +	if (match_bytes && tag_data->counter.bytes != procfs_entry->counter.bytes) {
> +		ksft_print_msg("size retrieved through ioctl does not match procfs\n");
> +		return false;
> +	}
> +
> +	if (match_calls && tag_data->counter.calls != procfs_entry->counter.calls) {
> +		ksft_print_msg("call count retrieved through ioctl does not match procfs\n");
> +		return false;
> +	}
> +
> +	if (match_lineno && tag_data->tag.lineno != procfs_entry->tag.lineno) {
> +		ksft_print_msg("lineno retrieved through ioctl does not match procfs\n");
> +		return false;
> +	}
> +
> +	if (match_function &&
> +	    strncmp(tag_data->tag.function, procfs_entry->tag.function, ALLOCINFO_STR_SIZE)) {
> +		ksft_print_msg("function retrieved through ioctl does not match procfs\n");
> +		return false;
> +	}
> +
> +	if (match_filename &&
> +	    strncmp(tag_data->tag.filename, procfs_entry->tag.filename, ALLOCINFO_STR_SIZE)) {
> +		ksft_print_msg("filename retrieved through ioctl does not match procfs\n");
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static bool match_entries(const struct allocinfo_tag_data_vec *procfs_entries,
> +			  const struct allocinfo_tag_data_vec *tags,
> +			  bool match_bytes, bool match_calls, bool match_lineno,
> +			  bool match_function, bool match_filename)
> +{
> +	__u64 i;
> +
> +	if (procfs_entries->count != tags->count) {
> +		ksft_print_msg("Entry count mismatch. ioctl entries: %llu, proc entries: %llu\n",
> +			       tags->count, procfs_entries->count);
> +		return false;
> +	}
> +	for (i = 0; i < procfs_entries->count; i++) {
> +		if (!match_entry(&procfs_entries->tag[i], &tags->tag[i],
> +				 match_bytes, match_calls, match_lineno,
> +				 match_function, match_filename)) {
> +			ksft_print_msg("%lluth entry does not match.\n", i);
> +			return false;
> +		}
> +	}
> +	return true;
> +}
> +
> +static const char *allocinfo_str(const char *str)
> +{
> +	size_t len = strlen(str);
> +
> +	if (len >= ALLOCINFO_STR_SIZE)
> +		str += (len - ALLOCINFO_STR_SIZE) + 1;
> +	return str;
> +}
> +
> +static void allocinfo_copy_str(char *dest, const char *src)
> +{
> +	strncpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE - 1);
> +	dest[ALLOCINFO_STR_SIZE - 1] = '\0';
> +}
> +
> +static int get_filtered_procfs_entries(struct allocinfo_tag_data_vec *procfs_entries,
> +				       const struct allocinfo_filter *filter)
> +{
> +	FILE *fp = fopen(ALLOCINFO_PROC, "r");
> +	char line[MAX_LINE_LEN];
> +	int matches;
> +	struct allocinfo_tag_data procfs_entry;
> +
> +	if (!fp) {
> +		ksft_print_msg("Failed to open " ALLOCINFO_PROC " for reading\n");
> +		return 1;
> +	}
> +	memset(procfs_entries, 0, sizeof(*procfs_entries));
> +	while (fgets(line, sizeof(line), fp) && procfs_entries->count < VEC_MAX_ENTRIES) {
> +		char filename[MAX_LINE_LEN];
> +		char function[MAX_LINE_LEN];
> +
> +		memset(&procfs_entry, 0, sizeof(procfs_entry));
> +		matches = sscanf(line, "%llu %llu %[^:]:%llu func:%s",
> +				 &procfs_entry.counter.bytes,
> +				 &procfs_entry.counter.calls,
> +				 filename,
> +				 &procfs_entry.tag.lineno,
> +				 function);
> +
> +		if (matches != 5)
> +			continue;
> +
> +		allocinfo_copy_str(procfs_entry.tag.filename, filename);
> +		allocinfo_copy_str(procfs_entry.tag.function, function);
> +
> +		if (filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) {
> +			if (strncmp(procfs_entry.tag.filename,
> +				    filter->fields.filename, ALLOCINFO_STR_SIZE))
> +				continue;
> +		}
> +		if (filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) {
> +			if (strncmp(procfs_entry.tag.function,
> +				    filter->fields.function, ALLOCINFO_STR_SIZE))
> +				continue;
> +		}
> +		if (filter->mask & ALLOCINFO_FILTER_MASK_LINENO) {
> +			if (procfs_entry.tag.lineno != filter->fields.lineno)
> +				continue;
> +		}
> +		if (filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) {
> +			if (procfs_entry.counter.bytes < filter->min_size)
> +				continue;
> +		}
> +		if (filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) {
> +			if (procfs_entry.counter.bytes > filter->max_size)
> +				continue;
> +		}
> +
> +		memcpy(&procfs_entries->tag[procfs_entries->count++], &procfs_entry,
> +		       sizeof(procfs_entry));
> +	}
> +	fclose(fp);
> +	return 0;
> +}
> +
> +static enum ioctl_ret get_filtered_ioctl_entries(struct allocinfo_tag_data_vec *tags,
> +						 const struct allocinfo_filter *filter,
> +						 __u64 start_pos)
> +{
> +	int fd = open(ALLOCINFO_PROC, O_RDONLY);
> +
> +	if (fd < 0) {
> +		ksft_print_msg("Failed to open " ALLOCINFO_PROC " for IOCTL\n");
> +		return IOCTL_FAILURE;
> +	}
> +	struct allocinfo_content_id start_cont_id, end_cont_id;
> +	struct allocinfo_get_at get_at_params;
> +	const int max_retries = 10;
> +	int retry_count = 0;
> +	int status;
> +
> +	/*
> +	 * __allocinfo_get_content_id may return different values if a kernel module was loaded
> +	 * between the two calls. If that happens, the data gathered cannot be considered consistent
> +	 * and hence needs to be fetched again to avoid flakiness.
> +	 */
> +	do {
> +		if (__allocinfo_get_content_id(fd, &start_cont_id)) {
> +			ksft_print_msg("allocinfo_get_content_id failed\n");
> +			return IOCTL_FAILURE;

Sashiko pointed out the fd leak in get_filtered_ioctl_entries(): the
early "return IOCTL_FAILURE" paths skip the close(fd) at the end of the
function.

Since we already need to update the patch to fix the missing TARGETS
entry in the top-level selftests Makefile, we might as well fix this at
the same time.
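
One way to close the leak is to route every failure through a single exit
label. The sketch below is not the selftest code: fetch_entries() and
fake_step() are hypothetical stand-ins for get_filtered_ioctl_entries() and
its ioctl calls, used only to show the shape of the fix.

```c
#include <fcntl.h>
#include <unistd.h>

enum ioctl_ret {
	IOCTL_SUCCESS = 0,
	IOCTL_FAILURE = 1,
	IOCTL_INVALID_DATA = 2,
};

/* Hypothetical stand-in for one ioctl step; fails only when asked to. */
static int fake_step(int fd, int should_fail)
{
	(void)fd;
	return should_fail ? -1 : 0;
}

/* Hypothetical stand-in for get_filtered_ioctl_entries(). */
static enum ioctl_ret fetch_entries(const char *path, int fail_first,
				    int fail_second)
{
	enum ioctl_ret status = IOCTL_FAILURE;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return IOCTL_FAILURE;	/* nothing to close yet */

	if (fake_step(fd, fail_first))
		goto out;
	if (fake_step(fd, fail_second))
		goto out;

	status = IOCTL_SUCCESS;
out:
	/* Every path that opened fd reaches this close(). */
	close(fd);
	return status;
}
```

In the real function, each "return IOCTL_FAILURE" inside the retry loop
would become "status = IOCTL_FAILURE; goto out;" (or an equivalent break),
which keeps the existing close(fd) as the only exit.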


Thanks

Best Regards

Hao

> +		}
> +
> +		memset(tags, 0, sizeof(*tags));
> +		memset(&get_at_params, 0, sizeof(get_at_params));
> +		memcpy(&get_at_params.filter, filter, sizeof(*filter));
> +		get_at_params.pos = start_pos;
> +		if (__allocinfo_get_at(fd, &get_at_params)) {
> +			ksft_print_msg("allocinfo_get_at failed\n");
> +			return IOCTL_FAILURE;
> +		}
> +		memcpy(&tags->tag[tags->count++], &get_at_params.data, sizeof(get_at_params.data));
> +
> +		while (tags->count < VEC_MAX_ENTRIES &&
> +		       __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
> +			tags->count++;
> +
> +		if (__allocinfo_get_content_id(fd, &end_cont_id)) {
> +			ksft_print_msg("allocinfo_get_content_id failed\n");
> +			return IOCTL_FAILURE;
> +		}
> +
> +		if (start_cont_id.id == end_cont_id.id) {
> +			status = IOCTL_SUCCESS;
> +		} else {
> +			ksft_print_msg("allocinfo_get_content_id mismatch, retrying...\n");
> +			status = IOCTL_INVALID_DATA;
> +		}
> +	} while (status == IOCTL_INVALID_DATA && retry_count++ < max_retries);
> +
> +	close(fd);
> +	return status;
> +}
> +
> +static int run_filter_test(const struct allocinfo_filter *filter)
> +{
> +	struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> +	struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> +	int ioctl_status;
> +	int ret = KSFT_PASS;
> +
> +	if (!tags || !procfs_entries) {
> +		ksft_print_msg("Memory allocation failed.\n");
> +		ret = KSFT_FAIL;
> +		goto exit;
> +	}
> +
> +	if (get_filtered_procfs_entries(procfs_entries, filter)) {
> +		ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> +		ret = KSFT_SKIP;
> +		goto exit;
> +	}
> +
> +	if (procfs_entries->count == 0) {
> +		ksft_print_msg("No entries found in " ALLOCINFO_PROC ", skipping test\n");
> +		ret = KSFT_SKIP;
> +		goto exit;
> +	}
> +
> +	ioctl_status = get_filtered_ioctl_entries(tags, filter, 0);
> +	if (ioctl_status == IOCTL_INVALID_DATA) {
> +		ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
> +		ret = KSFT_SKIP;
> +		goto exit;
> +	}
> +	if (ioctl_status == IOCTL_FAILURE) {
> +		ksft_print_msg("Error retrieving IOCTL entries.\n");
> +		ret = KSFT_FAIL;
> +		goto exit;
> +	}
> +
> +	if (!match_entries(procfs_entries, tags, false, false, true, true, true))
> +		ret = KSFT_FAIL;
> +
> +exit:
> +	free(tags);
> +	free(procfs_entries);
> +	return ret;
> +}
> +
> +static int test_filename_filter(void)
> +{
> +	struct allocinfo_filter filter;
> +	const char *target_filename = "mm/memory.c";
> +
> +	memset(&filter, 0, sizeof(filter));
> +	filter.mask |= ALLOCINFO_FILTER_MASK_FILENAME;
> +	strncpy(filter.fields.filename, target_filename, ALLOCINFO_STR_SIZE);
> +
> +	return run_filter_test(&filter);
> +}
> +
> +static int test_function_filter(void)
> +{
> +	struct allocinfo_filter filter;
> +	const char *target_function = "dup_mm";
> +
> +	memset(&filter, 0, sizeof(filter));
> +	filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
> +	strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
> +
> +	return run_filter_test(&filter);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int ret;
> +
> +	ksft_set_plan(2);
> +
> +	ret = test_filename_filter();
> +	if (ret == KSFT_SKIP)
> +		ksft_test_result_skip("Skipping test_filename_filter\n");
> +	else
> +		ksft_test_result(ret == KSFT_PASS, "test_filename_filter\n");
> +
> +	ret = test_function_filter();
> +	if (ret == KSFT_SKIP)
> +		ksft_test_result_skip("Skipping test_function_filter\n");
> +	else
> +		ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
> +
> +	ksft_finished();
> +}

^ permalink raw reply

* Re: [RFC PATCH 0/5] mm/slub: preserve previous object lifetime
From: Harry Yoo @ 2026-06-16  4:15 UTC (permalink / raw)
  To: Pengpeng Hou, Vlastimil Babka (SUSE), Andrew Morton, linux-mm
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	David Hildenbrand, Lorenzo Stoakes, liam, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jonathan Corbet, Shuah Khan,
	linux-doc, linux-kernel
In-Reply-To: <20260615061202.23715-1-pengpeng@iscas.ac.cn>


[-- Attachment #1.1: Type: text/plain, Size: 1667 bytes --]



On 6/15/26 3:12 PM, Pengpeng Hou wrote:
> Hi Vlastimil, Harry,
> 
> Thanks for the feedback.
> 
> I agree that the terminology in the RFC cover letter was not precise
> enough. The case I was trying to describe is a duplicate/stale free by a
> previous owner after the object has already been freed and then reused by
> another user. In that case, the current SLAB_STORE_USER records can show
> the current allocation and the later bad free/check, but the previous
> completed alloc/free lifetime that explains where the stale pointer came
> from has already been overwritten.

I was confused, but I see now, thanks for clarifying :)

> This is not intended to compete with KASAN or infer semantic ownership.
> KASAN is better when it can be used, but the motivation here is the lower
> barrier of enabling slub_debug for a specific cache on an existing kernel,
> especially in field debugging environments.

Makes sense.

> Based on your comments, I will rework the non-RFC version to fold this
> into the existing U tracking instead of adding a separate H option, unless
> there is a preference for keeping the extra history behind an explicit
> flag.

Ack.

> I will keep the scope to one previous completed lifetime and avoid a
> larger history table/ring for now.

Ack.

> I will also add a small reproducer or KUnit coverage showing the lost
> previous-lifetime case,

Ack.

> plus object-size/order comparison data for a few
> representative caches.

I think we don't care much about the size on debug caches.

Looking forward to seeing the next version, Pengpeng.

Thanks!

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [GIT PULL] Documentation for 7.2
From: pr-tracker-bot @ 2026-06-16  3:10 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Linus Torvalds, Shuah Khan, linux-doc, linux-kernel
In-Reply-To: <874ij3xxtk.fsf@trenco.lwn.net>

The pull request you sent on Mon, 15 Jun 2026 16:07:03 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux.git tags/docs-7.2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/a87bbc4578fd686d535fbd62e8bc73fc6c7c5415

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: YoungJun Park @ 2026-06-16  3:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy, youngjun.park
In-Reply-To: <ajCgzNIPLhjTRSXR@yjaykim-PowerEdge-T330>

...
> - "zswap tier only": Only zswap is allowed. Fallback to other swap is
>   blocked.
> - "zswap writeback disabled": zswap is allowed, but if zswap_store()
>   fails, pages can still fall back to other swap devices.

Upon double-checking the code, my previous clarification was wrong.
You are right. Sorry for the confusion. "zswap tier only" is indeed
equivalent to "zswap writeback disabled".
(I'm not sure why I read the code that way...)

As I initially thought, it might be possible to replace the zswap writeback
control with the tiering mechanism.

If we need to keep the existing interface, we can integrate or share the
underlying logic (though the specific details need more thought anyway).

It can be summarized as follows:

- "zswap tier only" + "zswap writeback disable" -> meaningless (noop)
- "zswap tier only" + "zswap writeback enable" -> meaningless (no writeback backend exists)
- "zswap tier with other tiers" + "zswap writeback disable" -> uses only zswap
  (can be replaced by "zswap tier only"; this code could be integrated or modified)
- "zswap tier with other tiers" + "zswap writeback enable" -> works as is

As mentioned in the previous email, the zswap tier on/off control comes as a
bonus (though, as you pointed out, we may need to discuss if it's actually
needed).
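
The four combinations listed above appear to reduce to one rule: writeback
can only take effect when another tier exists and writeback is enabled. A
minimal sketch in C, where zswap_writeback_effective() and both parameter
names are hypothetical and not existing kernel symbols:

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel symbol: models the four combinations.
 * has_other_tiers:   swap tiers besides zswap are allowed.
 * writeback_enabled: the existing zswap writeback control is on.
 */
static bool zswap_writeback_effective(bool has_other_tiers,
				      bool writeback_enabled)
{
	/* With zswap as the only tier there is no backend to write back to. */
	if (!has_other_tiers)
		return false;

	/* Otherwise the existing writeback control decides. */
	return writeback_enabled;
}
```

Only the last combination returns true, which is why the writeback control
looks redundant once "zswap tier only" can be expressed through tiering.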

BR,
Youngjun

^ permalink raw reply

* Re: [PATCH v5 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Hao Ge @ 2026-06-16  1:40 UTC (permalink / raw)
  To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
	Kent Overstreet
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda
In-Reply-To: <fa1fe7d869e2ff45907b271ac4066aa0339d037c.1781564384.git.abhishekbapat@google.com>

Hi Abhishek


On 2026/6/16 07:04, Abhishek Bapat wrote:
> From: Suren Baghdasaryan <surenb@google.com>
>
> Add the following ioctl commands for /proc/allocinfo file:
>
> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> to check whether the file content has changed specifically due to module
> load/unload. Every time a module is loaded / unloaded, the returned
> value will be different. By comparing the identifier value at the
> beginning and at the end of the content retrieval operation, users can
> validate retrieved information for consistency.
>
> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> is the position of a record in /proc/allocinfo.
>
> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> one. If no records were previously retrieved, returns the first
> record.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>


Thanks for updating the patch, LGTM.

Acked-by: Hao Ge <hao.ge@linux.dev>
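
For reference, the three commands in the quoted commit message could be
used from userspace roughly as follows. This is only a sketch: the
structures and ioctl numbers are copied from the
include/uapi/linux/alloc_tag.h proposed in this patch, so they may change
before merging, and dump_allocinfo() is a hypothetical helper name.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Copied from the proposed include/uapi/linux/alloc_tag.h in this patch. */
#define ALLOCINFO_STR_SIZE	64

struct allocinfo_content_id { __u64 id; };
struct allocinfo_tag {
	char modname[ALLOCINFO_STR_SIZE];
	char function[ALLOCINFO_STR_SIZE];
	char filename[ALLOCINFO_STR_SIZE];
	__u64 lineno;
};
struct allocinfo_counter {
	__u64 bytes;
	__u64 calls;
	__u8 accurate;
} __attribute__((aligned(8)));
struct allocinfo_tag_data {
	struct allocinfo_tag tag;
	struct allocinfo_counter counter;
};
struct allocinfo_get_at {
	__u64 pos;	/* input */
	struct allocinfo_tag_data data;
};

#define ALLOCINFO_IOC_BASE	 0xA6
#define ALLOCINFO_IOC_CONTENT_ID _IOR(ALLOCINFO_IOC_BASE, 0, struct allocinfo_content_id)
#define ALLOCINFO_IOC_GET_AT	 _IOWR(ALLOCINFO_IOC_BASE, 1, struct allocinfo_get_at)
#define ALLOCINFO_IOC_GET_NEXT	 _IOR(ALLOCINFO_IOC_BASE, 2, struct allocinfo_tag_data)

static void print_tag(const struct allocinfo_tag_data *d)
{
	printf("%s:%llu func:%s\n", d->tag.filename,
	       (unsigned long long)d->tag.lineno, d->tag.function);
}

/*
 * Hypothetical helper: prints up to max records starting at position 0.
 * Returns the number printed, or -1 on error or if the content ID changed
 * while reading (the records may then be inconsistent).
 */
static int dump_allocinfo(const char *path, int max)
{
	struct allocinfo_content_id start, end;
	struct allocinfo_get_at at = { .pos = 0 };
	struct allocinfo_tag_data next;
	int n = -1;
	int fd;

	if (max < 1)
		return 0;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (ioctl(fd, ALLOCINFO_IOC_CONTENT_ID, &start) ||
	    ioctl(fd, ALLOCINFO_IOC_GET_AT, &at))
		goto out;
	print_tag(&at.data);
	n = 1;

	/* GET_NEXT continues from the record returned by GET_AT. */
	while (n < max && ioctl(fd, ALLOCINFO_IOC_GET_NEXT, &next) == 0) {
		print_tag(&next);
		n++;
	}

	/* A different ID means a module was loaded or unloaded meanwhile. */
	if (ioctl(fd, ALLOCINFO_IOC_CONTENT_ID, &end) || start.id != end.id)
		n = -1;
out:
	close(fd);
	return n;
}
```

A caller would typically retry dump_allocinfo("/proc/allocinfo", N) a few
times when it returns -1, as the selftest in patch 5/6 does.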


> ---
>   Documentation/mm/allocation-profiling.rst     |   5 +
>   .../userspace-api/ioctl/ioctl-number.rst      |   2 +
>   MAINTAINERS                                   |   1 +
>   include/linux/codetag.h                       |   2 +
>   include/uapi/linux/alloc_tag.h                |  60 +++++
>   lib/alloc_tag.c                               | 235 +++++++++++++++++-
>   lib/codetag.c                                 |  18 ++
>   7 files changed, 321 insertions(+), 2 deletions(-)
>   create mode 100644 include/uapi/linux/alloc_tag.h
>
> diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst
> index 5389d241176a..c3a28467955f 100644
> --- a/Documentation/mm/allocation-profiling.rst
> +++ b/Documentation/mm/allocation-profiling.rst
> @@ -46,6 +46,11 @@ sysctl:
>   Runtime info:
>     /proc/allocinfo
>   
> +  Profiling data can be retrieved either by reading `/proc/allocinfo` directly as
> +  text or programmatically via `ioctl()` calls defined in `<uapi/linux/alloc_tag.h>`.
> +  The ioctl interface supports structured binary data extraction as well as filtering
> +  by module name, function, file, line number, accuracy, or allocation size limits.
> +
>   Example output::
>   
>     root@moria-kvm:~# sort -g /proc/allocinfo|tail|numfmt --to=iec
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 331223761fff..84f6808a8578 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -349,6 +349,8 @@ Code  Seq#    Include File                                             Comments
>                                                                          <mailto:luzmaximilian@gmail.com>
>   0xA5  20-2F  linux/surface_aggregator/dtx.h                            Microsoft Surface DTX driver
>                                                                          <mailto:luzmaximilian@gmail.com>
> +0xA6  00-0F  uapi/linux/alloc_tag.h                                    Memory allocation profiling
> +                                                                       <mailto:surenb@google.com>
>   0xAA  00-3F  linux/uapi/linux/userfaultfd.h
>   0xAB  00-1F  linux/nbd.h
>   0xAC  00-1F  linux/raw.h
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 65bd4328fe05..019cc4c285a3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16713,6 +16713,7 @@ S:	Maintained
>   F:	Documentation/mm/allocation-profiling.rst
>   F:	include/linux/alloc_tag.h
>   F:	include/linux/pgalloc_tag.h
> +F:	include/uapi/linux/alloc_tag.h
>   F:	lib/alloc_tag.c
>   
>   MEMORY CONTROLLER DRIVERS
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index ddae7484ca45..a25a085c2df1 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -77,6 +77,8 @@ struct codetag_iterator {
>   void codetag_lock_module_list(struct codetag_type *cttype);
>   bool codetag_trylock_module_list(struct codetag_type *cttype);
>   void codetag_unlock_module_list(struct codetag_type *cttype);
> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> +unsigned int codetag_get_count(struct codetag_type *cttype);
>   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
>   struct codetag *codetag_next_ct(struct codetag_iterator *iter);
>   
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> new file mode 100644
> index 000000000000..0928e1a48d49
> --- /dev/null
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * alloc_tag IOCTL API definition
> + *
> + * Copyright (C) 2026 Google, LLC.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _UAPI_ALLOC_TAG_H
> +#define _UAPI_ALLOC_TAG_H
> +
> +#include <linux/types.h>
> +
> +#define ALLOCINFO_STR_SIZE	64
> +
> +struct allocinfo_content_id {
> +	__u64 id;
> +};
> +
> +struct allocinfo_tag {
> +	/* Longer names are trimmed */
> +	char modname[ALLOCINFO_STR_SIZE];
> +	char function[ALLOCINFO_STR_SIZE];
> +	char filename[ALLOCINFO_STR_SIZE];
> +	__u64 lineno;
> +};
> +
> +/* The alignment ensures 32-bit compatible interfaces are not broken */
> +struct allocinfo_counter {
> +	__u64 bytes;
> +	__u64 calls;
> +	__u8 accurate;
> +} __attribute__((aligned(8)));
> +
> +struct allocinfo_tag_data {
> +	struct allocinfo_tag tag;
> +	struct allocinfo_counter counter;
> +};
> +
> +struct allocinfo_get_at {
> +	__u64 pos;	/* input */
> +	struct allocinfo_tag_data data;
> +};
> +
> +#define _ALLOCINFO_IOC_CONTENT_ID	0
> +#define _ALLOCINFO_IOC_GET_AT		1
> +#define _ALLOCINFO_IOC_GET_NEXT		2
> +
> +#define ALLOCINFO_IOC_BASE		0xA6
> +#define ALLOCINFO_IOC_CONTENT_ID	_IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID,	\
> +					     struct allocinfo_content_id)
> +#define ALLOCINFO_IOC_GET_AT		_IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT,	\
> +					      struct allocinfo_get_at)
> +#define ALLOCINFO_IOC_GET_NEXT		_IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT,	\
> +					     struct allocinfo_tag_data)
> +
> +#endif /* _UAPI_ALLOC_TAG_H */
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index d9be1cf5187d..82e3b5f32dff 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -5,6 +5,7 @@
>   #include <linux/gfp.h>
>   #include <linux/kallsyms.h>
>   #include <linux/module.h>
> +#include <linux/mutex.h>
>   #include <linux/page_ext.h>
>   #include <linux/pgalloc_tag.h>
>   #include <linux/proc_fs.h>
> @@ -14,6 +15,7 @@
>   #include <linux/string_choices.h>
>   #include <linux/vmalloc.h>
>   #include <linux/kmemleak.h>
> +#include <uapi/linux/alloc_tag.h>
>   
>   #define ALLOCINFO_FILE_NAME		"allocinfo"
>   #define MODULE_ALLOC_TAG_VMAP_SIZE	(100000UL * sizeof(struct alloc_tag))
> @@ -47,6 +49,10 @@ struct allocinfo_private {
>   	struct codetag_iterator iter;
>   	struct codetag_iterator reported_iter;
>   	bool print_header;
> +	/* ioctl uses a separate iterator not to interfere with reads */
> +	struct codetag_iterator ioctl_iter;
> +	bool positioned; /* seq_open_private() sets to 0 */
> +	struct mutex ioctl_lock;
>   };
>   
>   static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> @@ -130,6 +136,232 @@ static const struct seq_operations allocinfo_seq_op = {
>   	.show	= allocinfo_show,
>   };
>   
> +/*
> + * Initializes seq_file operations and allocates private state when opening
> + * the /proc/allocinfo procfs entry.
> + */
> +static int allocinfo_open(struct inode *inode, struct file *file)
> +{
> +	int ret;
> +
> +	ret = seq_open_private(file, &allocinfo_seq_op,
> +			       sizeof(struct allocinfo_private));
> +	if (!ret) {
> +		struct seq_file *m = file->private_data;
> +		struct allocinfo_private *priv = m->private;
> +
> +		mutex_init(&priv->ioctl_lock);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Cleans up the seq_file state and frees up the private state allocated in
> + * allocinfo_open() when closing the /proc/allocinfo file descriptor.
> + */
> +static int allocinfo_release(struct inode *inode, struct file *file)
> +{
> +	struct seq_file *m = file->private_data;
> +	struct allocinfo_private *priv = m->private;
> +
> +	mutex_destroy(&priv->ioctl_lock);
> +	return seq_release_private(inode, file);
> +}
> +
> +/*
> + * Returns a pointer to the suffix of a string so that its length fits within
> + * ALLOCINFO_STR_SIZE, preserving the trailing characters.
> + */
> +static const char *allocinfo_str(const char *str)
> +{
> +	size_t len = strlen(str);
> +
> +	/* Keep an extra space for the trailing NULL. */
> +	if (len >= ALLOCINFO_STR_SIZE)
> +		str += (len - ALLOCINFO_STR_SIZE) + 1;
> +	return str;
> +}
> +
> +/* Copy a string and trim from the beginning if it's too long */
> +static void allocinfo_copy_str(char *dest, const char *src)
> +{
> +	strscpy_pad(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> +}
> +
> +/*
> + * Populates the UAPI allocinfo_tag_data structure with active runtime
> + * profiling counters extracted from the given kernel codetag.
> + */
> +static void allocinfo_to_params(struct codetag *ct,
> +				struct allocinfo_tag_data *data)
> +{
> +	struct alloc_tag *tag = ct_to_alloc_tag(ct);
> +	struct alloc_tag_counters counter = alloc_tag_read(tag);
> +
> +	if (ct->modname)
> +		allocinfo_copy_str(data->tag.modname, ct->modname);
> +	else
> +		data->tag.modname[0] = '\0';
> +	allocinfo_copy_str(data->tag.function, ct->function);
> +	allocinfo_copy_str(data->tag.filename, ct->filename);
> +	data->tag.lineno = ct->lineno;
> +	data->counter.bytes = counter.bytes;
> +	data->counter.calls = counter.calls;
> +	data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> +}
> +
> +/*
> + * Retrieves the unique content ID representing the current allocation tag module
> + * layout, allowing userspace to detect if modules were loaded / unloaded.
> + */
> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> +{
> +	struct allocinfo_content_id params;
> +
> +	codetag_lock_module_list(alloc_tag_cttype);
> +	params.id = codetag_get_content_id(alloc_tag_cttype);
> +	codetag_unlock_module_list(alloc_tag_cttype);
> +	if (copy_to_user(arg, &params, sizeof(params)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +/*
> + * Seeks the ioctl iterator to the specified 0-indexed tag position, reads its
> + * profiling data and returns it to userspace.
> + */
> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> +{
> +	struct allocinfo_private *priv;
> +	struct codetag *ct;
> +	__u64 pos;
> +	struct allocinfo_get_at params = {0};
> +
> +	if (copy_from_user(&params, arg, sizeof(params)))
> +		return -EFAULT;
> +
> +	priv = m->private;
> +	pos = params.pos;
> +
> +	mutex_lock(&priv->ioctl_lock);
> +	codetag_lock_module_list(alloc_tag_cttype);
> +
> +	if (pos >= codetag_get_count(alloc_tag_cttype)) {
> +		codetag_unlock_module_list(alloc_tag_cttype);
> +		mutex_unlock(&priv->ioctl_lock);
> +		return -ENOENT;
> +	}
> +
> +	/* Find the codetag */
> +	priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> +	ct = codetag_next_ct(&priv->ioctl_iter);
> +	while (ct && pos--)
> +		ct = codetag_next_ct(&priv->ioctl_iter);
> +	if (ct) {
> +		allocinfo_to_params(ct, &params.data);
> +		priv->positioned = true;
> +	}
> +
> +	codetag_unlock_module_list(alloc_tag_cttype);
> +	mutex_unlock(&priv->ioctl_lock);
> +
> +	if (!ct)
> +		return -ENOENT;
> +
> +	if (copy_to_user(arg, &params, sizeof(params)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +/*
> + * Advances the ioctl iterator to the next allocation tag in the sequence and
> + * returns its profiling data to userspace.
> + */
> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> +{
> +	struct allocinfo_private *priv;
> +	struct codetag *ct;
> +	struct allocinfo_tag_data params;
> +	int ret = 0;
> +
> +	memset(&params, 0, sizeof(params));
> +	priv = m->private;
> +
> +	mutex_lock(&priv->ioctl_lock);
> +	codetag_lock_module_list(alloc_tag_cttype);
> +
> +	if (!priv->positioned) {
> +		priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> +		priv->positioned = true;
> +	}
> +
> +	ct = codetag_next_ct(&priv->ioctl_iter);
> +	if (ct)
> +		allocinfo_to_params(ct, &params);
> +
> +	if (!ct) {
> +		priv->positioned = false;
> +		ret = -ENOENT;
> +	}
> +	codetag_unlock_module_list(alloc_tag_cttype);
> +	mutex_unlock(&priv->ioctl_lock);
> +
> +	if (ret == 0) {
> +		if (copy_to_user(arg, &params, sizeof(params)))
> +			return -EFAULT;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Entry point ioctl function for /proc/allocinfo routing requests to fetch the
> + * layout content ID, seek to a specific tag, or read sequential tags.
> + */
> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long __arg)
> +{
> +	void __user *arg = (void __user *)__arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case ALLOCINFO_IOC_CONTENT_ID:
> +		ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> +		break;
> +	case ALLOCINFO_IOC_GET_AT:
> +		ret = allocinfo_ioctl_get_at(file->private_data, arg);
> +		break;
> +	case ALLOCINFO_IOC_GET_NEXT:
> +		ret = allocinfo_ioctl_get_next(file->private_data, arg);
> +		break;
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> +				   unsigned long arg)
> +{
> +	return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct proc_ops allocinfo_proc_ops = {
> +	.proc_open		= allocinfo_open,
> +	.proc_read_iter		= seq_read_iter,
> +	.proc_lseek		= seq_lseek,
> +	.proc_release		= allocinfo_release,
> +	.proc_ioctl		= allocinfo_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.proc_compat_ioctl	= allocinfo_compat_ioctl,
> +#endif
> +};
> +
>   size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
>   {
>   	struct codetag_iterator iter;
> @@ -993,8 +1225,7 @@ static int __init alloc_tag_init(void)
>   		return 0;
>   	}
>   
> -	if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> -				     sizeof(struct allocinfo_private), NULL)) {
> +	if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
>   		pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
>   		shutdown_mem_profiling(false);
>   		return -ENOMEM;
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 4001a7ea6675..a9cda4c962a3 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -19,6 +19,8 @@ struct codetag_type {
>   	struct codetag_type_desc desc;
>   	/* generates unique sequence number for module load */
>   	unsigned long next_mod_seq;
> +	/* bumped on every module load and unload */
> +	unsigned long content_id;
>   };
>   
>   struct codetag_range {
> @@ -50,6 +52,20 @@ void codetag_unlock_module_list(struct codetag_type *cttype)
>   	up_read(&cttype->mod_lock);
>   }
>   
> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> +{
> +	lockdep_assert_held(&cttype->mod_lock);
> +
> +	return cttype->content_id;
> +}
> +
> +unsigned int codetag_get_count(struct codetag_type *cttype)
> +{
> +	lockdep_assert_held(&cttype->mod_lock);
> +
> +	return cttype->count;
> +}
> +
>   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
>   {
>   	struct codetag_iterator iter = {
> @@ -204,6 +220,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
>   
>   	down_write(&cttype->mod_lock);
>   	cmod->mod_seq = ++cttype->next_mod_seq;
> +	++cttype->content_id;
>   	mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
>   	if (mod_id >= 0) {
>   		if (cttype->desc.module_load) {
> @@ -368,6 +385,7 @@ void codetag_unload_module(struct module *mod)
>   			cttype->count -= range_size(cttype, &cmod->range);
>   			idr_remove(&cttype->mod_idr, mod_id);
>   			kfree(cmod);
> +			++cttype->content_id;
>   		}
>   		up_write(&cttype->mod_lock);
>   		if (found && cttype->desc.free_section_mem)
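
For anyone following the cursor semantics of the ioctls quoted above, here
is a minimal userspace model, not the kernel code. It assumes a flat array
in place of the codetag module list, and every name in it (model_table,
model_cursor, model_load, model_get_at, model_get_next) is hypothetical and
does not appear in the patch. It only mirrors three behaviours visible in
the quoted hunks: a seek to a 0-indexed position leaves the cursor on that
entry, a sequential read past the last entry reports -ENOENT and resets the
cursor, and content_id changes whenever the set of entries changes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patch. */
#define MODEL_MAX_TAGS 8

struct model_table {
	unsigned long content_id;	/* bumped on every load, as in codetag.c */
	size_t count;
	int tags[MODEL_MAX_TAGS];
};

struct model_cursor {
	size_t next;		/* index the next sequential read returns */
	bool positioned;	/* false until a seek or a first read */
};

/* Model of a module load: the table changes, so the content ID changes. */
static void model_load(struct model_table *t, int tag)
{
	if (t->count < MODEL_MAX_TAGS)
		t->tags[t->count++] = tag;
	t->content_id++;
}

/* Model of GET_AT: fetch the entry at a 0-indexed position and park there. */
static int model_get_at(const struct model_table *t, struct model_cursor *c,
			size_t pos, int *out)
{
	if (pos >= t->count)
		return -ENOENT;	/* cursor is left untouched on this path */

	*out = t->tags[pos];
	c->next = pos + 1;
	c->positioned = true;
	return 0;
}

/* Model of GET_NEXT: start from the first entry if never positioned. */
static int model_get_next(const struct model_table *t, struct model_cursor *c,
			  int *out)
{
	if (!c->positioned) {
		c->next = 0;
		c->positioned = true;
	}

	if (c->next >= t->count) {
		c->positioned = false;	/* the next call restarts at entry 0 */
		return -ENOENT;
	}

	*out = t->tags[c->next++];
	return 0;
}
```

Under this model, a reader that wants a consistent snapshot would record
content_id before walking the entries and compare it afterwards; a mismatch
means a load or unload happened in between and the walk should be retried.
Whether userspace is expected to use the real interface that way is an
assumption on my side.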
