Linux Trace Kernel
* Re: [PATCHv3 bpf-next 06/24] bpf: Add multi tracing attach types
From: kernel test robot @ 2026-03-19 16:31 UTC
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: oe-kbuild-all, bpf, linux-trace-kernel, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Menglong Dong,
	Steven Rostedt
In-Reply-To: <20260316075138.465430-7-jolsa@kernel.org>

Hi Jiri,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Jiri-Olsa/ftrace-Add-ftrace_hash_count-function/20260316-160117
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20260316075138.465430-7-jolsa%40kernel.org
patch subject: [PATCHv3 bpf-next 06/24] bpf: Add multi tracing attach types
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20260320/202603200034.0g8Ml43R-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260320/202603200034.0g8Ml43R-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603200034.0g8Ml43R-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/syscall.c: In function 'bpf_prog_load':
>> kernel/bpf/syscall.c:2967:22: error: implicit declaration of function 'is_tracing_multi' [-Wimplicit-function-declaration]
    2967 |         multi_func = is_tracing_multi(attr->expected_attach_type);
         |                      ^~~~~~~~~~~~~~~~
--
   kernel/bpf/verifier.c: In function 'is_tracing_multi_id':
>> kernel/bpf/verifier.c:25059:16: error: implicit declaration of function 'is_tracing_multi'; did you mean 'is_tracing_multi_id'? [-Wimplicit-function-declaration]
   25059 |         return is_tracing_multi(prog->expected_attach_type) && bpf_multi_func_btf_id[0] == btf_id;
         |                ^~~~~~~~~~~~~~~~
         |                is_tracing_multi_id


vim +/is_tracing_multi +2967 kernel/bpf/syscall.c

  2890	
  2891	static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
  2892	{
  2893		enum bpf_prog_type type = attr->prog_type;
  2894		struct bpf_prog *prog, *dst_prog = NULL;
  2895		struct btf *attach_btf = NULL;
  2896		struct bpf_token *token = NULL;
  2897		bool bpf_cap;
  2898		int err;
  2899		char license[128];
  2900		bool multi_func;
  2901	
  2902		if (CHECK_ATTR(BPF_PROG_LOAD))
  2903			return -EINVAL;
  2904	
  2905		if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT |
  2906					 BPF_F_ANY_ALIGNMENT |
  2907					 BPF_F_TEST_STATE_FREQ |
  2908					 BPF_F_SLEEPABLE |
  2909					 BPF_F_TEST_RND_HI32 |
  2910					 BPF_F_XDP_HAS_FRAGS |
  2911					 BPF_F_XDP_DEV_BOUND_ONLY |
  2912					 BPF_F_TEST_REG_INVARIANTS |
  2913					 BPF_F_TOKEN_FD))
  2914			return -EINVAL;
  2915	
  2916		bpf_prog_load_fixup_attach_type(attr);
  2917	
  2918		if (attr->prog_flags & BPF_F_TOKEN_FD) {
  2919			token = bpf_token_get_from_fd(attr->prog_token_fd);
  2920			if (IS_ERR(token))
  2921				return PTR_ERR(token);
  2922			/* if current token doesn't grant prog loading permissions,
  2923			 * then we can't use this token, so ignore it and rely on
  2924			 * system-wide capabilities checks
  2925			 */
  2926			if (!bpf_token_allow_cmd(token, BPF_PROG_LOAD) ||
  2927			    !bpf_token_allow_prog_type(token, attr->prog_type,
  2928						       attr->expected_attach_type)) {
  2929				bpf_token_put(token);
  2930				token = NULL;
  2931			}
  2932		}
  2933	
  2934		bpf_cap = bpf_token_capable(token, CAP_BPF);
  2935		err = -EPERM;
  2936	
  2937		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
  2938		    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
  2939		    !bpf_cap)
  2940			goto put_token;
  2941	
  2942		/* Intent here is for unprivileged_bpf_disabled to block BPF program
  2943		 * creation for unprivileged users; other actions depend
  2944		 * on fd availability and access to bpffs, so are dependent on
  2945		 * object creation success. Even with unprivileged BPF disabled,
  2946		 * capability checks are still carried out for these
  2947		 * and other operations.
  2948		 */
  2949		if (sysctl_unprivileged_bpf_disabled && !bpf_cap)
  2950			goto put_token;
  2951	
  2952		if (attr->insn_cnt == 0 ||
  2953		    attr->insn_cnt > (bpf_cap ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS)) {
  2954			err = -E2BIG;
  2955			goto put_token;
  2956		}
  2957		if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
  2958		    type != BPF_PROG_TYPE_CGROUP_SKB &&
  2959		    !bpf_cap)
  2960			goto put_token;
  2961	
  2962		if (is_net_admin_prog_type(type) && !bpf_token_capable(token, CAP_NET_ADMIN))
  2963			goto put_token;
  2964		if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
  2965			goto put_token;
  2966	
> 2967		multi_func = is_tracing_multi(attr->expected_attach_type);
  2968	
  2969		/* attach_prog_fd/attach_btf_obj_fd can specify fd of either bpf_prog
  2970		 * or btf, we need to check which one it is
  2971		 */
  2972		if (attr->attach_prog_fd) {
  2973			dst_prog = bpf_prog_get(attr->attach_prog_fd);
  2974			if (IS_ERR(dst_prog)) {
  2975				dst_prog = NULL;
  2976				attach_btf = btf_get_by_fd(attr->attach_btf_obj_fd);
  2977				if (IS_ERR(attach_btf)) {
  2978					err = -EINVAL;
  2979					goto put_token;
  2980				}
  2981				if (!btf_is_kernel(attach_btf)) {
  2982					/* attaching through specifying bpf_prog's BTF
  2983					 * objects directly might be supported eventually
  2984					 */
  2985					btf_put(attach_btf);
  2986					err = -ENOTSUPP;
  2987					goto put_token;
  2988				}
  2989			}
  2990		} else if (attr->attach_btf_id || multi_func) {
  2991			/* fall back to vmlinux BTF, if BTF type ID is specified */
  2992			attach_btf = bpf_get_btf_vmlinux();
  2993			if (IS_ERR(attach_btf)) {
  2994				err = PTR_ERR(attach_btf);
  2995				goto put_token;
  2996			}
  2997			if (!attach_btf) {
  2998				err = -EINVAL;
  2999				goto put_token;
  3000			}
  3001			btf_get(attach_btf);
  3002		}
  3003	
  3004		if (bpf_prog_load_check_attach(type, attr->expected_attach_type,
  3005					       attach_btf, attr->attach_btf_id,
  3006					       dst_prog, multi_func)) {
  3007			if (dst_prog)
  3008				bpf_prog_put(dst_prog);
  3009			if (attach_btf)
  3010				btf_put(attach_btf);
  3011			err = -EINVAL;
  3012			goto put_token;
  3013		}
  3014	
  3015		/* plain bpf_prog allocation */
  3016		prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
  3017		if (!prog) {
  3018			if (dst_prog)
  3019				bpf_prog_put(dst_prog);
  3020			if (attach_btf)
  3021				btf_put(attach_btf);
  3022			err = -EINVAL;
  3023			goto put_token;
  3024		}
  3025	
  3026		prog->expected_attach_type = attr->expected_attach_type;
  3027		prog->sleepable = !!(attr->prog_flags & BPF_F_SLEEPABLE);
  3028		prog->aux->attach_btf = attach_btf;
  3029		prog->aux->attach_btf_id = multi_func ? bpf_multi_func_btf_id[0] : attr->attach_btf_id;
  3030		prog->aux->dst_prog = dst_prog;
  3031		prog->aux->dev_bound = !!attr->prog_ifindex;
  3032		prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
  3033	
  3034		/* move token into prog->aux, reuse taken refcnt */
  3035		prog->aux->token = token;
  3036		token = NULL;
  3037	
  3038		prog->aux->user = get_current_user();
  3039		prog->len = attr->insn_cnt;
  3040	
  3041		err = -EFAULT;
  3042		if (copy_from_bpfptr(prog->insns,
  3043				     make_bpfptr(attr->insns, uattr.is_kernel),
  3044				     bpf_prog_insn_size(prog)) != 0)
  3045			goto free_prog;
  3046		/* copy eBPF program license from user space */
  3047		if (strncpy_from_bpfptr(license,
  3048					make_bpfptr(attr->license, uattr.is_kernel),
  3049					sizeof(license) - 1) < 0)
  3050			goto free_prog;
  3051		license[sizeof(license) - 1] = 0;
  3052	
  3053		/* eBPF programs must be GPL compatible to use GPL-ed functions */
  3054		prog->gpl_compatible = license_is_gpl_compatible(license) ? 1 : 0;
  3055	
  3056		if (attr->signature) {
  3057			err = bpf_prog_verify_signature(prog, attr, uattr.is_kernel);
  3058			if (err)
  3059				goto free_prog;
  3060		}
  3061	
  3062		prog->orig_prog = NULL;
  3063		prog->jited = 0;
  3064	
  3065		atomic64_set(&prog->aux->refcnt, 1);
  3066	
  3067		if (bpf_prog_is_dev_bound(prog->aux)) {
  3068			err = bpf_prog_dev_bound_init(prog, attr);
  3069			if (err)
  3070				goto free_prog;
  3071		}
  3072	
  3073		if (type == BPF_PROG_TYPE_EXT && dst_prog &&
  3074		    bpf_prog_is_dev_bound(dst_prog->aux)) {
  3075			err = bpf_prog_dev_bound_inherit(prog, dst_prog);
  3076			if (err)
  3077				goto free_prog;
  3078		}
  3079	
  3080		/*
  3081		 * Bookkeeping for managing the program attachment chain.
  3082		 *
  3083		 * It might be tempting to set attach_tracing_prog flag at the attachment
  3084		 * time, but this will not prevent from loading bunch of tracing prog
  3085		 * first, then attach them one to another.
  3086		 *
  3087		 * The flag attach_tracing_prog is set for the whole program lifecycle, and
  3088		 * doesn't have to be cleared in bpf_tracing_link_release, since tracing
  3089		 * programs cannot change attachment target.
  3090		 */
  3091		if (type == BPF_PROG_TYPE_TRACING && dst_prog &&
  3092		    dst_prog->type == BPF_PROG_TYPE_TRACING) {
  3093			prog->aux->attach_tracing_prog = true;
  3094		}
  3095	
  3096		/* find program type: socket_filter vs tracing_filter */
  3097		err = find_prog_type(type, prog);
  3098		if (err < 0)
  3099			goto free_prog;
  3100	
  3101		prog->aux->load_time = ktime_get_boottime_ns();
  3102		err = bpf_obj_name_cpy(prog->aux->name, attr->prog_name,
  3103				       sizeof(attr->prog_name));
  3104		if (err < 0)
  3105			goto free_prog;
  3106	
  3107		err = security_bpf_prog_load(prog, attr, token, uattr.is_kernel);
  3108		if (err)
  3109			goto free_prog_sec;
  3110	
  3111		/* run eBPF verifier */
  3112		err = bpf_check(&prog, attr, uattr, uattr_size);
  3113		if (err < 0)
  3114			goto free_used_maps;
  3115	
  3116		prog = bpf_prog_select_runtime(prog, &err);
  3117		if (err < 0)
  3118			goto free_used_maps;
  3119	
  3120		err = bpf_prog_mark_insn_arrays_ready(prog);
  3121		if (err < 0)
  3122			goto free_used_maps;
  3123	
  3124		err = bpf_prog_alloc_id(prog);
  3125		if (err)
  3126			goto free_used_maps;
  3127	
  3128		/* Upon success of bpf_prog_alloc_id(), the BPF prog is
  3129		 * effectively publicly exposed. However, retrieving via
  3130		 * bpf_prog_get_fd_by_id() will take another reference,
  3131		 * therefore it cannot be gone underneath us.
  3132		 *
  3133		 * Only for the time /after/ successful bpf_prog_new_fd()
  3134		 * and before returning to userspace, we might just hold
  3135		 * one reference and any parallel close on that fd could
  3136		 * rip everything out. Hence, below notifications must
  3137		 * happen before bpf_prog_new_fd().
  3138		 *
  3139		 * Also, any failure handling from this point onwards must
  3140		 * be using bpf_prog_put() given the program is exposed.
  3141		 */
  3142		bpf_prog_kallsyms_add(prog);
  3143		perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_LOAD, 0);
  3144		bpf_audit_prog(prog, BPF_AUDIT_LOAD);
  3145	
  3146		err = bpf_prog_new_fd(prog);
  3147		if (err < 0)
  3148			bpf_prog_put(prog);
  3149		return err;
  3150	
  3151	free_used_maps:
  3152		/* In case we have subprogs, we need to wait for a grace
  3153		 * period before we can tear down JIT memory since symbols
  3154		 * are already exposed under kallsyms.
  3155		 */
  3156		__bpf_prog_put_noref(prog, prog->aux->real_func_cnt);
  3157		return err;
  3158	
  3159	free_prog_sec:
  3160		security_bpf_prog_free(prog);
  3161	free_prog:
  3162		free_uid(prog->aux->user);
  3163		if (prog->aux->attach_btf)
  3164			btf_put(prog->aux->attach_btf);
  3165		bpf_prog_free(prog);
  3166	put_token:
  3167		bpf_token_put(token);
  3168		return err;
  3169	}
  3170	
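
The report does not show why is_tracing_multi() is undeclared on
sh-allmodconfig.  One common cause for this class of error is a helper
that is only defined under a config guard while its callers are built in
every configuration.  A minimal sketch of the usual remedy follows,
assuming that is the situation here; CONFIG_EXAMPLE_TRACING_MULTI, the
enum values, and every example_* name are hypothetical and not taken from
the patch:

```c
#include <stdbool.h>

/* Hypothetical attach types; the real ones live in the BPF UAPI header. */
enum example_attach_type {
	EXAMPLE_TRACE_FENTRY,
	EXAMPLE_TRACE_FENTRY_MULTI,
	EXAMPLE_TRACE_FEXIT_MULTI,
};

#ifdef CONFIG_EXAMPLE_TRACING_MULTI
/* Real definition, only visible when the (hypothetical) option is set. */
static inline bool example_is_tracing_multi(enum example_attach_type type)
{
	return type == EXAMPLE_TRACE_FENTRY_MULTI ||
	       type == EXAMPLE_TRACE_FEXIT_MULTI;
}
#else
/* Stub so that unconditional callers still compile when the option is off. */
static inline bool example_is_tracing_multi(enum example_attach_type type)
{
	(void)type;
	return false;
}
#endif

/* A caller that is built in every configuration, like bpf_prog_load(). */
bool example_needs_vmlinux_btf(unsigned int attach_btf_id,
			       enum example_attach_type type)
{
	return attach_btf_id != 0 || example_is_tracing_multi(type);
}
```

With the option unset, the stub makes the multi check evaluate to false,
so the caller falls back to the attach_btf_id test alone.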

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH mm-unstable v15 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:59 UTC
  To: David Hildenbrand (Arm)
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, lorenzo.stoakes,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <4774da78-8349-4eda-a09b-7248e82cb26b@kernel.org>

On Wed, Mar 18, 2026 at 08:48:30PM +0100, David Hildenbrand (Arm) wrote:
> On 3/18/26 19:59, Nico Pache wrote:
> >
> >
> > On 3/17/26 4:35 AM, Lorenzo Stoakes (Oracle) wrote:
> >> On Wed, Feb 25, 2026 at 08:26:31PM -0700, Nico Pache wrote:
> >>> There are cases where, if an attempted collapse fails, all subsequent
> >>> orders are guaranteed to also fail. Avoid these collapse attempts by
> >>> bailing out early.
> >>>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>
> >> With David's concern addressed:
> >>
> >> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> >>
> >>> ---
> >>>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> >>>  1 file changed, 34 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 1c3711ed4513..388d3f2537e2 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -1492,9 +1492,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >>>  			ret = collapse_huge_page(mm, collapse_address, referenced,
> >>>  						 unmapped, cc, mmap_locked,
> >>>  						 order);
> >>> -			if (ret == SCAN_SUCCEED) {
> >>> +
> >>> +			switch (ret) {
> >>> +			/* Cases were we continue to next collapse candidate */
> >>> +			case SCAN_SUCCEED:
> >>>  				collapsed += nr_pte_entries;
> >>> +				fallthrough;
> >>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
> >>>  				continue;
> >>> +			/* Cases were lower orders might still succeed */
> >>> +			case SCAN_LACK_REFERENCED_PAGE:
> >>> +			case SCAN_EXCEED_NONE_PTE:
> >>> +			case SCAN_EXCEED_SWAP_PTE:
> >>> +			case SCAN_EXCEED_SHARED_PTE:
> >>> +			case SCAN_PAGE_LOCK:
> >>> +			case SCAN_PAGE_COUNT:
> >>> +			case SCAN_PAGE_LRU:
> >>> +			case SCAN_PAGE_NULL:
> >>> +			case SCAN_DEL_PAGE_LRU:
> >>> +			case SCAN_PTE_NON_PRESENT:
> >>> +			case SCAN_PTE_UFFD_WP:
> >>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> >>> +				goto next_order;
> >>> +			/* Cases were no further collapse is possible */
> >>> +			case SCAN_CGROUP_CHARGE_FAIL:
> >>> +			case SCAN_COPY_MC:
> >>> +			case SCAN_ADDRESS_RANGE:
> >>> +			case SCAN_NO_PTE_TABLE:
> >>> +			case SCAN_ANY_PROCESS:
> >>> +			case SCAN_VMA_NULL:
> >>> +			case SCAN_VMA_CHECK:
> >>> +			case SCAN_SCAN_ABORT:
> >>> +			case SCAN_PAGE_ANON:
> >>> +			case SCAN_PMD_MAPPED:
> >>> +			case SCAN_FAIL:
> >>> +			default:
> >>
> >> Agree with david, let's spell them out please :)
> >
> > I believe David is arguing for the opposite. To drop all these spelt out cases
> > and just leave the default case.
> >
> > @david is that correct or did I misunderstand that.
>
> Either spell all out (no default) OR add a default.
>
> I prefer to just ... use the default :)

I mean yup that's fine too I guess, all or nothing, something in between is
weird!
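
The two consistent options can be sketched as below.  This is only an
illustration with a made-up, shortened enum (example_scan_result and all
EX_* names are hypothetical), not the khugepaged code.  With every
enumerator listed and no default, GCC's -Wswitch (part of -Wall) warns
when a new enumerator is added but not handled; with a default, only the
interesting cases are listed and everything else bails out.

```c
/* Hypothetical, shortened stand-in for the scan result enum. */
enum example_scan_result {
	EX_SCAN_SUCCEED,
	EX_SCAN_PTE_MAPPED_HUGEPAGE,
	EX_SCAN_LACK_REFERENCED_PAGE,
	EX_SCAN_ALLOC_HUGE_PAGE_FAIL,
	EX_SCAN_CGROUP_CHARGE_FAIL,
	EX_SCAN_FAIL,
};

enum example_action { EX_CONTINUE, EX_NEXT_ORDER, EX_BAIL_OUT };

/* Option 1: spell out every enumerator, no default. */
enum example_action example_classify_exhaustive(enum example_scan_result ret)
{
	switch (ret) {
	case EX_SCAN_SUCCEED:
	case EX_SCAN_PTE_MAPPED_HUGEPAGE:
		return EX_CONTINUE;
	case EX_SCAN_LACK_REFERENCED_PAGE:
	case EX_SCAN_ALLOC_HUGE_PAGE_FAIL:
		return EX_NEXT_ORDER;
	case EX_SCAN_CGROUP_CHARGE_FAIL:
	case EX_SCAN_FAIL:
		return EX_BAIL_OUT;
	}
	/* Only reached for values outside the enum. */
	return EX_BAIL_OUT;
}

/* Option 2: list the non-terminal cases, let default cover the rest. */
enum example_action example_classify_default(enum example_scan_result ret)
{
	switch (ret) {
	case EX_SCAN_SUCCEED:
	case EX_SCAN_PTE_MAPPED_HUGEPAGE:
		return EX_CONTINUE;
	case EX_SCAN_LACK_REFERENCED_PAGE:
	case EX_SCAN_ALLOC_HUGE_PAGE_FAIL:
		return EX_NEXT_ORDER;
	default:
		return EX_BAIL_OUT;
	}
}
```

Both variants classify every current enumerator the same way; they differ
only in what happens when the enum grows.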

>
> --
> Cheers,
>
> David

Cheers, Lorenzo

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-03-19 15:09 UTC
  To: David Hildenbrand (Arm)
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <3342acb5-8d34-4270-98a2-866b1ff80faf@kernel.org>

On Tue, Mar 17, 2026 at 02:25:29PM +0100, David Hildenbrand (Arm) wrote:
> On 2/22/26 09:48, Gregory Price wrote:
> > Topic type: MM
> 
> Hi Gregory,
> 
> stumbling over this again, some questions whereby I'll just ignore the
> compressed RAM bits for now and focus on use cases where promotion etc
> are not relevant :)

A more concrete example up your alley:

I've since been playing with a virtio-net private node.

Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
because the entire boot region gets marked shared.  If virtio-net has
its own private node / region separate from the boot region, the boot
region is now free to be subject to KSM.

I may have that up as an example sometime before LSF, but I need to
clean up some networking stack hacks I've made to make it work.

> > 
> > N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> > explicit holes in that isolation to do useful things we couldn't do
> > before without re-implementing entire portions of mm/ in a driver.
> 
> Just to clarify: we don't currently have any mechanism to expose, say,
> SPM/PMEM/whatsoever to the buddy allocator through the dax/kmem driver
> and *not* have random allocations end up on it, correct?
>
> Assume we online the memory to ZONE_MOVABLE, still other (fallback)
> allocations might end up on that memory.
> 

Correct, when you hotplug memory into a node, it's a free for all.
Fallbacks are going to happen.

I see you saw below that one of the extensions is removing the nodes
from the fallback list.  That is part one, but it's insufficient to
prevent complete leakage (someone might iterate over the nodes-possible
list and try migrating memory).

> How would we currently handle something like that? (do we have drivers
> for that? I'd assume that drivers would only migrate some user memory to
> ZONE_DEVICE memory.)
> 
> Assuming we don't have such a mechanism, I assume that part of your
> proposal would be very interesting: online the memory to a
> "special"/"restricted" (you call it private) NUMA node, whereby all
> memory of that NUMA node will only be consumable through
> mbind() and friends.
> 

Basically the only isolation mechanism we have today is ZONE_DEVICE.

Either via mbind and friends, or even just the driver itself managing it
directly via alloc_pages_node() and exposing some userland interface.

You can imagine a network driver providing an ioctl for a shared buffer
or a driver exposing a mmap'able file descriptor as the trivial case.

> Any other allocations (including automatic page migration etc) would not
> end up on that memory.

One of the complications of exposing this memory via mbind is that
mempolicy.c has a lot of migration mechanics, just to name two:

  - migrate on mbind
  - cpuset rebinds

So for a complete solution you need to support migration if you
support mempolicy.  But with the callbacks, you can control how/when
migration occurs.

tl;dr: many of mm/'s services are actually predicated on migration
support, so you have to manage that somehow.

> 
> Thinking of some "terribly slow" or "terribly fast" memory that we don't
> want to involve in automatic memory tiering, being able to just let
> selected workloads consume that memory sounds very helpful.
> 
> 
> (wondering if there could be some way allocations might get migrated out
> of the node, for example, during memory offlining etc, which might also
> not be desirable)
> 

This gets covered in the NP_OPS_MIGRATION patch.

I'm not sure the NP_OPS_* pattern is what we actually want, it's just
what I came up with to make it clear what's being enabled.

Basically without NP_OPS_MIGRATION, this memory is completely
non-migratable.  The driver managing it therefore needs to control the
lifetime, and if hotplug is requested - kill anyone using it (which by
definition should not be the kernel) and either release the pages or take
them so they can be released while hotplug is spinning.

> I am not sure if __GFP_PRIVATE etc is really required for that. But some
> mechanism to make that work seems extremely helpful.
> 
> Because ...
> 
> > /* And now I can use mempolicy with my memory */
> > buf = mmap(...);
> > mbind(buf, len, mode, private_node, ...);
> > buf[0] = 0xdeadbeef;  /* Faults onto private node */
> 
> ... just being able to consume that memory through mbind() and having
> guarantees sounds extremely helpful.
> 

Yes! :]

> > 
> >   - Filter allocation requests on __GFP_PRIVATE
> >     	numa_zone_allowed() excludes them otherwise. 
> 
> I think we discussed that in the past, but why can't we find a way that
> only people requesting __GFP_THISNODE could allocate that memory, for
> example? I guess we'd have to remove it from all "default NUMA bitmaps"
> somehow.
>

I experimented with this.  There were two concerns:

1) as you note, removing it from the default bitmaps, which is actually
   hard.  You can't remove it from the possible-node bitmap, so that
   just seemed non-tractable.

2) __GFP_THISNODE actually means (among other things) "don't fallback".
   And, in fact, there are some hotplug-time allocations that occur in
   SLAB (pglist_data) that target the private node that *must* fallback
   to successfully allocate for successful kernel operation.

So separating PRIVATE from THISNODE and allowing some use of fallback
mechanics resolves some problems here.

I think #2 is a solvable problem, but I don't think #1 can be addressed.
I need to investigate the slab interactions a little more.

> >   - Use standard struct page / folio.  No ZONE_DEVICE, no pgmap,
> >     no struct page metadata limitations.
> 
> Good.

Note: I've actually since explored merging this with pgmap, and
rebranding it as node-scope pgmap.

In that sense, you could think of this as NODE_DEVICE instead of
NODE_PRIVATE - but maybe I'm inviting too much baggage :]

> > 
> > Re-use of ZONE_DEVICE Hooks
> > ===
> 
> I think all of that might not be required for the simplistic use case I
> mentioned above (fast/slow memory only to be consumed by selected user
> space that opts in through mbind() and friends).
> 
> Or are there other use cases for these callbacks
> 

Many `folio_is_zone_device()` hooks result in the operations being
a no-op / failing.  We need all those same hooks.

Some hooks I added - such as migration hooks - are combined with the
zone_device hooks via a helper to demonstrate the pattern is the same
when the memory is opted into migration.

I do not think all of these hooks are required, I would think of this
more as an exploration of the whole space, and then we can throw out
whatever does not have an active use case.

For the compressed ram component I've been designing, the needs are:

- Migration
- Reclaim
- Demotion
- Write Protect (maybe, possibly optional)

But you could argue another user might want the same device to have:
- Migration
- Mempolicy

Where they manage things from userland, rather than via reclaim.

The flexibility is kind of the point :]

> [...]
> > 
> > 
> > Flag-gated behavior (NP_OPS_*) controls:
> > ===
> > 
> > We use OPS flags to denote what mm/ services we want to allow on our
> > private node.   I've plumbed these through so far:
> > 
> >   NP_OPS_MIGRATION       - Node supports migration
> >   NP_OPS_MEMPOLICY       - Node supports mempolicy actions
> >   NP_OPS_DEMOTION        - Node appears in demotion target lists
> >   NP_OPS_PROTECT_WRITE   - Node memory is read-only (wrprotect)
> >   NP_OPS_RECLAIM         - Node supports reclaim
> >   NP_OPS_NUMA_BALANCING  - Node supports numa balancing
> >   NP_OPS_COMPACTION      - Node supports compaction
> >   NP_OPS_LONGTERM_PIN    - Node supports longterm pinning
> >   NP_OPS_OOM_ELIGIBLE	 - (MIGRATION | DEMOTION), node is reachable
> >                            as normal system ram storage, so it should
> > 			   be considered in OOM pressure calculations.
> 
> I have to think about all that, and whether that would be required as a
> first step. I'd assume in a simplistic use case mentioned above we might
> only forbid the memory to be used as a fallback for any oom etc.
> 
> Whether reclaim (e.g., swapout) makes sense is a good question.
> 

I would simply state: "That depends on the memory device"

Which is kind of the point.  The ability to isolate and poke holes in
that isolation explicitly, while using the same mm/ code, creates a new
design space we haven't had before.

---

I think it would be fair to say all of these would not be required for
an MVP interface, and should require a use case to merge.  But the code
is here because I wanted to explore just how far it can go.

In fact, I believe I have gotten to the point where I could add:

  NP_OPS_FALLBACK_NODE  - re-add the node to the fallback list
                          do not require __GFP_PRIVATE for allocation

Which would require all of the other bits to be turned on.
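
As a rough sketch of how such flag dependencies could be expressed: the
flag names below come from the list quoted above, but the bit positions,
the NP_OPS_ALL_SERVICES mask, and the example_* helpers are hypothetical.
It also assumes "all of the other bits" literally means every base
service flag, and that OOM eligibility needs both MIGRATION and DEMOTION.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions; only the names come from the thread. */
#define NP_OPS_MIGRATION      (1u << 0)
#define NP_OPS_MEMPOLICY      (1u << 1)
#define NP_OPS_DEMOTION       (1u << 2)
#define NP_OPS_PROTECT_WRITE  (1u << 3)
#define NP_OPS_RECLAIM        (1u << 4)
#define NP_OPS_NUMA_BALANCING (1u << 5)
#define NP_OPS_COMPACTION     (1u << 6)
#define NP_OPS_LONGTERM_PIN   (1u << 7)
#define NP_OPS_FALLBACK_NODE  (1u << 8)

/* Composite flag, as described in the quoted list. */
#define NP_OPS_OOM_ELIGIBLE   (NP_OPS_MIGRATION | NP_OPS_DEMOTION)

/* Hypothetical mask of every base service bit. */
#define NP_OPS_ALL_SERVICES \
	(NP_OPS_MIGRATION | NP_OPS_MEMPOLICY | NP_OPS_DEMOTION | \
	 NP_OPS_PROTECT_WRITE | NP_OPS_RECLAIM | NP_OPS_NUMA_BALANCING | \
	 NP_OPS_COMPACTION | NP_OPS_LONGTERM_PIN)

static_assert((NP_OPS_FALLBACK_NODE & NP_OPS_ALL_SERVICES) == 0,
	      "FALLBACK_NODE must not overlap the service bits");

/* FALLBACK_NODE is only accepted when every service bit is also set. */
bool example_np_ops_valid(unsigned int flags)
{
	if (!(flags & NP_OPS_FALLBACK_NODE))
		return true;
	return (flags & NP_OPS_ALL_SERVICES) == NP_OPS_ALL_SERVICES;
}

/* Assumption: both MIGRATION and DEMOTION are needed for OOM accounting. */
bool example_np_oom_eligible(unsigned int flags)
{
	return (flags & NP_OPS_OOM_ELIGIBLE) == NP_OPS_OOM_ELIGIBLE;
}
```

A driver registering a node would run something like
example_np_ops_valid() once at registration time, so an inconsistent flag
set is rejected up front rather than at allocation time.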

The result of this is essentially a numa node with otherwise normal
memory, but for which a driver gets callbacks on certain operations
(migration, free, etc).  That ALSO seems useful.

It's... an interesting result of the whole exploration.

~Gregory

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-03-19 14:29 UTC
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
	damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman
In-Reply-To: <049d056b-844b-4480-b90e-bf4c850fc70e@kernel.org>

On Tue, Mar 17, 2026 at 02:05:53PM +0100, David Hildenbrand (Arm) wrote:
> On 2/23/26 17:08, Gregory Price wrote:
> > On Mon, Feb 23, 2026 at 09:54:55AM -0500, Gregory Price wrote:
> >> On Mon, Feb 23, 2026 at 02:07:15PM +0100, David Hildenbrand (Arm) wrote:
> >>>
> >>> I'm concerned about adding more special-casing (similar to what we already
> >>> added for ZONE_DEVICE) all over the place.
> >>>
> >>> Like the whole folio_managed_() stuff in mprotect.c
> >>>
> >>> Having that said, sounds like a reasonable topic to discuss.
> >>>
> >>
> >> Another option would be to add the hook to vma_wants_writenotify()
> >> instead of the page table code - and mask MM_CP_TRY_CHANGE_WRITABLE.
> >>
> > 
> > scratch all this - existing hooks exist for exactly this purpose:
> > 
> > 	can_change_[pte|pmd]_writable()
> > 
> > Surprised I missed this.
> > 
> > I can clean this up to remove it from the page table walks.
> 
> Sorry for the late reply -- sounds like we can handle this cleaner.
> 
> But I am wondering: why is this even required?
> 
> Is it just for "Services that intercept write faults (e.g., for
> promotion tracking) need PTEs to stay read-only"
> 
> But that promotion tracking sounds like some orthogonal work to me. What
> am I missing that this is required in this patch set? (is it just for
> the special compressed RAM bits?)
> 

Yes, this was specific to the compressed ram bits - it allows for a
service to control where/when writes to the device can happen.  In this
case, I've limited writes to just the demotion step. (Although I have
since realized I need to not allow file-backed memory to be demoted).

There may be a better way to do this, but also it may very well be the
case that such a hook is just a bridge too far and isn't wanted. I think
this debate is warranted.

~Gregory

* Re: [PATCH 12/61] quota: Prefer IS_ERR_OR_NULL over manual NULL check
From: Jan Kara @ 2026-03-19 14:13 UTC
  To: Philipp Hahn
  Cc: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
	gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
	linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
	linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
	linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
	linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
	linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
	linux-sctp, linux-security-module, linux-sh, linux-sound,
	linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
	netdev, ntfs3, samba-technical, sched-ext, target-devel,
	tipc-discussion, v9fs, Jan Kara
In-Reply-To: <20260310-b4-is_err_or_null-v1-12-bd63b656022d@avm.de>

On Tue 10-03-26 12:48:38, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
> 
> Change generated with coccinelle.
> 
> To: Jan Kara <jack@suse.com>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>

Thanks for the patch but frankly I find the original variant clearer wrt
what is going on. So I prefer to keep the code as is.

								Honza
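
For reference, the two forms of the check are equivalent for every
pointer value.  The sketch below shows that with userspace stand-ins for
the kernel helpers; the example_* names are hypothetical, and the
definitions only mimic the error-pointer convention (the top 4095
addresses encode -errno values) rather than reproduce the kernel header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer, like ERR_PTR(). */
static inline void *example_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True for pointers in the top EXAMPLE_MAX_ERRNO addresses, like IS_ERR(). */
static inline bool example_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-EXAMPLE_MAX_ERRNO;
}

/* True for NULL or an encoded error, like IS_ERR_OR_NULL(). */
static inline bool example_is_err_or_null(const void *ptr)
{
	return !ptr || example_is_err(ptr);
}

/* Original form: explicit NULL test plus error test. */
bool example_should_put_old(const void *pathp)
{
	return pathp && !example_is_err(pathp);
}

/* Proposed form: one combined helper. */
bool example_should_put_new(const void *pathp)
{
	return !example_is_err_or_null(pathp);
}
```

So the choice between them is about readability at the call site, not
behaviour.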

> ---
>  fs/quota/quota.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 33bacd70758007129e0375bab44d7431195ec441..2e09fc247d0cf45b9e83a4f8a0be7ea694c8c2a1 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -965,7 +965,7 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
>  	else
>  		drop_super_exclusive(sb);
>  out:
> -	if (pathp && !IS_ERR(pathp))
> +	if (!IS_ERR_OR_NULL(pathp))
>  		path_put(pathp);
>  	return ret;
>  }
> 
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4 0/5] mm: zone lock tracepoint instrumentation
From: Dmitry Ilvokhin @ 2026-03-19 13:22 UTC
  To: Steven Rostedt
  Cc: Matthew Wilcox, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Masami Hiramatsu, Mathieu Desnoyers, Rafael J. Wysocki,
	Pavel Machek, Len Brown, Brendan Jackman, Johannes Weiner, Zi Yan,
	Oscar Salvador, Qi Zheng, Shakeel Butt, linux-kernel, linux-mm,
	linux-trace-kernel, linux-pm
In-Reply-To: <abhAoF5EpiPigsx7@shell.ilvokhin.com>

On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:

[...]

> A possible generic solution is a trace_contended_release() for spin
> locks, for example:
> 
>     if (trace_contended_release_enabled() &&
>         atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
>         trace_contended_release(lock);
> 
> This might work on x86, but could increase code size and regress
> performance on arches where spin_unlock() is inlined, such as arm64
> under !PREEMPTION.

I took a stab at this idea and submitted an RFC [1].

The implementation builds on Matthew's earlier observation that
_raw_spin_unlock() is not inlined in most configurations. In those
cases, when the tracepoint is disabled, this adds a single NOP on the
fast path, with the conditional check staying out of line. The measured
text size increase in this configuration is +983 bytes.

For configurations where _raw_spin_unlock() is inlined, the
instrumentation does increase code size more noticeably
(+71 KB in my measurements), since the check and the out-of-line call
are replicated at each call site.

This provides a generic release-side signal for contended locks,
allowing correlation of lock holders with waiters and measurement of
contended hold times.

This RFC addresses the same visibility gap without introducing per-lock
instrumentation.

If this tradeoff is acceptable, this could be a generic alternative to
lock-specific tracepoints.
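
For illustration only, the release-side check quoted above can be
modelled in user space roughly as follows. This is a minimal sketch, not
the RFC's code: the lock layout (low byte "locked", upper bits waiters),
the sketch_* names and the plain bool standing in for the tracepoint's
enabled check are all assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low byte is the "locked" byte, upper bits encode waiters. */
#define SKETCH_LOCKED_MASK 0xffu

struct sketch_lock {
	_Atomic uint32_t val;
};

/* Hypothetical stand-ins for the tracepoint and its enabled check. */
static bool sketch_trace_enabled;
static unsigned int sketch_trace_hits;

static void sketch_trace_contended_release(struct sketch_lock *lock)
{
	(void)lock;
	sketch_trace_hits++;
}

/* Release the lock, reporting the release only if someone is waiting. */
static void sketch_unlock(struct sketch_lock *lock)
{
	if (sketch_trace_enabled &&
	    (atomic_load_explicit(&lock->val, memory_order_relaxed) &
	     ~SKETCH_LOCKED_MASK))
		sketch_trace_contended_release(lock);

	/* Clear only the locked byte; waiter bits stay for the next owner. */
	atomic_fetch_and_explicit(&lock->val, ~SKETCH_LOCKED_MASK,
				  memory_order_release);
}
```

An uncontended release (only the locked byte set) skips the hook, so the
extra cost with tracing enabled is limited to the load and the mask test.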

[1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com 

^ permalink raw reply

* Re: [PATCH v2] blk-mq: add tracepoint block_rq_tag_wait
From: Laurence Oberman @ 2026-03-19 12:02 UTC (permalink / raw)
  To: Johannes Thumshirn, Aaron Tomlin, axboe@kernel.dk,
	rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com
  Cc: kch@nvidia.com, bvanassche@acm.org, dlemoal@kernel.org,
	ritesh.list@gmail.com, neelx@suse.com, sean@ashe.io,
	mproche@gmail.com, chjohnst@gmail.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
In-Reply-To: <63dfc26d-7d9d-4530-bf85-4f07fbcf240a@wdc.com>

On Thu, 2026-03-19 at 07:32 +0000, Johannes Thumshirn wrote:
> Looks good,
> 
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Looks good now.

Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by:   Laurence Oberman <loberman@redhat.com>


^ permalink raw reply

* [PATCH v11 5/5] ring-buffer: Add persistent ring buffer selftest
From: Masami Hiramatsu (Google) @ 2026-03-19  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177391152793.193994.8986943289250629418.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a self-destructive test for the persistent ring buffer. This
invalidates some sub-buffer pages in the persistent ring buffer
when the kernel panics, and checks after reboot whether the number
of detected invalid pages and the total entry_bytes match the
recorded values.

This ensures the kernel correctly recovers a partially corrupted
persistent ring buffer at boot.

The test only runs on the persistent ring buffer named
"ptracingtest", and the user has to fill it with events before the
kernel panics.

To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
and set up the kernel cmdline as follows:

 reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
 panic=1

Then run the following commands after the first boot:

 cd /sys/kernel/tracing/instances/ptracingtest
 echo 1 > tracing_on
 echo 1 > events/enable
 sleep 3
 echo c > /proc/sysrq-trigger

After the panic message, the kernel will reboot and run the
verification on the persistent ring buffer, e.g.

 Ring buffer meta [2] invalid buffer page detected
 Ring buffer meta [2] is from previous boot! (318 pages discarded)
 Ring buffer testing [2] invalid pages: PASSED (318/318)
 Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v10:
  - Add entry_bytes test.
  - Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
 Changes in v9:
  - Test also reader pages.
---
 0 files changed
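
Note for readers: the injection step described in the commit message
can be modelled in user space roughly as below. This is a simplified
sketch under assumptions, not the patch itself: struct sketch_page,
struct sketch_result and sketch_inject_invalid() are hypothetical names,
and the page is reduced to its commit counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sub-buffer header; only the commit field. */
struct sketch_page {
	uint32_t commit;
};

struct sketch_result {
	int nr_invalid;		/* pages deliberately corrupted */
	uint32_t entry_bytes;	/* bytes left on the untouched pages */
};

static struct sketch_result
sketch_inject_invalid(struct sketch_page *pages, size_t nr, uint32_t subbuf_size)
{
	struct sketch_result res = { 0, 0 };

	for (size_t i = 0; i < nr; i++) {
		/* Skip unused pages. */
		if (!pages[i].commit)
			continue;

		if (!(i & 0x1)) {
			/* Push commit past subbuf_size so validation rejects it. */
			pages[i].commit += subbuf_size + 1;
			res.nr_invalid++;
		} else {
			/* Odd pages stay valid; remember how much data they hold. */
			res.entry_bytes += pages[i].commit;
		}
	}
	return res;
}
```

After reboot, the validator's discarded-page count and surviving
entry_bytes are compared against these two recorded values.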

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
 
 enum ring_buffer_flags {
 	RB_FL_OVERWRITE		= 1 << 0,
+	RB_FL_TESTING		= 1 << 1,
 };
 
 #ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..094d5511bb17 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,21 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
 	  Only say Y if you understand what this does, and you
 	  still want it enabled. Otherwise say N
 
+config RING_BUFFER_PERSISTENT_SELFTEST
+	bool "Enable persistent ring buffer selftest"
+	depends on RING_BUFFER
+	help
+	  Run a selftest on the persistent ring buffer named
+	  "ptracingtest" (and its backup) by invalidating ring buffer
+	  pages when the kernel panics.
+	  Note that the user has to enable events on the persistent ring
+	  buffer manually to fill up the ring buffer before rebooting.
+	  Since this invalidates the data in the target ring buffer,
+	  the "ptracingtest" persistent ring buffer must not be used for
+	  actual tracing, but only for testing.
+
+	  If unsure, say N
+
 config MMIOTRACE_TEST
 	tristate "Test module for mmiotrace"
 	depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b436d2982c5e..cfd895d6b56e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
 	unsigned long	commit_buffer;
 	__u32		subbuf_size;
 	__u32		nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
+	__u32		nr_invalid;
+	__u32		entry_bytes;
+#endif
 	int		buffers[];
 };
 
@@ -2077,6 +2081,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 	pr_info("Ring buffer meta [%d] is from previous boot! (%d pages discarded)\n",
 		cpu_buffer->cpu, discarded);
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
+	if (meta->nr_invalid)
+		pr_info("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+			cpu_buffer->cpu,
+			(discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+			discarded, meta->nr_invalid);
+	if (meta->entry_bytes)
+		pr_info("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+			cpu_buffer->cpu,
+			(entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+			(long)entry_bytes, (long)meta->entry_bytes);
+#endif
 	return;
 
  invalid:
@@ -2557,12 +2574,64 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 	kfree(cpu_buffer);
 }
 
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_SELFTEST
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_cpu_meta *meta;
+	struct buffer_data_page *dpage;
+	u32 entry_bytes = 0;
+	unsigned long ptr;
+	int subbuf_size;
+	int invalid = 0;
+	int cpu;
+	int i;
+
+	if (!(buffer->flags & RB_FL_TESTING))
+		return;
+
+	guard(preempt)();
+	cpu = smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	meta = cpu_buffer->ring_meta;
+	ptr = (unsigned long)rb_subbufs_from_meta(meta);
+	subbuf_size = meta->subbuf_size;
+
+	for (i = 0; i < meta->nr_subbufs; i++) {
+		int idx = meta->buffers[i];
+
+		dpage = (void *)(ptr + idx * subbuf_size);
+		/* Skip unused pages */
+		if (!local_read(&dpage->commit))
+			continue;
+
+		/* Invalidate even pages. */
+		if (!(i & 0x1)) {
+			local_add(subbuf_size + 1, &dpage->commit);
+			invalid++;
+		} else {
+			/* Count total commit bytes. */
+			entry_bytes += local_read(&dpage->commit);
+		}
+	}
+
+	pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+		invalid, cpu, (long)entry_bytes);
+	meta->nr_invalid = invalid;
+	meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_SELFTEST */
+#define rb_test_inject_invalid_pages(buffer)	do { } while (0)
+#endif
+
 /* Stop recording on a persistent buffer and flush cache if needed. */
 static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
 {
 	struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
 
 	ring_buffer_record_off(buffer);
+	rb_test_inject_invalid_pages(buffer);
 	arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
 	return NOTIFY_DONE;
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5e1129b011cb..dc23fa63c789 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9349,6 +9349,8 @@ static void setup_trace_scratch(struct trace_array *tr,
 	memset(tscratch, 0, size);
 }
 
+#define TRACE_TEST_PTRACING_NAME	"ptracingtest"
+
 static int
 allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
 {
@@ -9361,6 +9363,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
 	buf->tr = tr;
 
 	if (tr->range_addr_start && tr->range_addr_size) {
+		if (!strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+			rb_flags |= RB_FL_TESTING;
 		/* Add scratch buffer to handle 128 modules */
 		buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
 						      tr->range_addr_start,


^ permalink raw reply related

* [PATCH v11 4/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-19  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177391152793.193994.8986943289250629418.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewind. The skipped buffers are cleared.

To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v11:
   - Reset timestamp when the buffer is invalid.
   - When rewinding, skip subbuf page if timestamp is wrong and
     check timestamp after validating buffer data page.
 Changes in v10:
   - Newly added.
---
 0 files changed
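
Note for readers: the rewind policy described above (skip invalid or
out-of-order pages, stop only at an unused page) can be sketched in
user space as follows. This is a simplified model under assumptions:
the sketch_* names are hypothetical, pages are given in rewind order,
and the `events` field stands in for the result of validating a page.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_page {
	uint64_t time_stamp;
	uint32_t commit;
	int events;	/* hypothetical: < 0 means the page content is corrupted */
};

struct sketch_stats {
	int rewound;
	int discarded;
	long entries;
};

/* Walk pages[0..nr) in rewind order; ts is the head page's timestamp. */
static struct sketch_stats
sketch_rewind(struct sketch_page *pages, size_t nr, uint64_t ts)
{
	struct sketch_stats st = { 0, 0, 0 };

	for (size_t i = 0; i < nr; i++) {
		struct sketch_page *p = &pages[i];
		int ret;

		/* Stop at the first unused page (no timestamp, no commit). */
		if (!p->time_stamp && !p->commit)
			break;

		ret = p->events;
		/* A page newer than the previous valid page is also skipped. */
		if (ret >= 0 && ts < p->time_stamp) {
			p->time_stamp = ts;
			ret = -1;
		}
		if (ret < 0) {
			p->commit = 0;	/* clear the skipped page */
			st.discarded++;
		} else {
			st.entries += ret;
			ts = p->time_stamp;
		}
		st.rewound++;
	}
	return st;
}
```

Because a cleared page keeps a non-zero timestamp, it is not mistaken
for an unused page on a later pass, which is why the reset path also
zeroes time_stamp.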

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 67826021867b..b436d2982c5e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
 static void rb_init_page(struct buffer_data_page *bpage)
 {
 	local_set(&bpage->commit, 0);
+	bpage->time_stamp = 0;
 }
 
 static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
 			      struct ring_buffer_cpu_meta *meta)
 {
+	struct buffer_data_page *dpage = bpage->page;
 	unsigned long long ts;
 	unsigned long tail;
 	u64 delta;
+	int ret = -1;
 
 	/*
 	 * When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
 	 * subbuf_size is considered invalid.
 	 */
 	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
-	if (tail > meta->subbuf_size)
-		return -1;
-	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+	if (tail <= meta->subbuf_size)
+		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+	if (ret < 0) {
+		local_set(&bpage->entries, 0);
+		local_set(&bpage->page->commit, 0);
+	} else {
+		local_set(&bpage->entries, ret);
+	}
+
+	return ret;
 }
 
 /* If the meta data has been validated, now validate the events */
@@ -1915,18 +1926,14 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_head = head_page = cpu_buffer->head_page;
 
 	/* Do the reader page first */
-	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+	ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta);
 	if (ret < 0) {
 		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
 			cpu_buffer->cpu);
 		discarded++;
-		/* Instead of discard whole ring buffer, discard only this sub-buffer. */
-		local_set(&cpu_buffer->reader_page->entries, 0);
-		local_set(&cpu_buffer->reader_page->page->commit, 0);
 	} else {
 		entries += ret;
 		entry_bytes += rb_page_size(cpu_buffer->reader_page);
-		local_set(&cpu_buffer->reader_page->entries, ret);
 	}
 
 	ts = head_page->page->time_stamp;
@@ -1945,26 +1952,33 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->tail_page)
 			break;
 
-		/* Ensure the page has older data than head. */
-		if (ts < head_page->page->time_stamp)
+		/* Rewind until unused page (no timestamp, no commit). */
+		if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
 			break;
 
-		ts = head_page->page->time_stamp;
-		/* Ensure the page has correct timestamp and some data. */
-		if (!ts || rb_page_commit(head_page) == 0)
-			break;
-
-		/* Stop rewind if the page is invalid. */
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
-		if (ret < 0)
-			break;
-
-		/* Recover the number of entries and update stats. */
-		local_set(&head_page->entries, ret);
-		if (ret)
-			local_inc(&cpu_buffer->pages_touched);
-		entries += ret;
-		entry_bytes += rb_page_commit(head_page);
+		/*
+		 * Skip if the page is invalid, or its timestamp is newer than the
+		 * previous valid page.
+		 */
+		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
+		if (ret >= 0 && ts < head_page->page->time_stamp) {
+			local_set(&head_page->entries, 0);
+			local_set(&head_page->page->commit, 0);
+			head_page->page->time_stamp = ts;
+			ret = -1;
+		}
+		if (ret < 0) {
+			if (!discarded)
+				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+					cpu_buffer->cpu);
+			discarded++;
+		} else {
+			entries += ret;
+			entry_bytes += rb_page_size(head_page);
+			if (ret > 0)
+				local_inc(&cpu_buffer->pages_touched);
+			ts = head_page->page->time_stamp;
+		}
 	}
 	if (i)
 		pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2034,15 +2048,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->reader_page)
 			continue;
 
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
 		if (ret < 0) {
 			if (!discarded)
 				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
 					cpu_buffer->cpu);
 			discarded++;
-			/* Instead of discard whole ring buffer, discard only this sub-buffer. */
-			local_set(&head_page->entries, 0);
-			local_set(&head_page->page->commit, 0);
 		} else {
 			/* If the buffer has content, update pages_touched */
 			if (ret)
@@ -2050,7 +2061,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 			entries += ret;
 			entry_bytes += rb_page_size(head_page);
-			local_set(&head_page->entries, ret);
 		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
@@ -2081,7 +2091,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	/* Reset all the subbuffers */
 	for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
 		local_set(&head_page->entries, 0);
-		local_set(&head_page->page->commit, 0);
+		rb_init_page(head_page->page);
 	}
 }
 


^ permalink raw reply related

* [PATCH v11 3/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-19  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177391152793.193994.8986943289250629418.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).

If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
  Changes in v11:
  - Fix a typo.
  Changes in v9:
  - Add meta->subbuf_size check.
  - Fix a typo.
  - Handle invalid reader_page case.
  Changes in v8:
  - Add comment in rb_validate_buffer().
  - Clear the RB_MISSED_* flags in rb_validate_buffer() instead of
    skipping subbuf.
  - Remove unused subbuf local variable from rb_cpu_meta_valid().
  Changes in v7:
  - Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
  - Remove checking subbuffer data when validating metadata, because it should be done
    later.
  - Do not mark the discarded sub buffer page but just reset it.
  Changes in v6:
  - Show invalid page detection message once per CPU.
  Changes in v5:
  - Instead of showing errors for each page, just show the number
    of discarded pages at last.
  Changes in v3:
  - Record missed data event on commit.
---
 0 files changed
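
Note for readers: the commit-value check added to rb_validate_buffer()
can be illustrated in isolation as below. This is a minimal sketch, not
the kernel code: the SKETCH_MISSED_* bit positions are assumed to mirror
the RB_MISSED_* flags, and sketch_commit_size() is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed bit positions mirroring the kernel's RB_MISSED_* flags. */
#define SKETCH_MISSED_EVENTS	(1u << 31)
#define SKETCH_MISSED_STORED	(1u << 30)
#define SKETCH_MISSED_MASK	(SKETCH_MISSED_EVENTS | SKETCH_MISSED_STORED)

/* Return the usable data size, or -1 if the commit value is invalid. */
static long sketch_commit_size(uint32_t commit, uint32_t subbuf_size)
{
	/* Flag bits may be left over from a read; they are not part of the size. */
	uint32_t tail = commit & ~SKETCH_MISSED_MASK;

	if (tail > subbuf_size)
		return -1;
	return (long)tail;
}
```

Masking first matters: without it, a page that was merely read with
missed events flagged would look oversized and be discarded.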

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3d2acaf75e79..67826021867b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
 	return local_read(&bpage->page->commit);
 }
 
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
 static void free_buffer_page(struct buffer_page *bpage)
 {
 	/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 			      unsigned long *subbuf_mask)
 {
 	int subbuf_size = PAGE_SIZE;
-	struct buffer_data_page *subbuf;
 	unsigned long buffers_start;
 	unsigned long buffers_end;
 	int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 	if (!subbuf_mask)
 		return false;
 
+	if (meta->subbuf_size != PAGE_SIZE) {
+		pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+		return false;
+	}
+
 	buffers_start = meta->first_buffer;
 	buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
 
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 		return false;
 	}
 
-	subbuf = rb_subbufs_from_meta(meta);
-
 	bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
 
-	/* Is the meta buffers and the subbufs themselves have correct data? */
+	/*
+	 * Ensure the meta::buffers array has correct data. The data in each subbufs
+	 * are checked later in rb_meta_validate_events().
+	 */
 	for (i = 0; i < meta->nr_subbufs; i++) {
 		if (meta->buffers[i] < 0 ||
 		    meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 			return false;
 		}
 
-		if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
-			pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
-			return false;
-		}
-
 		if (test_bit(meta->buffers[i], subbuf_mask)) {
 			pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
 			return false;
 		}
 
 		set_bit(meta->buffers[i], subbuf_mask);
-		subbuf = (void *)subbuf + subbuf_size;
 	}
 
 	return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+			      struct ring_buffer_cpu_meta *meta)
 {
 	unsigned long long ts;
+	unsigned long tail;
 	u64 delta;
-	int tail;
 
-	tail = local_read(&dpage->commit);
+	/*
+	 * When a sub-buffer is recovered from a read, the commit value may
+	 * have RB_MISSED_* bits set, as these bits are reset on reuse.
+	 * Even after clearing these bits, a commit value greater than the
+	 * subbuf_size is considered invalid.
+	 */
+	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+	if (tail > meta->subbuf_size)
+		return -1;
 	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
 }
 
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	struct buffer_page *head_page, *orig_head;
 	unsigned long entry_bytes = 0;
 	unsigned long entries = 0;
+	int discarded = 0;
 	int ret;
 	u64 ts;
 	int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_head = head_page = cpu_buffer->head_page;
 
 	/* Do the reader page first */
-	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
 	if (ret < 0) {
-		pr_info("Ring buffer reader page is invalid\n");
-		goto invalid;
+		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+			cpu_buffer->cpu);
+		discarded++;
+		/* Instead of discard whole ring buffer, discard only this sub-buffer. */
+		local_set(&cpu_buffer->reader_page->entries, 0);
+		local_set(&cpu_buffer->reader_page->page->commit, 0);
+	} else {
+		entries += ret;
+		entry_bytes += rb_page_size(cpu_buffer->reader_page);
+		local_set(&cpu_buffer->reader_page->entries, ret);
 	}
-	entries += ret;
-	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
-	local_set(&cpu_buffer->reader_page->entries, ret);
 
 	ts = head_page->page->time_stamp;
 
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 			break;
 
 		/* Stop rewind if the page is invalid. */
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
 		if (ret < 0)
 			break;
 
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->reader_page)
 			continue;
 
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
 		if (ret < 0) {
-			pr_info("Ring buffer meta [%d] invalid buffer page\n",
-				cpu_buffer->cpu);
-			goto invalid;
-		}
-
-		/* If the buffer has content, update pages_touched */
-		if (ret)
-			local_inc(&cpu_buffer->pages_touched);
-
-		entries += ret;
-		entry_bytes += local_read(&head_page->page->commit);
-		local_set(&head_page->entries, ret);
+			if (!discarded)
+				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+					cpu_buffer->cpu);
+			discarded++;
+			/* Instead of discard whole ring buffer, discard only this sub-buffer. */
+			local_set(&head_page->entries, 0);
+			local_set(&head_page->page->commit, 0);
+		} else {
+			/* If the buffer has content, update pages_touched */
+			if (ret)
+				local_inc(&cpu_buffer->pages_touched);
 
+			entries += ret;
+			entry_bytes += rb_page_size(head_page);
+			local_set(&head_page->entries, ret);
+		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
 	}
@@ -2042,7 +2065,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	local_set(&cpu_buffer->entries, entries);
 	local_set(&cpu_buffer->entries_bytes, entry_bytes);
 
-	pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+	pr_info("Ring buffer meta [%d] is from previous boot! (%d pages discarded)\n",
+		cpu_buffer->cpu, discarded);
 	return;
 
  invalid:
@@ -3329,12 +3353,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 	return NULL;
 }
 
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
-	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
 static __always_inline unsigned
 rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
 {


^ permalink raw reply related

* [PATCH v11 2/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-03-19  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177391152793.193994.8986943289250629418.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.

To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.

Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v11:
   - Do nothing by default, since flush_cache_vmap() does nothing on x86
     but can cause a deadlock on some architectures via on_each_cpu(),
     because other CPUs are already stopped when the panic notifier is
     called.
 Changes in v9:
   - Fix typo of & to &&.
   - Fix typo of "Generic"
 Changes in v6:
   - Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
   - Use flush_cache_vmap() instead of flush_cache_all().
 Changes in v5:
   - Use ring_buffer_record_off() instead of ring_buffer_record_disable().
   - Use flush_cache_all() to ensure flush all cache.
 Changes in v3:
   - update patch description.
---
 0 files changed
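
Note for readers: the "stop recording, then flush through an optional
arch hook" pattern described above can be sketched as follows. This is
an illustrative model under assumptions, not the patch: all sketch_*
names are hypothetical, and the override here only counts calls where a
real architecture would clean its data cache.

```c
#include <stdbool.h>

static int sketch_flushes;

/* Hypothetical arch override; drop it to get the no-op fallback below. */
#define sketch_arch_flush_range(start, end)	(sketch_flushes++)

/* Generic fallback: do nothing unless the architecture provides a hook. */
#ifndef sketch_arch_flush_range
#define sketch_arch_flush_range(start, end)	do { } while (0)
#endif

struct sketch_buffer {
	bool recording;
	unsigned long range_start;
	unsigned long range_end;
};

/* Modelled panic callback: stop writers first, then flush the range. */
static void sketch_panic_flush(struct sketch_buffer *b)
{
	b->recording = false;
	sketch_arch_flush_range(b->range_start, b->range_end);
}
```

Stopping recording before the flush keeps the counters used for
validation from changing after the data has been written back.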

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
 generic-y += asm-offsets.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
 generic-y += extable.h
 generic-y += flat.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 
 generated-y += mach-types.h
 generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end)	dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
 generic-y += qrwlock_types.h
 generic-y += qspinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
 generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
 generic-y += iomap.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
 generic-y += user.h
 generic-y += ioctl.h
 generic-y += mmzone.h
+generic-y += ring_buffer.h
 generic-y += statfs.h
 generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
 generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += spinlock.h
 generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += syscalls.h
 generic-y += tlb.h
 generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
 generic-y += parport.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
 generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += spinlock.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
 generic-y += spinlock.h
 generic-y += qrwlock_types.h
 generic-y += qrwlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
 generic-y += agp.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
 generic-y += agp.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
+generic-y += ring_buffer.h
 generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
 generic-y += qrwlock.h
 generic-y += qrwlock_types.h
 generic-y += qspinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
 generic-y += asm-offsets.h
 generic-y += mcs_spinlock.h
 generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
 generic-y += agp.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
 generic-y += parport.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += ring_buffer.h
 generic-y += runtime-const.h
 generic-y += softirq_stack.h
 generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
 generic-y += fprobe.h
 generic-y += mcs_spinlock.h
 generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
 generic-y += parport.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end)	do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index d6bebb782efc..3d2acaf75e79 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
 #include <linux/ring_buffer_types.h>
 #include <linux/sched/isolation.h>
 #include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
 #include <linux/trace_events.h>
 #include <linux/ring_buffer.h>
 #include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
 #include <linux/oom.h>
 #include <linux/mm.h>
 
+#include <asm/ring_buffer.h>
 #include <asm/local64.h>
 #include <asm/local.h>
 #include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
 
 	unsigned long			range_addr_start;
 	unsigned long			range_addr_end;
+	struct notifier_block		flush_nb;
 
 	struct ring_buffer_meta		*meta;
 
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 	kfree(cpu_buffer);
 }
 
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+	struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+	ring_buffer_record_off(buffer);
+	arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+	return NOTIFY_DONE;
+}
+
 static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 					 int order, unsigned long start,
 					 unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 
 	mutex_init(&buffer->mutex);
 
+	/* Persistent ring buffer needs to flush cache before reboot. */
+	if (start && end) {
+		buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+		atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+	}
+
 	return_ptr(buffer);
 
  fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
 {
 	int cpu;
 
+	if (buffer->range_addr_start && buffer->range_addr_end)
+		atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
 	cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
 
 	irq_work_sync(&buffer->irq_work.work);


^ permalink raw reply related

* [PATCH v11 1/5] ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-19  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177391152793.193994.8986943289250629418.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since the validation loop in rb_meta_validate_events() always updates
cpu_buffer->head_page->entries, the entries fields of the other
sub-buffers are never updated.
Fix this by using head_page to update the entries field, since it is
the cursor in this loop.

Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 0 files changed

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 96e0d80d492b..d6bebb782efc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2024,7 +2024,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 		entries += ret;
 		entry_bytes += local_read(&head_page->page->commit);
-		local_set(&cpu_buffer->head_page->entries, ret);
+		local_set(&head_page->entries, ret);
 
 		if (head_page == cpu_buffer->commit_page)
 			break;

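For illustration only, the effect of the one-line change above can be
reproduced in a small user-space sketch. struct demo_page,
validate_buggy() and validate_fixed() are hypothetical names, not kernel
code; they only mimic a loop that writes through a fixed pointer versus
one that writes through the loop cursor.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sub-buffer page; not the kernel's buffer_page. */
struct demo_page {
	int events;	/* events counted while validating this page */
	int entries;	/* per-page counter that validation should fill in */
};

/* Buggy shape: every iteration writes to the same fixed first page. */
static void validate_buggy(struct demo_page *pages, size_t n)
{
	struct demo_page *first = &pages[0];

	for (size_t i = 0; i < n; i++)
		first->entries = pages[i].events;
}

/* Fixed shape: each iteration writes through the loop cursor. */
static void validate_fixed(struct demo_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct demo_page *cursor = &pages[i];

		cursor->entries = cursor->events;
	}
}
```

With three pages holding 1, 2 and 3 events, the buggy shape leaves only
the first page updated (with the last page's count), while the fixed
shape fills in every page.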

^ permalink raw reply related

* [PATCH v11 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-03-19  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers

Hi,

Here is the 11th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:

https://lore.kernel.org/linux-trace-kernel/177374017536.2358053.12341235939816794384.stgit@mhiramat.tok.corp.google.com/

In this version, I updated [2/5] to do nothing by default, since
flush_cache_vmap() does nothing on x86 but can cause a deadlock on
some architectures via on_each_cpu(), because other CPUs will be
stopped when the panic notifier is called.
I also fixed a typo in [3/5], and changed it to reset the timestamp
when invalidating the whole ring buffer and to skip pages which have
an invalid "timestamp" instead of invalidating all ring buffers.

Thank you,

---

Masami Hiramatsu (Google) (5):
      ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
      ring-buffer: Flush and stop persistent ring buffer on panic
      ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
      ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
      ring-buffer: Add persistent ring buffer selftest


 0 files changed

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v2] blk-mq: add tracepoint block_rq_tag_wait
From: Johannes Thumshirn @ 2026-03-19  7:32 UTC (permalink / raw)
  To: Aaron Tomlin, axboe@kernel.dk, rostedt@goodmis.org,
	mhiramat@kernel.org, mathieu.desnoyers@efficios.com
  Cc: kch@nvidia.com, bvanassche@acm.org, dlemoal@kernel.org,
	ritesh.list@gmail.com, neelx@suse.com, sean@ashe.io,
	mproche@gmail.com, chjohnst@gmail.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
In-Reply-To: <20260319015300.287653-1-atomlin@atomlin.com>

Looks good,

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


^ permalink raw reply

* Re: [PATCH v9 2/4] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu @ 2026-03-19  3:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <831523cf-110c-419d-9b22-e54f93a3bdb5@efficios.com>

On Wed, 18 Mar 2026 11:51:28 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2026-03-18 11:29, Masami Hiramatsu (Google) wrote:
> >>
> >> - AFAIU, you are not trying to evict cache lines after creation
> >>     of a new virtual mapping (which is the documented intent of
> >>     flush_cache_vmap).
> > 
> > Ah, OK. That's a good point!
> > (anyway I will replace it with do { } while (0) in the next version.)
> > 
> >>     
> >> - AFAIU flush_cache_vmap maps to no-code on arm64 (asm-generic), what am
> >>     I missing ? It makes sense to be a no-op because AFAIR arm64 does not
> >>     have to deal with virtually aliasing caches.
> > 
> > Yeah, so my patch also introduces arm64 specific implementation.
> 
> Just make sure to call this something else than "flush_cache_vmap",
> because you don't want to slow down vmap on arm64 which does not
> require to evict and certainly not write back cache lines after
> creation of a new virtual mapping.

OK, I will just leave it as an empty do-while in asm-generic instead of
flush_cache_vmap(). If any architecture finds that the persistent ring
buffer needs to write back caches, it can add its own flush implementation.
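
As a rough user-space sketch of that idea (not the patch itself): the
flush hook defaults to an empty statement, and the callback recovers its
buffer from the embedded notifier. All demo_* names are hypothetical,
and the #ifndef override is an assumption made for this sketch; the
patch selects the default through the per-arch Kbuild generic-y entries
instead.

```c
#include <stddef.h>

/* Default: an architecture that needs no write-back gets an empty statement. */
#ifndef demo_flush_range
#define demo_flush_range(start, end)	do { } while (0)
#endif

/* Minimal user-space equivalent of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical notifier: just a callback pointer. */
struct demo_notifier {
	int (*call)(struct demo_notifier *nb);
};

/* Hypothetical buffer with an embedded notifier, like flush_nb in the patch. */
struct demo_buffer {
	unsigned long range_start;
	unsigned long range_end;
	int recording;
	struct demo_notifier flush_nb;
};

/* Stop recording first, then flush the range (a no-op by default). */
static int demo_flush_cb(struct demo_notifier *nb)
{
	struct demo_buffer *buf =
		demo_container_of(nb, struct demo_buffer, flush_nb);

	buf->recording = 0;
	demo_flush_range(buf->range_start, buf->range_end);
	return 0;
}
```

Because the default expands to nothing, an architecture without an
override pays no cost beyond stopping the recording.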

BTW, do we need dmb(osh)? This runs dcache_clean_pop() after the atomic
operation in ring_buffer_record_off():

	ring_buffer_record_off(buffer);
	arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);

Thank you,

> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v2] blk-mq: add tracepoint block_rq_tag_wait
From: Damien Le Moal @ 2026-03-19  3:31 UTC (permalink / raw)
  To: Aaron Tomlin, axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, ritesh.list, neelx, sean,
	mproche, chjohnst, linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260319015300.287653-1-atomlin@atomlin.com>

On 3/19/26 10:53, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
> 
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
> 
> This patch introduces the block_rq_tag_wait static trace point in the
> tag allocation slow-path. It triggers immediately before the thread
> yields the CPU, exposing the exact hardware context (hctx) that is
> starved, the specific pool experiencing starvation (hardware or software
> scheduler), and the total pool depth.
> 
> This provides storage engineers and performance monitoring agents
> with a zero-configuration, low-overhead mechanism to definitively
> identify shared-tag bottlenecks and tune I/O schedulers or cgroup
> throttling accordingly.
> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
> Changes since v1 [1]:
>  - Improved the description of the trace point (Damien Le Moal)
>  - Removed the redundant "active requests" (Laurence Oberman)
>  - Introduced pool-specific starvation tracking
> 
> [1]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
> 
>  block/blk-mq-tag.c           |  4 ++++
>  include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 33946cdb5716..a6691a4fe7a7 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -13,6 +13,7 @@
>  #include <linux/kmemleak.h>
>  
>  #include <linux/delay.h>
> +#include <trace/events/block.h>
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-sched.h"
> @@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  		if (tag != BLK_MQ_NO_TAG)
>  			break;
>  
> +		trace_block_rq_tag_wait(data->q, data->hctx,
> +					!!(data->rq_flags & RQF_SCHED_TAGS));

I do not think that the "!!" is needed here.

Other than this, this looks OK to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

> +
>  		bt_prev = bt;
>  		io_schedule();
>  
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 6aa79e2d799c..f7708d0d7a0c 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
>  		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
>  );
>  
> +/**
> + * block_rq_tag_wait - triggered when a request is starved of a tag
> + * @q: request queue of the target device
> + * @hctx: hardware context of the request experiencing starvation
> + * @is_sched_tag: indicates whether the starved pool is the software scheduler
> + *
> + * Called immediately before the submitting context is forced to block due
> + * to the exhaustion of available tags (i.e., physical hardware driver tags
> + * or software scheduler tags). This trace point indicates that the context
> + * will be placed into an uninterruptible state via io_schedule() until an
> + * active request completes and relinquishes its assigned tag.
> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
> +
> +	TP_ARGS(q, hctx, is_sched_tag),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( u32,		hctx_id			)
> +		__field( u32,		nr_tags			)
> +		__field( bool,		is_sched_tag		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= disk_devt(q->disk);
> +		__entry->hctx_id	= hctx->queue_num;
> +		__entry->is_sched_tag	= is_sched_tag;
> +
> +		if (__entry->is_sched_tag)
> +			__entry->nr_tags = hctx->sched_tags->nr_tags;
> +		else
> +			__entry->nr_tags = hctx->tags->nr_tags;
> +	),
> +
> +	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->hctx_id,
> +		  __entry->is_sched_tag ? "scheduler" : "hardware",
> +		  __entry->nr_tags)
> +);
> +
>  /**
>   * block_rq_insert - insert block operation request into queue
>   * @rq: block IO operation request


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply

* Re: [PATCH v2] blk-mq: add tracepoint block_rq_tag_wait
From: Chaitanya Kulkarni @ 2026-03-19  3:18 UTC (permalink / raw)
  To: Aaron Tomlin, axboe@kernel.dk, rostedt@goodmis.org,
	mhiramat@kernel.org, mathieu.desnoyers@efficios.com
  Cc: johannes.thumshirn@wdc.com, Chaitanya Kulkarni,
	bvanassche@acm.org, dlemoal@kernel.org, ritesh.list@gmail.com,
	neelx@suse.com, sean@ashe.io, mproche@gmail.com,
	chjohnst@gmail.com, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
In-Reply-To: <20260319015300.287653-1-atomlin@atomlin.com>

On 3/18/26 18:53, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
>
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
>
> This patch introduces the block_rq_tag_wait static trace point in the
> tag allocation slow-path. It triggers immediately before the thread
> yields the CPU, exposing the exact hardware context (hctx) that is
> starved, the specific pool experiencing starvation (hardware or software
> scheduler), and the total pool depth.
>
> This provides storage engineers and performance monitoring agents
> with a zero-configuration, low-overhead mechanism to definitively
> identify shared-tag bottlenecks and tune I/O schedulers or cgroup
> throttling accordingly.
>
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
> Changes since v1 [1]:
>   - Improved the description of the trace point (Damien Le Moal)
>   - Removed the redundant "active requests" (Laurence Oberman)
>   - Introduced pool-specific starvation tracking
>
> [1]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/


LGTM.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck



^ permalink raw reply

* [PATCH v2] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-03-19  1:53 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, dlemoal, ritesh.list, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait static trace point in the
tag allocation slow-path. It triggers immediately before the thread
yields the CPU, exposing the exact hardware context (hctx) that is
starved, the specific pool experiencing starvation (hardware or software
scheduler), and the total pool depth.

This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Changes since v1 [1]:
 - Improved the description of the trace point (Damien Le Moal)
 - Removed the redundant "active requests" (Laurence Oberman)
 - Introduced pool-specific starvation tracking

[1]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/

 block/blk-mq-tag.c           |  4 ++++
 include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..a6691a4fe7a7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					!!(data->rq_flags & RQF_SCHED_TAGS));
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..f7708d0d7a0c 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+	TP_ARGS(q, hctx, is_sched_tag),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( bool,		is_sched_tag		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= disk_devt(q->disk);
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+
+		if (__entry->is_sched_tag)
+			__entry->nr_tags = hctx->sched_tags->nr_tags;
+		else
+			__entry->nr_tags = hctx->tags->nr_tags;
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-03-19  0:22 UTC (permalink / raw)
  To: Damien Le Moal, loberman
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, johannes.thumshirn,
	kch, bvanassche, ritesh.list, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <lrnjp7wcrrfita36onlxqihep44sgr4il57ccy4irf2mortdqi@46w7sl52whmr>

On Wed, Mar 18, 2026 at 09:21:23AM -0400, Aaron Tomlin wrote:
> On Wed, Mar 18, 2026 at 08:38:20AM +0900, Damien Le Moal wrote:
> > Looks OK to me, but I have some suggestions below.

Hi Damien, Laurence,

Upon reviewing the source code once more, it becomes apparent that tracking
"active requests" within this specific trace point is essentially redundant.
If a thread is compelled to invoke io_schedule(), it is mathematically
certain that the number of active requests perfectly equals the total
number of tags.

Moreover, the trace point would almost always print active=0 in the
following scenarios:

    1.  "mq-deadline" Scheduler Starvation: The thread sleeps waiting for a
        scheduler tag. Because the request has not been dispatched to
        hardware yet, blk_mq_inc_active_requests() was never called.
        hctx->nr_active is 0.

    2.  NVMe Hardware Starvation, "none" scheduler: The thread sleeps
        waiting for a hardware tag. Because NVMe drives do not share tags,
        blk_mq_inc_active_requests() instantly aborts to save CPU cycles.
        hctx->nr_active remains 0.

    3.  RAID Hardware Starvation, "none" scheduler: The thread sleeps
        waiting for a shared hardware tag. Because it is HCTX_SHARED, the
        kernel tracks the active requests in
        hctx->queue->nr_active_requests_shared_tags. The local
        hctx->nr_active counter is completely bypassed and remains 0.

Rather than attempting to print the active count, the trace point should be
modified to indicate exactly which pool experienced starvation: the
hardware pool or the software scheduler pool.
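
A minimal sketch of that selection, assuming simplified stand-ins for
the real structures (demo_tags and demo_hctx are hypothetical, as are
both helpers):

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-ins for the blk-mq tag structures. */
struct demo_tags {
	unsigned int nr_tags;
};

struct demo_hctx {
	struct demo_tags *tags;		/* hardware (driver) tag pool */
	struct demo_tags *sched_tags;	/* software scheduler tag pool */
};

/* Depth of the pool the waiter is starved on. */
static unsigned int starved_pool_depth(const struct demo_hctx *hctx,
				       bool is_sched_tag)
{
	return is_sched_tag ? hctx->sched_tags->nr_tags : hctx->tags->nr_tags;
}

/* Human-readable pool name for the trace output. */
static const char *starved_pool_name(bool is_sched_tag)
{
	return is_sched_tag ? "scheduler" : "hardware";
}
```

This reports the pool and its depth without reading any per-hctx active
counter, which sidesteps the three cases listed above.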

I will submit a follow-up patch.


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* Re: [PATCH v2 1/2] kthread: remove kthread_exit()
From: Steven Rostedt @ 2026-03-18 23:12 UTC (permalink / raw)
  To: David Laight
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, linux-modules,
	linux-nfs, bpf, kunit-dev, linux-doc, linux-trace-kernel, netfs,
	io-uring, audit, rcu, kvm, virtualization, netdev, linux-mm,
	linux-security-module, Christian Loehle, linux-fsdevel
In-Reply-To: <20260311104736.51b53405@pumpkin>

On Wed, 11 Mar 2026 10:47:36 +0000
David Laight <david.laight.linux@gmail.com> wrote:

> > -#define module_put_and_kthread_exit(code) kthread_exit(code)
> > +#define module_put_and_kthread_exit(code) do_exit(code)  
> 
> I'm intrigued...
> How does that actually know to do the module_put()?
> (I know it does one - otherwise my driver wouldn't unload.)

It's in the !CONFIG_MODULES section. No module_put() necessary. Only the
kthread_exit (do_exit) is needed.

-- Steve

^ permalink raw reply

* [PATCH] tracing: Fix trace_marker copy link list updates
From: Steven Rostedt @ 2026-03-18 22:55 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers, Sasha Levin

From: Steven Rostedt <rostedt@goodmis.org>

When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instance's buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations use an RCU protected list traversal.

When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.

The issue is that after the flags are cleared, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that it's removed from the list.

But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.

Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.

Also use the flag for checking the state in update_marker_trace() instead
of checking whether the list is empty.

Cc: stable@vger.kernel.org
Fixes: 7b382efd5e8a ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb4a62f4b953..a626211ceb9a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -555,7 +555,7 @@ static bool update_marker_trace(struct trace_array *tr, int enabled)
 	lockdep_assert_held(&event_mutex);
 
 	if (enabled) {
-		if (!list_empty(&tr->marker_list))
+		if (tr->trace_flags & TRACE_ITER(COPY_MARKER))
 			return false;
 
 		list_add_rcu(&tr->marker_list, &marker_copies);
@@ -563,10 +563,10 @@ static bool update_marker_trace(struct trace_array *tr, int enabled)
 		return true;
 	}
 
-	if (list_empty(&tr->marker_list))
+	if (!(tr->trace_flags & TRACE_ITER(COPY_MARKER)))
 		return false;
 
-	list_del_init(&tr->marker_list);
+	list_del_rcu(&tr->marker_list);
 	tr->trace_flags &= ~TRACE_ITER(COPY_MARKER);
 	return true;
 }
@@ -9761,18 +9761,19 @@ static int __remove_instance(struct trace_array *tr)
 
 	list_del(&tr->list);
 
-	/* Disable all the flags that were enabled coming in */
-	for (i = 0; i < TRACE_FLAGS_MAX_SIZE; i++) {
-		if ((1ULL << i) & ZEROED_TRACE_FLAGS)
-			set_tracer_flag(tr, 1ULL << i, 0);
-	}
-
 	if (printk_trace == tr)
 		update_printk_trace(&global_trace);
 
+	/* Must be done before disabling all the flags */
 	if (update_marker_trace(tr, 0))
 		synchronize_rcu();
 
+	/* Disable all the flags that were enabled coming in */
+	for (i = 0; i < TRACE_FLAGS_MAX_SIZE; i++) {
+		if ((1ULL << i) & ZEROED_TRACE_FLAGS)
+			set_tracer_flag(tr, 1ULL << i, 0);
+	}
+
 	tracing_set_nop(tr);
 	clear_ftrace_function_probes(tr);
 	event_trace_del_tracer(tr);
-- 
2.51.0
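
For illustration only, the ordering problem described in the change log
can be modelled in user space. Everything named demo_* is hypothetical:
demo_synchronize() merely counts calls in place of synchronize_rcu(),
and demo_clear_all_flags() stands in for the loop that clears the flags
without synchronizing.

```c
#include <stdbool.h>

#define DEMO_COPY_MARKER 0x1u

struct demo_instance {
	unsigned int flags;
	bool on_list;	/* stands in for membership in the RCU-protected list */
};

static int demo_sync_calls;	/* how often the stand-in synchronize ran */

static void demo_synchronize(void)
{
	demo_sync_calls++;
}

/* Returns true only when the state actually changed, keyed on the flag. */
static bool demo_update_marker(struct demo_instance *inst, bool enabled)
{
	if (enabled) {
		if (inst->flags & DEMO_COPY_MARKER)
			return false;
		inst->on_list = true;
		inst->flags |= DEMO_COPY_MARKER;
		return true;
	}
	if (!(inst->flags & DEMO_COPY_MARKER))
		return false;
	inst->on_list = false;
	inst->flags &= ~DEMO_COPY_MARKER;
	return true;
}

/* Bulk clearing updates the state but never synchronizes. */
static void demo_clear_all_flags(struct demo_instance *inst)
{
	demo_update_marker(inst, false);
}

/* Old order: the direct call sees no change, so the sync is skipped. */
static void demo_remove_old_order(struct demo_instance *inst)
{
	demo_clear_all_flags(inst);
	if (demo_update_marker(inst, false))
		demo_synchronize();
}

/* New order: the direct call sees the change and synchronizes. */
static void demo_remove_new_order(struct demo_instance *inst)
{
	if (demo_update_marker(inst, false))
		demo_synchronize();
	demo_clear_all_flags(inst);
}
```

Starting from an instance with the flag set, the old order never calls
the synchronize stand-in, while the new order calls it exactly once.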


^ permalink raw reply related

* [PATCH 9/8] memblock tests: add stubs required for free_reserved_area()
From: Mike Rapoport @ 2026-03-18 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

After moving the free_reserved_area() function to mm/memblock.c, the
memblock tests lack stubs for several functions and macros that this
function calls.

Add them.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 tools/include/linux/mm.h          |  1 +
 tools/testing/memblock/internal.h | 28 +++++++++++++++++++++++++---
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/tools/include/linux/mm.h b/tools/include/linux/mm.h
index 028f3faf46e7..4407d8396108 100644
--- a/tools/include/linux/mm.h
+++ b/tools/include/linux/mm.h
@@ -17,6 +17,7 @@
 
 #define __va(x) ((void *)((unsigned long)(x)))
 #define __pa(x) ((unsigned long)(x))
+#define __pa_symbol(x) ((unsigned long)(x))
 
 #define pfn_to_page(pfn) ((void *)((pfn) * PAGE_SIZE))
 
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index 009b97bbdd22..7ff61172ab24 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -11,9 +11,16 @@ static int memblock_debug = 1;
 
 #define pr_warn_ratelimited(fmt, ...)    printf(fmt, ##__VA_ARGS__)
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
 bool mirrored_kernelcore = false;
 
 struct page {};
+static inline void *page_address(struct page *page)
+{
+	BUG();
+	return page;
+}
 
 void memblock_free_pages(unsigned long pfn, unsigned int order)
 {
@@ -23,10 +30,25 @@ static inline void accept_memory(phys_addr_t start, unsigned long size)
 {
 }
 
-static inline unsigned long free_reserved_area(void *start, void *end,
-					       int poison, const char *s)
+unsigned long free_reserved_area(void *start, void *end, int poison, const char *s);
+void free_reserved_page(struct page *page);
+
+static inline bool deferred_pages_enabled(void)
+{
+	return false;
+}
+
+#define for_each_valid_pfn(pfn, start_pfn, end_pfn)			 \
+	for ((pfn) = (start_pfn); (pfn) < (end_pfn); (pfn)++)
+
+static inline void *kasan_reset_tag(const void *addr)
+{
+	return (void *)addr;
+}
+
+static inline bool __is_kernel(unsigned long addr)
 {
-	return 0;
+	return false;
 }
 
 #endif
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH v8 05/13] lib/bootconfig: drop redundant memset of xbc_nodes
From: Markus Elfring @ 2026-03-18 20:22 UTC (permalink / raw)
  To: Josh Law, linux-trace-kernel, Andrew Morton, Masami Hiramatsu
  Cc: LKML, Steven Rostedt
In-Reply-To: <20260318155919.78168-6-objecting@objecting.org>

> memblock_alloc() already returns zeroed memory,

Interesting …


>                                                 so the explicit memset
> in xbc_init() is redundant. …

Would you like to reconsider this conclusion by looking at the implementation
of the mentioned function in more detail?
https://elixir.bootlin.com/linux/v7.0-rc4/source/lib/bootconfig.c#L932-L998

Regards,
Markus


* Re: [PATCH v8 2/2] tools/bootconfig: fix fd leak in load_xbc_file() on fstat failure
From: Markus Elfring @ 2026-03-18 20:02 UTC (permalink / raw)
  To: Josh Law, linux-trace-kernel, Andrew Morton, Masami Hiramatsu
  Cc: LKML, Steven Rostedt
In-Reply-To: <20260318155847.78065-3-objecting@objecting.org>

> If fstat() fails after open() succeeds, the function returns without
> closing the file descriptor. Also preserve errno across close(), since
> close() may overwrite it before the error is returned.

I think this change description could be improved.

Was there anything preventing the use of a corresponding goto chain?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v7.0-rc4#n526
https://elixir.bootlin.com/linux/v7.0-rc4/source/tools/bootconfig/main.c#L155-L173
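
For illustration only, a goto-based cleanup of the kind the coding-style
document describes might look roughly like the sketch below. The helper
file_size_example() is hypothetical and is not the actual load_xbc_file()
code; it only shows a single exit label that closes the descriptor, with
the error code captured before close() can overwrite errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from tools/bootconfig/main.c):
 * report the size of the file at @path through @size.
 * Returns 0 on success or a negative errno value on failure.
 */
static int file_size_example(const char *path, off_t *size)
{
	struct stat st;
	int fd, ret = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;	/* nothing to clean up yet */

	if (fstat(fd, &st) < 0) {
		ret = -errno;	/* save before close() may change errno */
		goto out_close;
	}

	*size = st.st_size;

out_close:
	close(fd);		/* single place where the fd is released */
	return ret;
}
```

Because the error is stored in ret before jumping to the label, no explicit
errno save/restore around close() is needed in this variant.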

Regards,
Markus


* Re: [PATCH] Remove unused headers in x86/tools, scripts, pps, input
From: Nicolas Schier @ 2026-03-18 19:56 UTC (permalink / raw)
  To: Oli
  Cc: Thomas Gleixner, Ingo Molnar, Steven Rostedt, Mathieu Desnoyers,
	Masami Hiramatsu, Rodolfo Giometti, Henrik Rydberg,
	Dmitry Torokhov, Nathan Chancellor, linux-kernel,
	linux-trace-kernel, linux-kbuild, linux-input, x86
In-Reply-To: <CAOW84UxjnSDKSjsaaS9=DBquCk3SDfb74=OmkHTLUyq5qriYsA@mail.gmail.com>

Hi Oli,

thanks for your contribution.  Some comments below:

On Tue, Mar 10, 2026 at 10:01:41PM -0500, Oli wrote:
> From c78a0572f5ec2b927f9b723af687e6ef913561a4 Mon Sep 17 00:00:00 2001
> From: Eddie Hudgins <Oochiolio@gmail.com>
> Date: Tue, 10 Mar 2026 21:53:07 -0500
> Subject: [PATCH] Signed-off-by: Eddie Hudgins <Oochiolio@gmail.com>
>  arch/x86/tools: Removed headers in relocs_32.c scripts/basic: Removed
> headers
>  in fixdep.c drivers/pps: Removed headers in pps.c drivers/input: Removed
>  headers in input-mt.c

Usually, patch mails do not contain mail headers within their body; the
only possible exception is 'From:' if the sender is not the patch
author.  These additional headers prevent the usual patch application
(e.g. 'git am <mail').

> 
> These changes compile for x86, x86_64, and powerpc (Those were the only
> ones fairly tested) under defconfig. This aims to clean up code and
> simplify the files for developers. This will also contribute to start of
> decluttering the environment.

A commit subject should start with a subsystem identifier.  A commit
message should explain what the patch does and why, followed by a
'Signed-off-by'.  E.g.:

   kbuild: fixdep: Remove unused includes

   Remove unused #include statements for clean up.

   Signed-off-by: Your Name <your.e.mail@addre.ss>

(More complex changes require a more detailed commit message.)

Please check Documentation/process/submitting-patches.rst.

[...]
> diff --git a/scripts/basic/fixdep.c b/scripts/basic/fixdep.c
> index cdd5da7e009b..feb9e7d8984d 100644
> --- a/scripts/basic/fixdep.c
> +++ b/scripts/basic/fixdep.c
> @@ -89,7 +89,6 @@
>   *  but I don't think the added complexity is worth it)
>   */
> 
> -#include <sys/types.h>
>  #include <sys/stat.h>
>  #include <unistd.h>
>  #include <fcntl.h>
> --
> 2.43.0

The change in scripts/basic/fixdep.c looks good to me.  Would you like to
prepare a new kbuild-only patch for me to take via the kbuild tree?

Kind regards,
Nicolas


